Good practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German

Schmidt, Thomas

doi:10.1075/ijcl.21.3.05sch

Article published in: Compilation, transcription, markup and annotation of spoken corpora
Edited by John M. Kirk and Gisle Andersen
[International Journal of Corpus Linguistics 21:3] 2016
► pp. 396–418

Thomas Schmidt | Institut für Deutsche Sprache

Published online: 29 September 2016

https://doi.org/10.1075/ijcl.21.3.05sch

This paper presents practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German, a large collection of spontaneous verbal interaction from diverse discourse domains. After introducing the aims and organisational circumstances of the construction of FOLK, the general idea discussed is that good practices cannot be developed without considering methodological, technological and organisational aspects on equal footing. Starting from this idea, this paper inspects more closely some actual practices in FOLK, namely the handling of legal (especially privacy protection) issues, the decisions taken for the transcription and annotation workflow, and the question of how to best disseminate a corpus like FOLK. The final section sketches some possible future improvements for practices in FOLK.

Keywords: oral corpora, corpus interface, transcription, spoken language

Cited by (29)

Betz, Emma & Alexandra Gubina

2025. On stance-taking with one-sided vs. two-sided shoulder lifts in German talk-in-interaction. Frontiers in Psychology 16

Bridwell, Keiko & Katherine Ireland

2025. Do You Reckon? In Data-Intensive Investigations of English, ► pp. 48 ff.

Frick, Elena & Thomas Schmidt

2025. Querying spoken language data. In Harmonizing language data, ► pp. 339 ff.

Gubina, Alexandra

2025. Countering Prior Interactional Conduct with Responsive doch in German Talk-in-Interaction. Research on Language and Social Interaction 58:1 ► pp. 85 ff.

Gubina, Alexandra

2025. Structurally ‘incomplete’ social action formats in the grammar of talk-in-interaction? In Grammar in Action [Studies in Language and Social Interaction, 37], ► pp. 116 ff.

Schubert, Mojenn

2025. Pointing at others as an embodied connection device: linking back to prior talk in multi-party interaction. Discourse Processes 62:4 ► pp. 257 ff.

Deppermann, Arnulf, Alexandra Gubina, Katharina König & Martin Pfeiffer

2024. Request for confirmation sequences in German. Open Linguistics 10:1

Gubina, Alexandra & Arnulf Deppermann

2024. Rejecting the validity of inferred attributions of incompetence in German talk-in-interaction. Journal of Pragmatics 221 ► pp. 150 ff.

Hashimoto, Brett & Kyra Nelson

2024. Recent trends in corpus design and reporting: A methodological synthesis. Research in Corpus Linguistics 12:1 ► pp. 59 ff.

Yu, Guodong, Yaxin Wu, Paul Drew & Chase Wesley Raymond

2024. The DIG Mandarin Conversations (DMC) Corpus. Chinese Language and Discourse. An International and Interdisciplinary Journal 15:1 ► pp. 105 ff.

Hanks, Elizabeth

2023. Review of Love (2020): Overcoming challenges in corpus construction: The Spoken British National Corpus 2014. Register Studies 5:1 ► pp. 136 ff.

Helmer, Henrike

2023. Ad-hoc-compounds in spoken German. Interactional Linguistics 3:1-2 ► pp. 67 ff.

Hirschmann, Hagen & Thomas Schmidt

2022. Gesprochene Lernerkorpora: Methodisch-technische Aspekte der Erhebung, Erschließung und Nutzung. Zeitschrift für germanistische Linguistik 50:1 ► pp. 36 ff.

Love, Robbie, Claire Dembry, Andrew Hardie, Vaclav Brezina & Tony McEnery

2022. The Spoken BNC2014. International Journal of Corpus Linguistics ► pp. 319 ff.

Stratton, James M.

2022. Tapping into German Adjective Variation: A Variationist Sociolinguistic Approach. Journal of Germanic Linguistics 34:1 ► pp. 63 ff.

Deppermann, Arnulf & Alexandra Gubina

2021. Positionally-sensitive action-ascription. Interactional Linguistics 1:2 ► pp. 183 ff.

Deppermann, Arnulf & Alexandra Gubina

2025. Coding actions in social interaction: Potentials and problems. Research on Language and Social Interaction 58:3 ► pp. 258 ff.

Gubina, Alexandra & Emma Betz

2021. What Do Newsmark-Type Responses Invite? The Response Space After German echt. Research on Language and Social Interaction 54:4 ► pp. 374 ff.

Gubina, Alexandra & Emma Betz

2025. Responding to new information with negative discourse particles nein/nee/nö in German talk-in-interaction. Journal of Pragmatics 250 ► pp. 174 ff.

Knight, Dawn, Steve Morris, Laura Arman, Jennifer Needs & Mair Rees

2021. Processing and (Re)presenting Corpora. In Building a National Corpus, ► pp. 105 ff.

Põldvere, Nele, Johan Frid, Victoria Johansson & Carita Paradis

2021. Challenges of releasing audio material for spoken data: The case of the London-Lund Corpus 2. Research in Corpus Linguistics 9:1 ► pp. 35 ff.

Põldvere, Nele, Victoria Johansson & Carita Paradis

2021. On The London–Lund Corpus 2: design, challenges and innovations. English Language and Linguistics 25:3 ► pp. 459 ff.

Saccone, Valentina & Chiara Trombetta

2021. Parenthetical Units and Structures in Italian and German spoken language: Prosodic and textual analysis. CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos 8 ► pp. 1 ff.

Chen, Yu-Hua & Radovan Bruncak

2020. Transcribear – Introducing a secure online transcription and annotation tool. Digital Scholarship in the Humanities 35:2 ► pp. 265 ff.

Ghyselen, Anne-Sophie, Anne Breitbarth, Melissa Farasyn, Jacques Van Keymeulen & Arjan van Hessen

2020. Clearing the Transcription Hurdle in Dialect Corpus Building: The Corpus of Southern Dutch Dialects as Case Study. Frontiers in Artificial Intelligence 3

Deppermann, Arnulf & Elwys De Stefani

2019. Defining in talk-in-interaction: Recipient-design through negative definitional components. Journal of Pragmatics 140 ► pp. 140 ff.

Batinić, Dolores & Thomas Schmidt

2018. Reconstruction of Separable Particle Verbs in a Corpus of Spoken German. In Language Technologies for the Challenges of the Digital Age [Lecture Notes in Computer Science, 10713], ► pp. 3 ff.

Meliss, Meike, Christine Möhrs & Maria Ribeiro Silveira

2018. Erwartungen an eine korpusbasierte lexikografische Ressource zur ‚Lexik des gesprochenen Deutsch in der Interaktion‘: Ergebnisse aus zwei empirischen Studien. Zeitschrift für Angewandte Linguistik 2018:68 ► pp. 103 ff.

Meliss, Meike, Christine Möhrs & Maria Ribeiro Silveira

2019. Anforderungen und Erwartungen an eine lexikografische Ressource des gesprochenen Deutsch aus der L2-Lernerperspektive. Lexicographica 34:2018 ► pp. 89 ff.

This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.