A value-sensitive metadata schema for interpreting corpora: Implementation on the Unified Interpreting Corpus (UNIC) platform

Liu, Nannan; Russo, Mariachiara

doi:10.1075/intp.00123.liu

Article published in: Interpreting
Vol. 27:2 (2025), pp. 157–196

Nannan Liu | University of Bologna

Mariachiara Russo | University of Bologna

Published online: 29 August 2025

https://doi.org/10.1075/intp.00123.liu

Abstract

Interpreting corpora serve as the descriptive foundation of research and the ‘ground truth’ against which machine interpreting technologies are evaluated. However, access to corpora remains a critical bottleneck in interpreting studies, owing to the challenges of data collection and processing and the absence of interpreting- and translation-specific corpus publication venues. In this article, we present two technical infrastructures that facilitate corpus access: a metadata schema that standardises corpus description, and the Unified Interpreting Corpus (UNIC) platform for data and metadata search and publication. Guided by the internationally established FAIR (findability, accessibility, interoperability and reusability) and CARE (collective benefit, authority to control, responsibility and ethics) principles for scientific data management and stewardship, we designed the infrastructures on the basis of a review of 125 spoken and signed language interpreting corpora, relevant international standards and community knowledge, and built them with open-source technologies. Feedback obtained from interpreting students, researchers and interpreters demonstrates greater perceived usefulness of, and satisfaction with, UNIC compared to general-purpose search portals. Overall, we illustrate a value- and consensus-driven path towards optimising the use of existing interpreting corpora and carefully curating new ones, a path that avoids duplication of effort, helps to chart research directions and fosters co-design with communities.

Keywords: interpreting corpora, metadata, research infrastructures, value-sensitive design, co-design

Article outline

1. Introduction
2. Conceptual, technical and empirical literature on corpus management
- 2.1 Conceptual foundations
- 2.2 Technical infrastructures
- 2.3 Empirical basis
3. Value-sensitive design (VSD)
4. Methods
- 4.1 Corpus collection
- 4.2 Metadata FAIRness assessments
- 4.3 Designing the metadata schema and UNIC
- 4.4 Stakeholder feedback collection
5. Findings
- 5.1 Stakeholder personas
- 5.2 Interpreting corpora
- 5.3 Results of metadata FAIR assessments
  - 5.3.1 Automatic evaluation
  - 5.3.2 Overlap and gap analyses
- 5.4 The metadata schema and UNIC
- 5.5 Stakeholder feedback
6. Conclusions and future directions
Acknowledgments
Notes
References
