Article published in: Interpreting
Vol. 27:2 (2025), pp. 157–196
A value-sensitive metadata schema for interpreting corpora
Implementation on the Unified Interpreting Corpus (UNIC) platform
Published online: 29 August 2025
https://doi.org/10.1075/intp.00123.liu
Abstract
Interpreting corpora serve as the descriptive foundation of research and the ‘ground truth’ against which machine
interpreting technologies are evaluated. However, access to corpora remains a critical bottleneck in interpreting studies due to
data collection and processing challenges and the absence of interpreting- and translation-specific corpus publication venues. In
this article, we present two technical infrastructures that facilitate corpus access: a metadata schema which standardises corpus
description and the Unified Interpreting Corpus (UNIC) platform for data and metadata search and publication. Guided by the
internationally established FAIR (findability, accessibility, interoperability and reusability) and CARE (collective benefit,
authority to control, responsibility and ethics) principles for scientific data management and stewardship, we designed the
infrastructures on the basis of a review of 125 spoken and signed language interpreting corpora, relevant international standards and
community knowledge, and implemented them with open-source technologies. Feedback obtained from interpreting students, researchers and
interpreters demonstrates greater perceived usefulness of and satisfaction with UNIC compared to general-purpose search portals.
Overall, we illustrate a value- and consensus-driven path towards optimising the use of interpreting corpora and the careful
curation of new ones, which avoids the duplication of effort, helps to chart research directions and fosters co-design with
communities.
Article outline
- 1. Introduction
- 2. Conceptual, technical and empirical literature on corpus management
  - 2.1 Conceptual foundations
  - 2.2 Technical infrastructures
  - 2.3 Empirical basis
- 3. Value-sensitive design (VSD)
- 4. Methods
  - 4.1 Corpus collection
  - 4.2 Metadata FAIRness assessments
  - 4.3 Designing the metadata schema and UNIC
  - 4.4 Stakeholder feedback collection
- 5. Findings
  - 5.1 Stakeholder personas
  - 5.2 Interpreting corpora
  - 5.3 Results of metadata FAIR assessments
    - 5.3.1 Automatic evaluation
    - 5.3.2 Overlap and gap analyses
  - 5.4 The metadata schema and UNIC
  - 5.5 Stakeholder feedback
- 6. Conclusions and future directions
- Acknowledgments
- Notes
