In: Computational Phraseology
Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 135–150
Multiword expressions in comparable corpora
Published online: 8 May 2020
https://doi.org/10.1075/ivitra.24.07dur
Abstract
On the basis of Aranea Gigaword Web corpora, a family of
comparable corpora intended for use in contrastive linguistic research,
multilingual lexicography, language teaching and translation studies, we
discuss the pros and cons of comparable corpora in contrast to monolingual
and parallel corpora for the analysis of multiword entities (MWEs). We
demonstrate that by using large corpora for two or more languages,
consisting of unrelated texts, yet created in a comparable manner, parallel
language structures and phenomena like MWEs can be identified if the
appropriate tools are employed. With the Aranea corpora, the “bilingual
sketch” functionality of the Sketch Engine is one such tool which provides a
new approach for analyses of similarities of (or differences between)
collocation profiles (word sketches) for words and their translation
equivalents.
Article outline
- 1. Comparable corpora: A brief survey
- 2. Aranea comparable corpora
- 2.1 Methodology
- 2.2 Available corpora
- 2.3 Access to CC
- 3. Multi-word expressions in comparable corpora
- 3.1 Competition between monolingual and comparable corpora
- 3.1.1 Intralingual sketch in monolingual vs. comparable corpus
- 3.1.2 Interlingual sketch from monolingual corpora
- 3.1.3 Intralingual sketch difference and collocational equivalent
- 3.2 Data mining in comparable corpora
- 3.2.1 Intralingual sketches in different varieties of English corpora
- 3.2.2 Interlingual sketches in comparable corpora
- 3.2.2.1 Collocational preferences
- 3.2.2.2 Collocational compatibility
- 3.2.2.3 Collocational behaviour of MWEs
- 4. Conclusion
- Notes
- References
- Internet links (Last accessed on June 11, 2019)
References (17)
Barrón-Cedeño, A., España-Bonet, C., Boldoba, J., & Màrquez, L. (2015). A Factory of Comparable Corpora from Wikipedia. In P. Zweigenbaum, S. Sharoff, & R. Rapp (Eds.), Proceedings of the Eighth Workshop on Building and Using Comparable Corpora (pp. 3–13). Stroudsburg: The Association for Computational Linguistics. [URL] (Accessed: 2018–06–11).
Benko, V. (2013). Data Deduplication in Slovak Corpora. In K. Gajdošová, & A. Žáková (Eds.), Slovko 2013: Natural Language Processing, Corpus Linguistics, E-learning (pp. 27–39). Lüdenscheid: RAM-Verlag.
Benko, V. (2014a). Aranea: Yet Another Family of (Comparable) Web Corpora. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, Speech and Dialogue. 17th International Conference, TSD 2014, Brno, Czech Republic, September 8–12, 2014 (pp. 257–264). Springer International Publishing Switzerland. ISBN: 978-3-319-10815-5 (Print), 978-3-319-10816-2 (Online).
Benko, V. (2014b). Compatible Sketch Grammars for Comparable Corpora. In A. Abel, C. Vettori, & N. Ralli (Eds.), Proceedings of the XVI EURALEX International Congress: The User in Focus 15–19 July 2014 (pp. 15–19). Bolzano/Bozen: Eurac Research. ISBN: 978-88-88906-97-3.
Benko, V. (2016). Two Years of Aranea: Increasing Counts and Tuning the Pipeline. In N. Calzolari et al. (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4245–4248). Portorož: European Language Resources Association (ELRA). [URL] (Accessed: 2018–06–11).
Benko, V., & Ďurčo, P. (2015). Aranea. Comparable Gigaword Web Corpora. In G. Corpas Pastor, R. Mitkov, J. Monti, & V. Seretan (Eds.), Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT2015) (2nd edition) (pp. 40–42). LEXYTRAD, Research Group in Lexicography and Translation. [URL] (Accessed: 2018–06–11).
Maia, B. (2003). What are comparable corpora. In Proceedings of the Corpus Linguistics workshop on Multilingual Corpora: Linguistic requirements and technical perspectives. [URL] (Accessed: 2018–06–11).
Mendoza Rivera, O., Mitkov, R., & Corpas Pastor, G. (2013). A Flexible Framework for Collocation Retrieval and Translation from Parallel and Comparable Corpora. In J. Monti, R. Mitkov, G. Corpas Pastor, & V. Seretan (Eds.), Workshop Proceedings for: Multi-word Units in Machine Translation and Translation Technologies (Organised at the 14th Machine Translation Summit 2013) (pp. 18–25). Allschwil: The European Association for Machine Translation. [URL] (Accessed: 2018–06–11).
Sharoff, S., Rapp, R., & Zweigenbaum, P. (2016). Overviewing Important Aspects of the Last Twenty Years of Research in Comparable Corpora. In BUCC, 9th Workshop on Building and Using Comparable Corpora. Co-located with LREC 2016 Portorož (Slovenia) 23 May 2016. [URL] (Accessed: 2018–06–11).
Smith, J. R., Quirk, C., & Toutanova, K. (2010). Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL (pp. 403–411). Los Angeles, CA: Association for Computational Linguistics. [URL] (Accessed: 2018–06–11).
Aranea. A Family of Comparable Gigaword Web Corpora. (n.d.) Retrieved from [URL]
BUCC, 9th Workshop on Building and Using Comparable Corpora. (Last modified on April 23, 2016). Retrieved from [URL]
Comparable Corpora. (n.d.) Examples for Conjunction Section. Retrieved from [URL]
Comparable Corpora. (n.d.) Retrieved from [URL]
Comparable Corpora. (n.d.) Examples for Conjunction Section. Retrieved from [URL]
Learning Bilingual Dictionaries from Comparable. (Last modified on September 13, 2017). Retrieved from [URL]
Wikipedia Comparable Corpora. (n.d.) Retrieved from [URL]
