Article published in: Graded Resources for Second and Foreign Language Learning
Edited by David Alfter and Thomas François
[ITL - International Journal of Applied Linguistics 175:1] 2024
pp. 77–102
PolylexFLE
A MWE database for French L2 language learners
Published online: 5 April 2024
https://doi.org/10.1075/itl.22031.tod
Abstract
Knowledge of multiword expressions (MWEs) is key to learning a foreign language, but teaching them remains hindered by the lack
of lists of expressions tied to pedagogical aims. In this paper, we present an extended version of the PolylexFLE database,
containing 4,525 French MWEs of three types: idioms, collocations, and fixed expressions. In order to propose
exercises that follow the difficulty scale of the Common European Framework of Reference for Languages (CEFR), we used a mixed approach
(manual and automatic) to annotate 1,186 expressions with CEFR levels. The paper focuses mostly on the automatic
procedure, which first identifies the expressions from the PolylexFLE database (and their variants) in a corpus of pedagogical texts
(with CEFR labels) using a pattern-based system. In a second step, the distribution of each expression across the corpus is estimated and transformed
into a single CEFR level. Finally, the proposed automatic approach is evaluated by 52 learners of French as a foreign language.
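The two-step automatic procedure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus, the regular expression for "prendre une décision", the function names, and the threshold rule that turns a distribution into one level. It is not the PolyExtractor system or the paper's actual mapping, neither of which is specified in this abstract.

```python
import re
from collections import Counter

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Hypothetical toy corpus: (CEFR label, text) pairs standing in for
# the labelled pedagogical texts mentioned in the abstract.
corpus = [
    ("A2", "Il fait beau aujourd'hui, alors nous allons au parc."),
    ("B1", "Elle a pris la décision de partir."),
    ("B1", "Ils prennent souvent une décision trop vite."),
    ("B2", "Il a fini par prendre cette décision difficile."),
]

# Hypothetical pattern for "prendre une décision" and a few variants:
# an inflected verb form, at most one inserted word, a determiner, the noun.
VERB_FORMS = ["prendre", "prends", "prend", "prenons", "prenez", "prennent", "pris"]
pattern = re.compile(
    r"\b(?:" + "|".join(VERB_FORMS) + r")\b"
    r"(?:\s+\w+)?"                 # optional inserted word, e.g. an adverb
    r"\s+(?:une|la|cette)"         # determiner variants
    r"\s+décisions?\b",
    re.IGNORECASE,
)


def level_distribution(pattern, corpus):
    """Step 1: count pattern matches per CEFR level and normalise to shares."""
    counts = Counter()
    for level, text in corpus:
        counts[level] += len(pattern.findall(text))
    total = sum(counts.values())
    return {lvl: (counts[lvl] / total if total else 0.0) for lvl in CEFR_LEVELS}


def to_single_level(distribution, threshold=0.2):
    """Step 2 (assumed rule): return the lowest level whose share of
    occurrences reaches the threshold, or None if no level does."""
    for lvl in CEFR_LEVELS:
        if distribution[lvl] >= threshold:
            return lvl
    return None


distribution = level_distribution(pattern, corpus)
assigned_level = to_single_level(distribution)
print(distribution)
print(assigned_level)
```

In this toy setting the expression occurs twice in B1 texts and once in a B2 text, so the assumed rule assigns B1. Other mappings, such as the level of first occurrence or a weighted average over levels, would fit the same interface.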
Article outline
- 1. Introduction
- 2. Related work
- 3. MWE: Definitions and classification criteria
- 4. Building the PolylexFLE database
- 4.1 Data collection
- 4.2 Linguistic information
- 5. Identification of CEFR level
- 5.1 The manual approach
- 5.2 Corpus description
- 5.3 MWE extraction: PolyExtractor
- 5.3.1 Method description
- 5.3.2 Evaluation of PolyExtractor
- 5.3.2.1 Reference corpus manually annotated with MWE
- 5.3.2.2 Results of PolyExtractor
- 5.4 From distribution to a single CEFR level
- 6. Evaluating the quality of the CEFR annotation
- 7. Conclusion and further work
- Notes
