In: Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 281–298
Normalization of shorthand forms in French text messages using word embedding and machine translation
Published online: 20 March 2019
https://doi.org/10.1075/scl.90.17gho
This chapter focuses on the normalization of abbreviations and shorthand forms used in French text messages. These forms are difficult to normalize because most of them cannot be resolved by typical spell checkers or dictionary lookups. We first aligned normalized and non-normalized French text messages to build a parallel corpus. We then applied two popular approaches to text normalization: multilingual word embeddings and character-based machine translation. We compare the results and assess how well each model normalizes deletions, substitutions, repetitions, swaps, and insertions made to canonical forms. This is the first study to use Multivec and the Belgian SMS corpus collected under the SMS4Science project. The unsupervised machine learning approach makes the system highly flexible and easily adaptable, and provides a domain-independent method of text normalization.
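As a rough illustration of what character-based machine translation involves on the data side, the sketch below splits a message into space-separated characters so that a phrase-based system can treat each character as a token. It is a minimal sketch and not the chapter's actual preprocessing: the function names, the `_` word-boundary marker, and the example message pair are all assumptions made for illustration.

```python
# Hypothetical helpers: prepare text for a character-level translation system.
# The "_" boundary marker is an assumption; any symbol absent from the data works.

def to_char_tokens(text, boundary="_"):
    """Turn a message into space-separated characters, marking word boundaries."""
    return " ".join(boundary if ch == " " else ch for ch in text.strip())


def from_char_tokens(tokens, boundary="_"):
    """Invert to_char_tokens: join characters and restore spaces between words."""
    return "".join(" " if tok == boundary else tok for tok in tokens.split(" "))


if __name__ == "__main__":
    # Illustrative pair, not taken from the SMS4Science corpus.
    source = "slt cv"        # shorthand form
    target = "salut ça va"   # normalized form
    print(to_char_tokens(source))
    print(to_char_tokens(target))
```

Each side of the parallel corpus would be converted this way before training, and the system's character-level output would be passed through `from_char_tokens` to recover ordinary text. The round trip is only lossless if the boundary symbol never occurs in the messages themselves.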
Article outline
- 1. Introduction
- 2. Previous work
- 3. Corpus and preprocessing
  - 3.1 Corpus
  - 3.2 Preprocessing
- 4. Methodologies, tools and experiments
  - 4.1 Methodologies
  - 4.2 Tools and experiments
    - Multivec
    - Moses
- 5. Results analysis
- 6. Conclusion
- 7. Future work
- Acknowledgment
- Notes
- References
References (22)
Beaufort, Richard, Roekhaut, Sophie, Cougnon, Louise-Amélie & Fairon, Cédrick. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 770–779.
Bérard, Alexandre, Servan, Christophe, Pietquin, Olivier & Besacier, Laurent. 2016. Multivec: A multilingual and multilevel representation learning toolkit for NLP. The 10th edition of the Language Resources and Evaluation Conference, 4188–4192.
Bird, Steven, Loper, Edward & Klein, Ewan. 2009. Natural Language Processing with Python. San Francisco CA: O’Reilly Media.
Bojanowski, Piotr, Grave, Edouard, Joulin, Armand & Mikolov, Tomas. 2016. Enriching word vectors with subword information. <[URL]> (13 May 2017).
Choudhury, Monojit, Saraf, Rahul, Jain, Vijit, Sarkar, Sudeshna & Basu, Anupam. 2007. Investigation and modeling of the structure of texting language. International Journal on Document Analysis and Recognition 10(3): 157–174.
De Clercq, Orphée, Schulz, Sarah, Desmet, Bart, Lefever, Els & Hoste, Véronique. 2013. Normalization of Dutch user-generated content. Proceedings of 9th International Conference on Recent Advances in Natural Language Processing, 179–188. Berlin: Springer.
Fairon, Cédrick, Klein, Jean R. & Paumier, Sébastien. 2007. Le langage SMS: étude d'un corpus informatisé à partir de l'enquête ‘Faites don de vos SMS à la science’. Louvain-la-Neuve: Presses universitaires de Louvain.
Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. Studies in Linguistic Analysis, 1–32. Oxford: Blackwell.
Jurafsky, Daniel & Martin, James H. 2014. Speech and Language Processing. Englewood Cliffs NJ: Prentice Hall.
Kobus, Catherine, Yvon, François & Damnati, Géraldine. 2008. Normalizing SMS: Are two metaphors better than one? Proceedings of the 22nd International Conference on Computational Linguistics 1, 441–448.
Koch, Peter & Oesterreicher, Wulf. 2001. Gesprochene und geschriebene Sprache. Französisch, Italienisch, Spanisch. Berlin: De Gruyter.
Koehn, Philipp, Hoang, Hieu, Birch, Alexandra, Callison-Burch, Chris, Federico, Marcello, Bertoldi, Nicola, Cowan, Brooke, Shen, Wade, Moran, Christine, Zens, Richard, Dyer, Chris, Bojar, Ondřej, Constantin, Alexandra & Herbst, Evan. 2007. Moses: Open source toolkit for statistical machine translation. Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, 177–180.
Li, Chen & Liu, Yang. 2012. Normalization of text messages using character- & phone-based machine translation approaches. Proceedings of 13th Annual Conference of the International Speech Communication Association, 2330–2333.
Mikolov, Tomas, Chen, Kai, Corrado, Greg & Dean, Jeffrey. 2013a. Efficient estimation of word representations in vector space. The Workshop Proceedings of the International Conference on Learning Representations. <[URL]> (13 May 2017).
Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg & Dean, Jeffrey. 2013b. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. <[URL]> (13 May 2017).
Och, Franz Josef & Ney, Hermann. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1): 19–51.
Pennell, Deana L. & Liu, Yang. 2011. A character-level machine translation approach for normalization of SMS abbreviations. Proceedings of International Joint Conference on Natural Language Processing (IJCNLP): 974–982.
Rong, Xin. 2014. word2vec parameter learning explained. <[URL]> (13 May 2017).
sms4science project. 2004. <[URL]> (13 May 2017).
Sridhar, V. K. R. 2015. Unsupervised text normalization using distributed representations of words and phrases. Proceedings of the 2015 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL): 8–16.
