Article published in: Revista Española de Lingüística Aplicada/Spanish Journal of Applied Linguistics
Vol. 33:1 (2020), pp. 140–170
Algoritmos fonéticos para la detección de palabras fonéticamente similares en el español del centro de México
Article language: Spanish
Published online: 21 August 2020
https://doi.org/10.1075/resla.18002.her
Resumen
En la actualidad, la detección de palabras fonéticamente similares se ha logrado de forma exitosa gracias a la
utilización de algoritmos fonéticos. Sin embargo, tales algoritmos dependen del lenguaje al que pertenecen, por lo que
generalmente no están optimizados para el español. Por esta razón, en el siguiente artículo se presentará el algoritmo PFS y su
variante PFS-US, los cuales son algoritmos fonéticos que consideran la fonología del español hablado en el centro de México, y
fueron diseñados para detectar palabras fonéticamente similares en grandes conjuntos de palabras. Ahora bien, a través de un
análisis comparativo entre otros cuatro algoritmos fonéticos de estado del arte, analizaremos la consideración fonológica
mencionada. Para ello, se definieron métricas independientes de la lengua para evaluar algoritmos fonéticos en general. Dichas
métricas se basan en la estructura de los grupos de palabras fonéticamente similares entre sí y su relación con palabras que no
son similares con ninguna otra. Adicionalmente, los recursos generados se comparten de forma libre para su uso y análisis.
Palabras clave: algoritmos fonéticos, español de México, Soundex, palabras fonéticamente similares, algoritmo PFS
Abstract
Phonetic algorithms for detection of phonetically similar words in Central-Mexico Spanish
Phonetically similar words can now be detected successfully thanks to phonetic algorithms. However, such algorithms
depend on the language for which they were designed, so they are generally not optimized for Spanish. For this reason, this
article presents the PFS algorithm and its variant PFS-US, phonetic algorithms that take into account the phonology of the Spanish
spoken in Central Mexico and were designed to detect phonetically similar words in large sets of words. Through a comparative
analysis against four other state-of-the-art phonetic algorithms, we examine the effect of this phonological consideration. To
that end, we defined language-independent metrics for evaluating phonetic algorithms in general. These metrics are based on the
structure of the groups of mutually phonetically similar words and their relation to words that are similar to no other word.
Additionally, the resources generated are shared freely for use and analysis.
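As a rough illustration of the general idea behind phonetic algorithms (not of PFS or PFS-US, whose rules are not given on this page), the sketch below assigns each word a code with classic American Soundex, one of the algorithms listed in the evaluation section. It then separates groups of words sharing a code from words that stand alone. The function names `soundex` and `group_by_code` and the sample word list are assumptions made for this example.

```python
from collections import defaultdict

# Classic American Soundex digit classes. These are English-oriented and are
# NOT the PFS/PFS-US rules; they serve only as a stand-in phonetic code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a word (ASCII letters only)."""
    # Limitation of this sketch: accented letters and 'ñ' are simply dropped.
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def group_by_code(words):
    """Split words into groups sharing a code and words with a unique code."""
    groups = defaultdict(list)
    for w in words:
        groups[soundex(w)].append(w)
    similar = {k: v for k, v in groups.items() if len(v) > 1}
    singletons = [v[0] for v in groups.values() if len(v) == 1]
    return similar, singletons


similar, singletons = group_by_code(["casa", "caza", "vaca", "baca", "mesa"])
print(similar)     # {'C200': ['casa', 'caza']}
print(singletons)  # ['vaca', 'baca', 'mesa']
```

In this toy run "casa" and "caza" share a code, while "vaca" and "baca", which are homophones in Spanish, do not, because Soundex always keeps the first letter. This is the kind of language dependence the abstract refers to.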
Article outline
- 1. Introduction
- 1.2 Phonetic algorithms
- 1.3 Central Mexican Spanish
- 1.4 The grapheme <<x>>
- 2. Objectives
- 3. Development of the PFS and PFS-US algorithms
- 3.1 Pre-transcription
- 3.2 Phonetic transcription
- 3.3 PFS algorithm
- 3.4 PFS-US algorithm
- 4. Evaluation
- 4.1 Soundex
- 4.2 NYSIIS
- 4.3 Phonix
- 4.4 Double Metaphone
- 4.5 Phonometric algorithms
- 4.6 The corpus
- 4.7 Characterization of algorithms and experiments
- 4.8 Formal definition of distortion
- 4.8.1 Group distortion (dG(a))
- 4.8.2 Distortion of individual words (d1(a))
- 4.9 Performance index
- 4.10 Characterization metrics
- 4.10.1 Average group size
- 4.10.2 Maximum group length
- 4.10.3 Max-min group difference
- 4.10.4 Character richness
- 4.10.5 Precision of individual words
- 4.11 Discussion of the identified groups
- 4.12 Compact groups
- 4.13 Effect of pre-transcription
- 4.14 Errors in the PFS algorithms
- 5. Conclusions
- Acknowledgments
- Notes
- References
