José Carlos Rosales Núñez


Phonetic Normalization for Machine Translation of User Generated Content
José Carlos Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We present an approach to correcting noisy User Generated Content (UGC) in French, with the aim of producing a preprocessing pipeline that improves Machine Translation for this kind of non-canonical corpus. To do so, we have implemented a character-based neural phonetizer that produces IPA pronunciations of words. In this way, we intend to correct the grammar, vocabulary and accentuation errors often present in noisy UGC corpora. Our method leverages the fact that some errors are due to confusion between words with similar pronunciation, which can be corrected by using a phonetic look-up table to produce normalization candidates. These potential corrections are then encoded in a lattice and ranked using a language model to output the most probable corrected phrase. Compared to using other phonetizers, our method boosts a transformer-based machine translation system on UGC.

Comparison between NMT and PBSMT Performance for Translating Noisy User-Generated Content
José Carlos Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This work compares the performance achieved by Phrase-Based Statistical Machine Translation (PBSMT) systems and attention-based Neural Machine Translation (NMT) systems when translating User Generated Content (UGC), as encountered on social media, from French to English. We show that, contrary to what could be expected, PBSMT outperforms NMT when translating non-canonical inputs. Our error analysis uncovers the specificities of UGC that are problematic for sequential NMT architectures and suggests new avenues for improving NMT models.


Analyse morpho-syntaxique en présence d’alternance codique (PoS tagging of Code Switching)
José Carlos Rosales Núñez | Guillaume Wisniewski
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Code switching is the phenomenon of alternating between languages within a single conversation or a single sentence. With the growing volume of user-generated text, this essentially oral phenomenon is increasingly found in written texts, which requires adapting natural language processing tasks and models to this new kind of utterance. This work presents the collection and part-of-speech annotation of a corpus of utterances containing code switching, and evaluates their impact on the PoS tagging task.