Ngoc Tan Le


2018

Low-Resource Machine Transliteration Using Recurrent Neural Networks of Asian Languages
Ngoc Tan Le | Fatiha Sadat
Proceedings of the Seventh Named Entities Workshop

Grapheme-to-phoneme models are key components in automatic speech recognition and text-to-speech systems. They are particularly useful for low-resource language pairs that lack well-developed pronunciation lexicons. These models are based on initial alignments between grapheme source and phoneme target sequences. Inspired by sequence-to-sequence recurrent neural network-based translation methods, the current research presents an approach that applies an alignment representation for input sequences, together with pre-trained source and target embeddings, to address the transliteration problem for a low-resource language pair. We participated in the NEWS 2018 shared task on English-Vietnamese transliteration.

Improving the neural network-based machine transliteration for low-resourced language pair
Ngoc Tan Le | Fatiha Sadat
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

2016

UQAM-NTL: Named entity recognition in Twitter messages
Ngoc Tan Le | Fatma Mallek | Fatiha Sadat
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

This paper describes the system we submitted to the Named Entity Recognition (NER) in Twitter shared task of the 2nd Workshop on Noisy User-generated Text (WNUT), held in conjunction with COLING 2016. Our system is based on supervised machine learning and applies Conditional Random Fields (CRF) to train two classifiers, one for each evaluation. The first evaluation aims at predicting the 10 fine-grained types of named entities, while the second aims at identifying named entities without predicting their type. The experimental results show that our method significantly improves Twitter NER performance.

2015

Building a Bilingual Vietnamese-French Named Entity Annotated Corpus through Cross-Linguistic Projection
Ngoc Tan Le | Fatiha Sadat
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

The creation of high-quality named entity annotated resources is a time-consuming and expensive process. Most gold standard corpora are available for English but not for less-resourced languages such as Vietnamese. For Asian languages, this task remains problematic. This paper focuses on the automatic construction of named entity annotated corpora for Vietnamese-French, a less-resourced language pair. We incrementally apply different cross-projection methods using parallel corpora, such as perfect string matching and edit distance similarity. Evaluations on the Vietnamese-French language pair show good accuracy (F-score of 94.90%) when identifying named entity pairs and building a named entity annotated parallel corpus.
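
The two projection methods named in the abstract, perfect string matching and edit distance similarity, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the length-normalized similarity, the case-insensitive comparison, and the threshold value are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Length-normalized similarity in [0, 1] (an assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def project_entity(entity: str,
                   candidates: Sequence[str],
                   threshold: float = 0.8) -> Optional[str]:
    """Project a source-side entity onto target-side candidates.

    Step 1: perfect (case-insensitive) string match.
    Step 2: otherwise, the most similar candidate at or above `threshold`.
    Returns None when neither step finds a match.
    """
    lowered = entity.lower()
    for cand in candidates:
        if cand.lower() == lowered:
            return cand
    best, best_score = None, threshold
    for cand in candidates:
        score = similarity(lowered, cand.lower())
        if score >= best_score:
            best, best_score = cand, score
    return best
```

Here, `candidates` stands for the token spans of the aligned target sentence in a parallel corpus. Applying the exact match first and falling back to the similarity score mirrors the incremental order described in the abstract.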