Dieu-Thu Le


Joint learning of frequency and word embeddings for multilingual readability assessment
Dieu-Thu Le | Cam-Tu Nguyen | Xiaoliang Wang
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

This paper describes two models that employ word frequency embeddings to address readability assessment in multiple languages. The task is to determine the difficulty level of a given document, i.e., how hard it is for a reader to fully comprehend the text. The proposed models show how frequency information can be integrated to improve readability assessment. Experimental results on both English and Chinese datasets show that the proposed models notably outperform those using only traditional word embeddings.


Construction and Analysis of a Large Vietnamese Text Corpus
Dieu-Thu Le | Uwe Quasthoff
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a new Vietnamese text corpus containing around 4.05 billion words. It is a collection of Wikipedia texts, newspaper articles and random web texts. The paper describes the process of collecting, cleaning and creating the corpus. Processing Vietnamese texts poses several challenges: for example, unlike many Latin-script languages, Vietnamese does not use blanks to separate words, so common tokenization approaches that treat blanks as word boundaries do not work. A short review of different approaches to Vietnamese tokenization is presented, together with how the corpus has been processed and created. Some statistical analysis of the data is then reported, including the number of syllables, average word length, sentence length and topic analysis. The corpus is integrated into a framework that allows searching and browsing. Using this web interface, users can find out how many times a particular word appears in the corpus, see sample sentences in which the word occurs, and view its left and right neighbors.

Towards a text analysis system for political debates
Dieu-Thu Le | Ngoc Thang Vu | Andre Blessing
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities


TUHOI: Trento Universal Human Object Interaction Dataset
Dieu-Thu Le | Jasper Uijlings | Raffaella Bernardi
Proceedings of the Third Workshop on Vision and Language


Exploiting Language Models for Visual Recognition
Dieu-Thu Le | Jasper Uijlings | Raffaella Bernardi
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing


Query classification using topic models and support vector machine
Dieu-Thu Le | Raffaella Bernardi
Proceedings of ACL 2012 Student Research Workshop


Query classification via Topic Models for an art image archive
Dieu-Thu Le | Raffaella Bernardi | Ed Vald
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage