Building a Bilingual Vietnamese-French Named Entity Annotated Corpus through Cross-Linguistic Projection

Ngoc Tan Le, Fatiha Sadat


Abstract
The creation of high-quality named entity annotated resources is time-consuming and an expensive process. Most of the gold standard corpora are available for English but not for less-resourced languages such as Vietnamese. In Asian languages, this task is remained problematic. This paper focuses on an automatic construction of named entity annotated corpora for Vietnamese-French, a less-resourced pair of languages. We incrementally apply different cross-projection methods using parallel corpora, such as perfect string matching and edit distance similarity. Evaluations on Vietnamese –French pair of languages show a good accuracy (F-score of 94.90%) when identifying named entities pairs and building a named entity annotated parallel corpus.
Anthology ID:
2015.jeptalnrecital-demonstration.6
Volume:
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations
Month:
June
Year:
2015
Address:
Caen, France
Venue:
JEP/TALN/RECITAL
SIG:
Publisher:
ATALA
Note:
Pages:
12–13
Language:
URL:
https://www.aclweb.org/anthology/2015.jeptalnrecital-demonstration.6
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/2015.jeptalnrecital-demonstration.6.pdf