Hieu Hoang


2020

ParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón | Pinzhen Chen | Barry Haddow | Kenneth Heafield | Hieu Hoang | Miquel Esplà-Gomis | Mikel L. Forcada | Amir Kamran | Faheem Kirefu | Philipp Koehn | Sergio Ortiz Rojas | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Elsa Sarrías | Marek Strelec | Brian Thompson | William Waites | Dion Wiggins | Jaume Zaragoza
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

2019

ParaCrawl: Web-scale parallel corpora for the languages of the EU
Miquel Esplà | Mikel Forcada | Gema Ramírez-Sánchez | Hieu Hoang
Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks

2018

Marian: Fast Neural Machine Translation in C++
Marcin Junczys-Dowmunt | Roman Grundkiewicz | Tomasz Dwojak | Hieu Hoang | Kenneth Heafield | Tom Neckermann | Frank Seide | Ulrich Germann | Alham Fikri Aji | Nikolay Bogoychev | André F. T. Martins | Alexandra Birch
Proceedings of ACL 2018, System Demonstrations

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

Fast Neural Machine Translation Implementation
Hieu Hoang | Tomasz Dwojak | Rihards Krislauks | Daniel Torregrosa | Kenneth Heafield
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

This paper describes the submissions to the efficiency track for GPUs at the Workshop on Neural Machine Translation and Generation by members of the University of Edinburgh, Adam Mickiewicz University, Tilde and University of Alicante. We focus on efficient implementation of the recurrent deep-learning model as implemented in Amun, the fast inference engine for neural machine translation. We improve the performance with an efficient mini-batching algorithm, and by fusing the softmax operation with the k-best extraction algorithm. Submissions using Amun were first, second and third fastest in the GPU efficiency track.

Marian: Cost-effective High-Quality Neural Machine Translation in C++
Marcin Junczys-Dowmunt | Kenneth Heafield | Hieu Hoang | Roman Grundkiewicz | Anthony Aue
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

This paper describes the submissions of the “Marian” team to the WNMT 2018 shared task. We investigate combinations of teacher-student training, low-precision matrix products, auto-tuning and other methods to optimize the Transformer model on GPU and CPU. By further integrating these methods with the new averaging attention networks, a recently introduced faster Transformer variant, we create a number of high-quality, high-performance models on the GPU and CPU, dominating the Pareto frontier for this shared task.

2017

A Parallel Corpus for Evaluating Machine Translation between Arabic and European Languages
Nizar Habash | Nasser Zalmout | Dima Taji | Hieu Hoang | Maverick Alzate
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present Arab-Acquis, a large publicly available dataset for evaluating machine translation between 22 European languages and Arabic. Arab-Acquis consists of over 12,000 sentences from the JRC-Acquis (Acquis Communautaire) corpus translated twice by professional translators, once from English and once from French, and totaling over 600,000 words. The corpus follows previous data splits in the literature for tuning, development, and testing. We describe the corpus and how it was created. We also present the first benchmarking results on translating to and from Arabic for 22 European languages.

2016

Fast and highly parallelizable phrase table for statistical machine translation
Nikolay Bogoychev | Hieu Hoang
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers

2014

Integrating an Unsupervised Transliteration Model into Statistical Machine Translation
Nadir Durrani | Hassan Sajjad | Hieu Hoang | Philipp Koehn
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

Augmenting String-to-Tree and Tree-to-String Translation with Non-Syntactic Phrases
Matthias Huck | Hieu Hoang | Philipp Koehn
Proceedings of the Ninth Workshop on Statistical Machine Translation

Preference Grammars and Soft Syntactic Constraints for GHKM Syntax-based Statistical Machine Translation
Matthias Huck | Hieu Hoang | Philipp Koehn
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

2013

Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?
Nadir Durrani | Alexander Fraser | Helmut Schmid | Hieu Hoang | Philipp Koehn
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2010

More Linguistic Annotation for Statistical Machine Translation
Philipp Koehn | Barry Haddow | Philip Williams | Hieu Hoang
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Improved Translation with Source Syntax Labels
Hieu Hoang | Philipp Koehn
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

A Systematic Analysis of Translation Model Search Spaces
Michael Auli | Adam Lopez | Hieu Hoang | Philipp Koehn
Proceedings of the Fourth Workshop on Statistical Machine Translation

Improving Mid-Range Re-Ordering Using Templates of Factors
Hieu Hoang | Philipp Koehn
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

Towards better Machine Translation Quality for the German-English Language Pairs
Philipp Koehn | Abhishek Arun | Hieu Hoang
Proceedings of the Third Workshop on Statistical Machine Translation

Design of the Moses Decoder for Statistical Machine Translation
Hieu Hoang | Philipp Koehn
Software Engineering, Testing, and Quality Assurance for Natural Language Processing

Improving Interactive Machine Translation via Mouse Actions
Germán Sanchis-Trilles | Daniel Ortiz-Martínez | Jorge Civera | Francisco Casacuberta | Enrique Vidal | Hieu Hoang
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Moses: Open Source Toolkit for Statistical Machine Translation
Philipp Koehn | Hieu Hoang | Alexandra Birch | Chris Callison-Burch | Marcello Federico | Nicola Bertoldi | Brooke Cowan | Wade Shen | Christine Moran | Richard Zens | Chris Dyer | Ondřej Bojar | Alexandra Constantin | Evan Herbst
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

Factored Translation Models
Philipp Koehn | Hieu Hoang
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)