Michel Simard


2020

Workshop on the Impact of Machine Translation (iMpacT 2020)
Sharon O'Brien | Michel Simard
Workshop on the Impact of Machine Translation (iMpacT 2020)

Human or Neural Translation?
Shivendra Bhardwaj | David Alfonso Hermelo | Philippe Langlais | Gabriel Bernier-Colborne | Cyril Goutte | Michel Simard
Proceedings of the 28th International Conference on Computational Linguistics

Deep neural models have tremendously improved machine translation. In this context, we investigate whether distinguishing machine from human translations is still feasible. We trained and applied 18 classifiers under two settings: a monolingual task, in which the classifier only looks at the translation; and a bilingual task, in which the source text is also taken into consideration. We report on extensive experiments involving 4 neural MT systems (Google Translate, DeepL, as well as two systems we trained) and varying the domain of texts. We show that the bilingual task is the easiest one and that transfer-based deep-learning classifiers perform best, with mean accuracies around 85% in-domain and 75% out-of-domain.

2019

Fully Unsupervised Crosslingual Semantic Textual Similarity Metric Based on BERT for Identifying Parallel Data
Chi-kiu Lo | Michel Simard
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

We present a fully unsupervised crosslingual semantic textual similarity (STS) metric, based on contextual embeddings extracted from BERT – Bidirectional Encoder Representations from Transformers (Devlin et al., 2019). The goal of crosslingual STS is to measure to what degree two segments of text in different languages express the same meaning. Not only is it a key task in crosslingual natural language understanding (XLU), it is also particularly useful for identifying parallel resources for training and evaluating downstream multilingual natural language processing (NLP) applications, such as machine translation. Most previous crosslingual STS methods relied heavily on existing parallel resources, thus leading to a circular dependency problem. With the advent of massively multilingual context representation models such as BERT, which are trained on the concatenation of non-parallel data from each language, we show that the deadlock around parallel resources can be broken. We perform intrinsic evaluations on crosslingual STS data sets and extrinsic evaluations on parallel corpus filtering and human translation equivalence assessment tasks. Our results show that the unsupervised crosslingual STS metric using BERT without fine-tuning achieves performance on par with supervised or weakly supervised approaches.

2018

Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task
Patrick Littell | Samuel Larkin | Darlene Stewart | Michel Simard | Cyril Goutte | Chi-kiu Lo
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g. past WMT data) as a supervisory signal to learn what a “clean” corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves an additional constraint that we not utilize the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task—translating the European Medicines Agency corpus (Tiedemann, 2009)—scored among the best systems even in the 10M-word conditions.

Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the Parallel Corpus Filtering task
Chi-kiu Lo | Michel Simard | Darlene Stewart | Samuel Larkin | Cyril Goutte | Patrick Littell
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present our semantic textual similarity approach to filtering a noisy web-crawled parallel corpus using YiSi, a novel semantic machine translation evaluation metric. The systems mainly based on this supervised approach perform well in the WMT18 Parallel Corpus Filtering shared task (4th place in the 100-million-word evaluation, 8th place in the 10-million-word evaluation, and 6th place overall, out of 48 submissions). In fact, our best-performing system, NRC-yisi-bicov, is one of only four submissions ranked in the top 10 in both evaluations. Our submitted systems also include some initial filtering steps for scaling down the size of the test corpus and a final redundancy removal step for better semantic and token coverage of the filtered corpus. In this paper, we also describe our unsuccessful attempt at automatically synthesizing a noisy parallel development corpus for tuning the weights used to combine different parallelism and fluency features.

2016

CNRC at SemEval-2016 Task 1: Experiments in Crosslingual Semantic Textual Similarity
Chi-kiu Lo | Cyril Goutte | Michel Simard
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2014

CNRC-TMT: Second Language Writing Assistant System Description
Cyril Goutte | Michel Simard | Marine Carpuat
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2012

Book Review: Bitext Alignment by Jörg Tiedemann
Michel Simard
Computational Linguistics, Volume 38, Issue 2 - June 2012

The Trouble with SMT Consistency
Marine Carpuat | Michel Simard
Proceedings of the Seventh Workshop on Statistical Machine Translation

2007

Statistical Phrase-Based Post-Editing
Michel Simard | Cyril Goutte | Pierre Isabelle
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

NRC’s PORTAGE System for WMT 2007
Nicola Ueffing | Michel Simard | Samuel Larkin | Howard Johnson
Proceedings of the Second Workshop on Statistical Machine Translation

Rule-Based Translation with Statistical Phrase-Based Post-Editing
Michel Simard | Nicola Ueffing | Pierre Isabelle | Roland Kuhn
Proceedings of the Second Workshop on Statistical Machine Translation

2006

PORTAGE: with Smoothed Phrase Tables and Segment Choice Models
Howard Johnson | Fatiha Sadat | George Foster | Roland Kuhn | Michel Simard | Eric Joanis | Samuel Larkin
Proceedings on the Workshop on Statistical Machine Translation

Segment Choice Models: Feature-Rich Models for Global Distortion in Statistical Machine Translation
Roland Kuhn | Denis Yuen | Michel Simard | Patrick Paul | George Foster | Eric Joanis | Howard Johnson
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

2005

Translating with Non-contiguous Phrases
Michel Simard | Nicola Cancedda | Bruno Cavestro | Marc Dymetman | Eric Gaussier | Cyril Goutte | Kenji Yamada | Philippe Langlais | Arne Mauser
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2003

Statistical Translation Alignment with Compositionality Constraints
Michel Simard | Philippe Langlais
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

Translation Spotting for Translation Memories
Michel Simard
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval
Wessel Kraaij | Jian-Yun Nie | Michel Simard
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus

2000

TransSearch: A Free Translation Memory on the World Wide Web
Elliott Macklovitch | Michel Simard | Philippe Langlais
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1999

Text-Translation Alignment: Three Languages Are Better Than Two
Michel Simard
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

1998

Methods and Practical Issues in Evaluating Alignment Techniques
Philippe Langlais | Michel Simard | Jean Veronis
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

Methods and Practical Issues in Evaluating Alignment Techniques
Philippe Langlais | Michel Simard | Jean Veronis
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

Automatic Insertion of Accents in French Text
Michel Simard
Proceedings of the Third Conference on Empirical Methods for Natural Language Processing

1996

Bilingual sentence alignment: balancing robustness and accuracy
Michel Simard | Pierre Plamondon
Conference of the Association for Machine Translation in the Americas