Jyotsana Khatri


2020

Generating Fluent Translations from Disfluent Text Without Access to Fluent References: IIT Bombay@IWSLT2020
Nikhil Saini | Jyotsana Khatri | Preethi Jyothi | Pushpak Bhattacharyya
Proceedings of the 17th International Conference on Spoken Language Translation

Machine translation systems perform reasonably well when the input is well-formed speech or text. Conversational speech, however, is spontaneous and inherently contains many disfluencies. Producing fluent translations of disfluent source text would typically require parallel disfluent-to-fluent training data, but fluent translations of spontaneous speech are an additional resource that is tedious to obtain. This work describes the submission of IIT Bombay to the Conversational Speech Translation challenge at IWSLT 2020. We specifically tackle the problem of disfluency removal in disfluent-to-fluent text-to-text translation, assuming no access to fluent references during training. Common patterns of disfluency are extracted from disfluent references, and a noise induction model is used to simulate them starting from a clean monolingual corpus. This synthetically constructed dataset is then treated as a proxy for labeled data during training. We also make use of additional fluent text in the target language to help generate fluent translations. Our system uses no fluent references during training and outperforms baseline models by margins of 4.21 and 3.11 BLEU points, where the baselines use disfluent and fluent references, respectively. Index Terms: disfluency removal, machine translation, noise induction, leveraging monolingual data, denoising for disfluency removal.
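The noise induction step described in the abstract lends itself to a small illustration. Below is a minimal Python sketch, not the authors' model: it hard-codes two common disfluency patterns (filler insertion and word repetition) and applies them to clean monolingual text. The filler list, probabilities, and function names are illustrative assumptions.

```python
import random

# Illustrative filler patterns; the paper extracts such patterns from
# disfluent references, whereas these are hand-picked examples.
FILLERS = ["um", "uh", "you know", "I mean"]

def induce_disfluency(sentence, p_filler=0.1, p_repeat=0.1):
    """Turn a clean sentence into a synthetically disfluent one by
    inserting filler words and repeating tokens."""
    noised = []
    for token in sentence.split():
        if random.random() < p_filler:
            noised.append(random.choice(FILLERS))  # insert a filler word
        noised.append(token)
        if random.random() < p_repeat:
            noised.append(token)  # simple word repetition
    return " ".join(noised)

# Pairing each clean sentence with its noised version yields synthetic
# disfluent-to-fluent training data from a monolingual corpus alone.
clean = "the meeting starts at nine"
print(clean, "->", induce_disfluency(clean))
```

Pairs of (noised, clean) sentences built this way would stand in for the labeled disfluent-to-fluent data that the abstract says is otherwise tedious to obtain.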

Filtering Back-Translated Data in Unsupervised Neural Machine Translation
Jyotsana Khatri | Pushpak Bhattacharyya
Proceedings of the 28th International Conference on Computational Linguistics

Unsupervised neural machine translation (NMT) utilizes only monolingual data for training. The quality of back-translated data plays an important role in the performance of such systems, and in back-translation not all generated pseudo-parallel sentence pairs are of the same quality. Taking inspiration from domain adaptation, where in-domain sentences are given more weight during training, we propose an approach to filter back-translated data as part of the training process of unsupervised NMT. Our approach gives more weight to good pseudo-parallel sentence pairs in the back-translation phase. We calculate the weight of each pseudo-parallel sentence pair using its sentence-wise round-trip BLEU score, normalized batch-wise. We compare our approach with current state-of-the-art approaches for unsupervised NMT.
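The weighting scheme in this abstract is concrete enough to sketch. The following Python snippet, assuming sacrebleu is installed, scores each pseudo-parallel pair by sentence-level round-trip BLEU (original sentence vs. its translate-then-back-translate reconstruction) and normalizes within the batch by dividing by the batch maximum. The normalization choice and all names here are illustrative; the paper's exact formula may differ.

```python
import sacrebleu

def batch_weights(originals, round_trips):
    """Weight each pseudo-parallel pair by its sentence-level round-trip
    BLEU, normalized within the batch."""
    scores = [sacrebleu.sentence_bleu(rt, [orig]).score
              for orig, rt in zip(originals, round_trips)]
    max_score = max(scores) or 1.0  # guard against an all-zero batch
    return [s / max_score for s in scores]

# Toy batch: each original monolingual sentence alongside its round-trip
# reconstruction; closer reconstructions receive weights nearer to 1.
originals = ["the cat sat on the mat", "he went to the market"]
round_trips = ["the cat sat on a mat", "she goes to market"]
print(batch_weights(originals, round_trips))
```

In training, such per-pair weights would scale the loss of each back-translated sentence pair, so that noisier pairs contribute less to the update.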

2019

Utilizing Monolingual Data in NMT for Similar Languages: Submission to Similar Language Translation Task
Jyotsana Khatri | Pushpak Bhattacharyya
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper describes our submission to the Shared Task on Similar Language Translation at the Fourth Conference on Machine Translation (WMT 2019). We submitted three systems for the Hindi -> Nepali direction, examining the performance of an RNN-based NMT system, a semi-supervised NMT system in which monolingual data of both languages is utilized via an existing architecture, and a system trained with extra synthetic sentences generated by copying source and target sentences, without using any additional monolingual data.
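The copy-based synthetic data idea behind the third system can be illustrated in a few lines. The Python sketch below, with illustrative names and toy data, pairs each source and each target sentence with a copy of itself and appends these to the existing parallel pairs; it is one plausible reading of the abstract, not the authors' exact procedure.

```python
def copy_augment(parallel_pairs):
    """Append (src, src) and (tgt, tgt) copy pairs to the original
    (src, tgt) pairs, enlarging the training data without any
    additional monolingual corpus."""
    augmented = list(parallel_pairs)
    for src, tgt in parallel_pairs:
        augmented.append((src, src))  # source copied as its own target
        augmented.append((tgt, tgt))  # target copied likewise
    return augmented

# Toy parallel corpus; real data would be Hindi-Nepali sentence pairs.
pairs = [("hindi sentence", "nepali sentence")]
for src, tgt in copy_augment(pairs):
    print(src, "|||", tgt)
```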