Masoud Jalili Sabet


2020

pdf bib
SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings
Masoud Jalili Sabet | Philipp Dufter | François Yvon | Hinrich Schütze
Findings of the Association for Computational Linguistics: EMNLP 2020

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings – both static and contextualized – for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners – even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.

2016

pdf bib
Learning to Weight Translations using Ordinal Linear Regression and Query-generated Training Data for Ad-hoc Retrieval with Long Queries
Javid Dadashkarimi | Masoud Jalili Sabet | Azadeh Shakery
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Ordinal regression which is known with learning to rank has long been used in information retrieval (IR). Learning to rank algorithms, have been tailored in document ranking, information filtering, and building large aligned corpora successfully. In this paper, we propose to use this algorithm for query modeling in cross-language environments. To this end, first we build a query-generated training data using pseudo-relevant documents to the query and all translation candidates. The pseudo-relevant documents are obtained by top-ranked documents in response to a translation of the original query. The class of each candidate in the training data is determined based on presence/absence of the candidate in the pseudo-relevant documents. We learn an ordinal regression model to score the candidates based on their relevance to the context of the query, and after that, we construct a query-dependent translation model using a softmax function. Finally, we re-weight the query based on the obtained model. Experimental results on French, German, Spanish, and Italian CLEF collections demonstrate that the proposed method achieves better results compared to state-of-the-art cross-language information retrieval methods, particularly in long queries with large training data.

pdf bib
Improving Word Alignment of Rare Words with Word Embeddings
Masoud Jalili Sabet | Heshaam Faili | Gholamreza Haffari
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We address the problem of inducing word alignment for language pairs by developing an unsupervised model with the capability of getting applied to other generative alignment models. We approach the task by: i)proposing a new alignment model based on the IBM alignment model 1 that uses vector representation of words, and ii)examining the use of similar source words to overcome the problem of rare source words and improving the alignments. We apply our method to English-French corpora and run the experiments with different sizes of sentence pairs. Our results show competitive performance against the baseline and in some cases improve the results up to 6.9% in terms of precision.

pdf bib
An Unsupervised Method for Automatic Translation Memory Cleaning
Masoud Jalili Sabet | Matteo Negri | Marco Turchi | Eduard Barbu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
TMop: a Tool for Unsupervised Translation Memory Cleaning
Masoud Jalili Sabet | Matteo Negri | Marco Turchi | José G. C. de Souza | Marcello Federico
Proceedings of ACL-2016 System Demonstrations