Ivan Titov


2020

Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation
Biao Zhang | Philip Williams | Ivan Titov | Rico Sennrich
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations. In this paper, we explore ways to improve them. We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics, and overcome this bottleneck via language-specific components and deepening NMT architectures. We identify the off-target translation issue (i.e. translating into a wrong target language) as the major source of the inferior zero-shot performance, and propose random online backtranslation to enforce the translation of unseen training language pairs. Experiments on OPUS-100 (a novel multilingual dataset with 100 languages) show that our approach substantially narrows the performance gap with bilingual models in both one-to-many and many-to-many settings, and improves zero-shot performance by ~10 BLEU, approaching conventional pivot-based methods.

Unsupervised Opinion Summarization as Copycat-Review Generation
Arthur Bražinskas | Mirella Lapata | Ivan Titov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Opinion summarization is the task of automatically creating summaries that reflect subjective information expressed in multiple documents, such as product reviews. While the majority of previous work has focused on the extractive setting, i.e., selecting fragments from input reviews to produce a summary, we let the model generate novel sentences and hence produce abstractive summaries. Recent progress in summarization has seen the development of supervised models which rely on large quantities of document-summary pairs. Since such training data is expensive to acquire, we instead consider the unsupervised setting; in other words, we do not use any summaries in training. We define a generative model for a review collection which capitalizes on the intuition that when generating a new review given a set of other reviews of a product, we should be able to control the “amount of novelty” going into the new review or, equivalently, vary the extent to which it deviates from the input. At test time, when generating summaries, we force the novelty to be minimal, and produce a text reflecting consensus opinions. We capture this intuition by defining a hierarchical variational autoencoder model. Both individual reviews and the products they correspond to are associated with stochastic latent codes, and the review generator (“decoder”) has direct access to the text of input reviews through the pointer-generator mechanism. Experiments on Amazon and Yelp datasets show that setting the review’s latent code to its mean at test time allows the model to produce fluent and coherent summaries reflecting common opinions.

Adaptive Feature Selection for End-to-End Speech Translation
Biao Zhang | Ivan Titov | Barry Haddow | Rico Sennrich
Findings of the Association for Computational Linguistics: EMNLP 2020

Information in speech signals is not evenly distributed, making it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR. An ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0DROP (Zhang et al., 2020) as the backbone for AFS, and adapt it to sparsify speech features with respect to both temporal and feature dimensions. Results on LibriSpeech En-Fr and MuST-C benchmarks show that AFS facilitates learning of ST by pruning out ~84% temporal features, yielding an average translation gain of ~1.3-1.6 BLEU and a decoding speedup of ~1.4x. In particular, AFS reduces the performance gap compared to the cascade baseline, and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).

Information-Theoretic Probing with Minimum Description Length
Elena Voita | Ivan Titov
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

To measure how well pretrained representations encode some linguistic property, it is common to use accuracy of a probe, i.e. a classifier trained to predict the property from the representations. Despite widespread adoption of probes, differences in their accuracy fail to adequately reflect differences in representations. For example, they do not substantially favour pretrained representations over randomly initialized ones. Analogously, their accuracy can be similar when probing for genuine linguistic labels and probing for random synthetic tasks. To see reasonable differences in accuracy with respect to these random baselines, previous work had to constrain either the amount of probe training data or its model size. Instead, we propose an alternative to the standard probes, information-theoretic probing with minimum description length (MDL). With MDL probing, training a probe to predict labels is recast as teaching it to effectively transmit the data. Therefore, the measure of interest changes from probe accuracy to the description length of labels given representations. In addition to probe quality, the description length evaluates “the amount of effort” needed to achieve the quality. This amount of effort characterizes either (i) size of a probing model, or (ii) the amount of data needed to achieve the high quality. We consider two methods for estimating MDL which can be easily implemented on top of the standard probing pipelines: variational coding and online coding. We show that these methods agree in results and are more informative and stable than the standard probes.
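
The online-coding estimate mentioned in this abstract can be illustrated with a small sketch. It assumes the usual prequential setup: the first block of labels is sent with a uniform code over the classes, and every later block is encoded with a probe trained on all earlier blocks. The function names and the way the per-block log-probabilities are supplied are assumptions made for illustration, not the authors' code.

```python
import math


def online_codelength(block_log2_probs, first_block_size, num_classes):
    """Online (prequential) codelength of the labels, in bits.

    block_log2_probs: one number per later block, each the summed
        log2-probability that a probe trained on all earlier blocks
        assigns to that block's true labels (assumed to be precomputed).
    first_block_size: number of examples sent with the uniform code.
    num_classes: number of label classes.
    """
    uniform_bits = first_block_size * math.log2(num_classes)
    model_bits = -sum(block_log2_probs)
    return uniform_bits + model_bits


def compression(codelength_bits, num_examples, num_classes):
    """Uniform codelength divided by the achieved codelength."""
    return num_examples * math.log2(num_classes) / codelength_bits
```

A shorter codelength means the labels are easier to extract from the representations, either because the final probe is accurate or because it becomes accurate after seeing little data.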

How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking
Nicola De Cao | Michael Sejr Schlichtkrull | Wilker Aziz | Ivan Titov
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Attribution methods assess the contribution of inputs to the model prediction. One way to do so is erasure: a subset of inputs is considered irrelevant if it can be removed without affecting the prediction. Though conceptually simple, erasure’s objective is intractable and approximate search remains expensive with modern deep NLP models. Erasure is also susceptible to the hindsight bias: the fact that an input can be dropped does not mean that the model ‘knows’ it can be dropped. The resulting pruning is over-aggressive and does not reflect how the model arrives at the prediction. To deal with these challenges, we introduce Differentiable Masking. DiffMask learns to mask-out subsets of the input while maintaining differentiability. The decision to include or disregard an input token is made with a simple model based on intermediate hidden layers of the analyzed model. First, this makes the approach efficient because we predict rather than search. Second, as with probing classifiers, this reveals what the network ‘knows’ at the corresponding layers. This lets us not only plot attribution heatmaps but also analyze how decisions are formed across network layers. We use DiffMask to study BERT models on sentiment classification and question answering.

Graph Convolutions over Constituent Trees for Syntax-Aware Semantic Role Labeling
Diego Marcheggiani | Ivan Titov
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Semantic role labeling (SRL) is the task of identifying predicates and labeling argument spans with semantic roles. Even though most semantic-role formalisms are built upon constituent syntax, and only syntactic constituents can be labeled as arguments (e.g., FrameNet and PropBank), all the recent work on syntax-aware SRL relies on dependency representations of syntax. In contrast, we show how graph convolutional networks (GCNs) can be used to encode constituent structures and inform an SRL system. Nodes in our SpanGCN correspond to constituents. The computation is done in 3 stages. First, initial node representations are produced by ‘composing’ word representations of the first and last words in the constituent. Second, graph convolutions relying on the constituent tree are performed, yielding syntactically-informed constituent representations. Finally, the constituent representations are ‘decomposed’ back into word representations, which are used as input to the SRL classifier. We evaluate SpanGCN against alternatives, including a model using GCNs over dependency trees, and show its effectiveness on standard English SRL benchmarks CoNLL-2005, CoNLL-2012, and FrameNet.

Few-Shot Learning for Opinion Summarization
Arthur Bražinskas | Mirella Lapata | Ivan Titov
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents, such as user reviews of a product. The task is practically important and has attracted a lot of attention. However, due to the high cost of summary production, datasets large enough for training supervised models are lacking. Instead, the task has been traditionally approached with extractive methods that learn to select text fragments in an unsupervised or weakly-supervised way. Recently, it has been shown that abstractive summaries, potentially more fluent and better at reflecting conflicting information, can also be produced in an unsupervised fashion. However, these models, not being exposed to actual summaries, fail to capture their essential properties. In this work, we show that even a handful of summaries is sufficient to bootstrap generation of the summary text with all expected properties, such as writing style, informativeness, fluency, and sentiment preservation. We start by training a conditional Transformer language model to generate a new product review given other available reviews of the product. The model is also conditioned on review properties that are directly related to summaries; the properties are derived from reviews with no manual effort. In the second stage, we fine-tune a plug-in module that learns to predict property values on a handful of summaries. This lets us switch the generator to the summarization mode. We show on Amazon and Yelp datasets that our approach substantially outperforms previous extractive and abstractive methods in automatic and human evaluation.

Visually Grounded Compound PCFGs
Yanpeng Zhao | Ivan Titov
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Exploiting visual groundings for language understanding has recently been drawing much attention. In this work, we study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual groundings. Existing work on this task (Shi et al., 2019) optimizes a parser via Reinforce and derives the learning signal only from the alignment of images and sentences. While their model is relatively accurate overall, its error distribution is very uneven, with low performance on certain constituent types (e.g., 26.2% recall on verb phrases, VPs) and high on others (e.g., 79.6% recall on noun phrases, NPs). This is not surprising as the learning signal is likely insufficient for deriving all aspects of phrase-structure syntax and gradient estimates are noisy. We show that using an extension of the probabilistic context-free grammar model, we can do fully-differentiable end-to-end visually grounded learning. Additionally, this enables us to complement the image-text alignment loss with a language modeling objective. On the MSCOCO test captions, our model establishes a new state of the art, outperforming its non-grounded version and, thus, confirming the effectiveness of visual groundings in constituency grammar induction. It also substantially outperforms the previous grounded model, with the largest improvements on more ‘abstract’ categories (e.g., +55.1% recall on VPs).

Detecting Word Sense Disambiguation Biases in Machine Translation for Model-Agnostic Adversarial Attacks
Denis Emelin | Ivan Titov | Rico Sennrich
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Word sense disambiguation is a well-known source of translation errors in NMT. We posit that some of the incorrect disambiguation choices are due to models’ over-reliance on dataset artifacts found in training data, specifically superficial word co-occurrences, rather than a deeper understanding of the source text. We introduce a method for the prediction of disambiguation errors based on statistical data properties, demonstrating its effectiveness across several domains and model types. Moreover, we develop a simple adversarial attack strategy that minimally perturbs sentences in order to elicit disambiguation errors to further probe the robustness of translation models. Our findings indicate that disambiguation robustness varies substantially between domains and that different models trained on the same data are vulnerable to different attacks.

Obfuscation for Privacy-preserving Syntactic Parsing
Zhifeng Hu | Serhii Havrylov | Ivan Titov | Shay B. Cohen
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

The goal of homomorphic encryption is to encrypt data such that another party can operate on it without being explicitly exposed to the content of the original data. We introduce an idea for a privacy-preserving transformation on natural language data, inspired by homomorphic encryption. Our primary tool is obfuscation, relying on the properties of natural language. Specifically, a given English text is obfuscated using a neural model that aims to preserve the syntactic relationships of the original sentence so that the obfuscated sentence can be parsed instead of the original one. The model works at the word level, and learns to obfuscate each word separately by changing it into a new word that has a similar syntactic role. The text obfuscated by our model leads to better performance on three syntactic parsers (two dependency and one constituency parsers) in comparison to an upper-bound random substitution baseline. More specifically, the results demonstrate that as more terms are obfuscated (by their part of speech), the substitution upper bound significantly degrades, while the neural model maintains a relatively high performing parser. All of this is done without much sacrifice of privacy compared to the random substitution upper bound. We also further analyze the results, and discover that the substituted words have similar syntactic properties, but different semantic content, compared to the original words.

2019

Context-Aware Monolingual Repair for Neural Machine Translation
Elena Voita | Rico Sennrich | Ivan Titov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Modern sentence-level NMT systems often produce plausible translations of isolated sentences. However, when put in context, these translations may end up being inconsistent with each other. We propose a monolingual DocRepair model to correct inconsistencies between sentence-level translations. DocRepair performs automatic post-editing on a sequence of sentence-level translations, refining translations of sentences in context of each other. For training, the DocRepair model requires only monolingual document-level data in the target language. It is trained as a monolingual sequence-to-sequence model that maps inconsistent groups of sentences into consistent ones. The consistent groups come from the original training data; the inconsistent groups are obtained by sampling round-trip translations for each isolated sentence. We show that this approach successfully imitates inconsistencies we aim to fix: using contrastive evaluation, we show large improvements in the translation of several contextual phenomena in an English-Russian translation task, as well as improvements in the BLEU score. We also conduct a human evaluation and show a strong preference of the annotators to corrected translations over the baseline ones. Moreover, we analyze which discourse phenomena are hard to capture using monolingual data only.

Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
Biao Zhang | Ivan Titov | Rico Sennrich
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor due to gradient vanishing caused by the interaction between residual connection and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage, and reduces output variance of residual connections so as to ease gradient back-propagation through normalization layers. To address computational cost, we propose a merged attention sublayer (MAtt) which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt. Source code for reproduction will be released soon.
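
As a rough illustration of the depth-scaled initialization idea in this abstract, the sketch below shrinks a Glorot-style uniform bound for deeper layers. The exact scaling rule (dividing by the square root of the layer depth), the `alpha` hyperparameter and all function names are assumptions for illustration; the abstract itself only states that parameter variance is decreased at initialization.

```python
import math
import random


def ds_init_bound(fan_in, fan_out, layer_depth, alpha=1.0):
    """Uniform-init bound, reduced as the layer sits deeper in the stack.

    layer_depth is 1 for the bottom layer. The alpha / sqrt(depth)
    factor is an assumed form of the depth scaling.
    """
    if layer_depth < 1:
        raise ValueError("layer_depth must be >= 1")
    glorot = math.sqrt(6.0 / (fan_in + fan_out))
    return glorot * alpha / math.sqrt(layer_depth)


def ds_init_matrix(fan_in, fan_out, layer_depth, alpha=1.0, rng=None):
    """Sample a fan_in x fan_out weight matrix within the scaled bound."""
    rng = rng or random.Random(0)
    bound = ds_init_bound(fan_in, fan_out, layer_depth, alpha)
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

Under this assumed rule, the weights of layer 4 are drawn from a range half as wide as that of layer 1, so the residual branches of deeper layers start out contributing less variance.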

Semantic Role Labeling with Iterative Structure Refinement
Chunchuan Lyu | Shay B. Cohen | Ivan Titov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Modern state-of-the-art Semantic Role Labeling (SRL) methods rely on expressive sentence encoders (e.g., multi-layer LSTMs) but tend to model only local (if any) interactions between individual argument labeling decisions. This contrasts with earlier work and also with the intuition that the labels of individual arguments are strongly interdependent. We model interactions between argument labeling decisions through iterative refinement. Starting with an output produced by a factorized model, we iteratively refine it using a refinement network. Instead of modeling arbitrary interactions among roles and words, we encode prior knowledge about the SRL problem by designing a restricted network architecture capturing non-local interactions. This modeling choice prevents overfitting and results in an effective model, outperforming strong factorized baseline models on all 7 CoNLL-2009 languages, and achieving state-of-the-art results on 5 of them, including English.

Learning Semantic Parsers from Denotations with Latent Structured Alignments and Abstract Programs
Bailin Wang | Ivan Titov | Mirella Lapata
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Semantic parsing aims to map natural language utterances onto machine interpretable meaning representations, i.e., programs whose execution against a real-world environment produces a denotation. Weakly-supervised semantic parsers are trained on utterance-denotation pairs treating programs as latent. The task is challenging due to the large search space and spuriousness of programs which may execute to the correct answer but do not generalize to unseen examples. Our goal is to instill an inductive bias in the parser to help it distinguish between spurious and correct programs. We capitalize on the intuition that correct programs would likely respect certain structural constraints were they to be aligned to the question (e.g., program fragments are unlikely to align to overlapping text spans) and propose to model alignments as structured latent variables. In order to make the latent-alignment framework tractable, we decompose the parsing task into (1) predicting a partial “abstract program” and (2) refining it while modeling structured alignments with differentiable dynamic programming. We obtain state-of-the-art performance on the WikiTableQuestions and WikiSQL datasets. When compared to a standard attention baseline, we observe that the proposed structured-alignment mechanism is highly beneficial.

The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives
Elena Voita | Rico Sennrich | Ivan Titov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We seek to understand how the representations of individual tokens and the structure of the learned feature space evolve between layers in deep neural networks under different learning objectives. We chose the Transformer for our analysis as it has been shown to be effective on various tasks, including machine translation (MT), standard left-to-right language modeling (LM) and masked language modeling (MLM). Previous work used black-box probing tasks to show that the representations learned by the Transformer differ significantly depending on the objective. In this work, we use canonical correlation analysis and mutual information estimators to study how information flows across Transformer layers and observe that the choice of the objective determines this process. For example, as you go from bottom to top layers, information about the past in left-to-right language models vanishes and predictions about the future are formed. In contrast, for MLM, representations initially acquire information about the context around the token, partially forgetting the token identity and producing a more generalized token representation. The token identity is then recreated at the top MLM layers.

Capturing Argument Interaction in Semantic Role Labeling with Capsule Networks
Xinchi Chen | Chunchuan Lyu | Ivan Titov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Semantic role labeling (SRL) involves extracting propositions (i.e. predicates and their typed arguments) from natural language sentences. State-of-the-art SRL models rely on powerful encoders (e.g., LSTMs) and do not model non-local interaction between arguments. We propose a new approach to modeling these interactions while maintaining efficient inference. Specifically, we use Capsule Networks (Sabour et al., 2017): each proposition is encoded as a tuple of capsules, one capsule per argument type (i.e. role). These tuples serve as embeddings of entire propositions. In every network layer, the capsules interact with each other and with representations of words in the sentence. Each iteration results in updated proposition embeddings and updated predictions about the SRL structure. Our model substantially outperforms the non-refinement baseline model on all 7 CoNLL-2009 languages and achieves state-of-the-art results on 5 languages (including English) for dependency SRL. We analyze the types of mistakes corrected by the refinement procedure. For example, each role is typically (but not always) filled with at most one argument. Whereas enforcing this approximate constraint is not useful with modern SRL systems, the iterative procedure corrects the mistakes by capturing this intuition in a flexible and context-sensitive way.

When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion
Elena Voita | Rico Sennrich | Ivan Titov
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Though machine translation errors caused by the lack of context beyond one sentence have long been acknowledged, the development of context-aware NMT systems is hampered by several problems. Firstly, standard metrics are not sensitive to improvements in consistency in document-level translations. Secondly, previous work on context-aware NMT assumed that the sentence-aligned parallel data consisted of complete documents while in most practical scenarios such document-level data constitutes only a fraction of the available parallel data. To address the first issue, we perform a human study on an English-Russian subtitles dataset and identify deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We then create test sets targeting these phenomena. To address the second shortcoming, we consider a set-up in which a much larger amount of sentence-level data is available compared to that aligned at the document level. We introduce a model that is suitable for this scenario and demonstrate major gains over a context-agnostic baseline on our new benchmarks without sacrificing performance as measured with BLEU.

Boosting Entity Linking Performance by Leveraging Unlabeled Documents
Phong Le | Ivan Titov
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Modern entity linking systems rely on large collections of documents specifically annotated for the task (e.g., AIDA CoNLL). In contrast, we propose an approach which exploits only naturally occurring information: unlabeled documents and Wikipedia. Our approach consists of two stages. First, we construct a high-recall list of candidate entities for each mention in an unlabeled document. Second, we use the candidate lists as weak supervision to constrain our document-level entity linking model. The model treats entities as latent variables and, when estimated on a collection of unlabeled texts, learns to choose entities relying both on local context of each mention and on coherence with other entities in the document. The resulting approach rivals fully-supervised state-of-the-art systems on standard test sets. It also approaches their performance in a very challenging setting: when tested on a test set sampled from the data used to estimate the supervised systems. By comparing to Wikipedia-only training of our model, we demonstrate that modeling unlabeled documents is beneficial.

Interpretable Neural Predictions with Differentiable Binary Variables
Jasmijn Bastings | Wilker Aziz | Ivan Titov
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The success of neural networks comes hand in hand with a desire for more interpretability. We focus on text classifiers and make them more interpretable by having them provide a justification–a rationale–for their predictions. We approach this problem by jointly training two neural network models: a latent model that selects a rationale (i.e. a short and informative part of the input text), and a classifier that learns from the words in the rationale alone. Previous work proposed to assign binary latent masks to input positions and to promote short selections via sparsity-inducing penalties such as L0 regularisation. We propose a latent model that mixes discrete and continuous behaviour allowing at the same time for binary selections and gradient-based training without REINFORCE. In our formulation, we can tractably compute the expected value of penalties such as L0, which allows us to directly optimise the model towards a pre-specified text selection rate. We show that our approach is competitive with previous work on rationale extraction, and explore further uses in attention mechanisms.

Distant Learning for Entity Linking with Automatic Noise Detection
Phong Le | Ivan Titov
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Accurate entity linkers have been produced for domains and languages where annotated data (i.e., texts linked to a knowledge base) is available. However, little progress has been made for the settings where no or very limited amounts of labeled data are present (e.g., legal or most scientific domains). In this work, we show how we can learn to link mentions without having any labeled examples, only a knowledge base and a collection of unannotated texts from the corresponding domain. In order to achieve this, we frame the task as a multi-instance learning problem and rely on surface matching to create initial noisy labels. As the learning signal is weak and our surrogate labels are noisy, we introduce a noise detection component in our model: it lets the model detect and disregard examples which are likely to be noisy. Our method, jointly learning to detect noise and link entities, greatly outperforms the surface matching baseline. For a subset of entity categories, it even approaches the performance of supervised learning.

Learning Latent Trees with Stochastic Perturbations and Differentiable Dynamic Programming
Caio Corro | Ivan Titov
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We treat projective dependency trees as latent variables in our probabilistic model and induce them in such a way as to be beneficial for a downstream task, without relying on any direct tree supervision. Our approach relies on Gumbel perturbations and differentiable dynamic programming. Unlike previous approaches to latent tree learning, we stochastically sample global structures and our parser is fully differentiable. We illustrate its effectiveness on sentiment analysis and natural language inference tasks. We also study its properties on a synthetic structure induction task. Ablation studies emphasize the importance of both stochasticity and constraining latent structures to be projective trees.
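
The Gumbel perturbations mentioned in this abstract build on the Gumbel-max trick, sketched below for the simplest case of a categorical choice: adding independent Gumbel noise to log-weights and taking the argmax yields an exact sample from the corresponding distribution. This is only an illustration of the underlying trick under that simplification; the paper perturbs tree scores and replaces the argmax with a differentiable dynamic program, which is not shown here. Function names are hypothetical.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform."""
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def perturb_and_argmax(log_weights, rng):
    """Sample an index with probability proportional to exp(log_weight)."""
    perturbed = [w + gumbel_noise(rng) for w in log_weights]
    return max(range(len(perturbed)), key=perturbed.__getitem__)
```

Repeated calls with log-weights `[log 0.7, log 0.2, log 0.1]` should pick index 0 in roughly 70% of the draws.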

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
Elena Voita | David Talbot | Fedor Moiseev | Rico Sennrich | Ivan Titov
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads to the overall performance of the model and analyze the roles played by them in the encoder. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU.
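
A stochastic gate with a differentiable relaxation of the L0 penalty, as referred to in this abstract, is commonly implemented with the Hard Concrete distribution of Louizos et al. The sketch below assumes that parameterization; the constants `beta`, `gamma` and `zeta` are commonly used defaults, and the function names are hypothetical rather than taken from the authors' code.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hard_concrete_gate(log_alpha, u, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Gate value in [0, 1] for one head, given uniform noise u in (0, 1)."""
    s = _sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / beta)
    stretched = s * (zeta - gamma) + gamma  # stretch to (gamma, zeta)
    return min(1.0, max(0.0, stretched))    # clip, so exact 0 and 1 occur


def prob_gate_open(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Probability that the gate is non-zero (differentiable in log_alpha)."""
    return _sigmoid(log_alpha - beta * math.log(-gamma / zeta))


def expected_l0(log_alphas):
    """Relaxed L0 penalty: expected number of heads left open."""
    return sum(prob_gate_open(a) for a in log_alphas)
```

Each head's output would be multiplied by its gate during training; adding `expected_l0` to the loss pushes `log_alpha` down for heads the model can do without, and heads whose gate closes can then be removed.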

Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts
Denis Emelin | Ivan Titov | Rico Sennrich
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)

The transformer is a state-of-the-art neural translation model that uses attention to iteratively refine lexical representations with information drawn from the surrounding context. Lexical features are fed into the first layer and propagated through a deep network of hidden layers. We argue that the need to represent and propagate lexical features in each layer limits the model’s capacity for learning and representing other information relevant to the task. To alleviate this bottleneck, we introduce gated shortcut connections between the embedding layer and each subsequent layer within the encoder and decoder. This enables the model to access relevant lexical content dynamically, without expending limited resources on storing it within intermediate states. We show that the proposed modification yields consistent improvements over a baseline transformer on standard WMT translation tasks in 5 translation directions (0.9 BLEU on average) and reduces the amount of lexical information passed along the hidden layers. We furthermore evaluate different ways to integrate lexical connections into the transformer architecture and present ablation experiments exploring the effect of proposed shortcuts on model behavior.

pdf bib
Single Document Summarization as Tree Induction
Yang Liu | Ivan Titov | Mirella Lapata
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In this paper, we conceptualize single-document extractive summarization as a tree induction problem. In contrast to previous approaches which have relied on linguistically motivated document representations to generate summaries, our model induces a multi-root dependency tree while predicting the output summary. Each root node in the tree is a summary sentence, and the subtrees attached to it are sentences whose content relates to or explains the summary sentence. We design a new iterative refinement algorithm: it induces the trees through repeatedly refining the structures predicted by previous iterations. We demonstrate experimentally on two benchmark datasets that our summarizer performs competitively against state-of-the-art methods.

pdf bib
Question Answering by Reasoning Across Documents with Graph Convolutional Networks
Nicola De Cao | Wilker Aziz | Ivan Titov
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Most research in reading comprehension has focused on answering questions based on individual documents or even single paragraphs. We introduce a neural model which integrates and reasons relying on information spread within documents and across multiple documents. We frame it as an inference problem on a graph. Mentions of entities are nodes of this graph while edges encode relations between different mentions (e.g., within- and cross-document co-reference). Graph convolutional networks (GCNs) are applied to these graphs and trained to perform multi-step reasoning. Our Entity-GCN method is scalable and compact, and it achieves state-of-the-art results on a multi-document question answering dataset, WikiHop (Welbl et al., 2018).

2018

pdf bib
Proceedings of the 22nd Conference on Computational Natural Language Learning
Anna Korhonen | Ivan Titov
Proceedings of the 22nd Conference on Computational Natural Language Learning

pdf bib
Exploiting Semantics in Neural Machine Translation with Graph Convolutional Networks
Diego Marcheggiani | Jasmijn Bastings | Ivan Titov
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Semantic representations have long been argued as potentially useful for enforcing meaning preservation and improving generalization performance of machine translation methods. In this work, we are the first to incorporate information about predicate-argument structure of source sentences (namely, semantic-role representations) into neural machine translation. We use Graph Convolutional Networks (GCNs) to inject a semantic bias into sentence encoders and achieve improvements in BLEU scores over the linguistic-agnostic and syntax-aware versions on the English–German language pair.

pdf bib
Embedding Words as Distributions with a Bayesian Skip-gram Model
Arthur Bražinskas | Serhii Havrylov | Ivan Titov
Proceedings of the 27th International Conference on Computational Linguistics

We introduce a method for embedding words as probability densities in a low-dimensional space. Rather than assuming that a word embedding is fixed across the entire text collection, as in standard word embedding methods, in our Bayesian model we generate it from a word-specific prior density for each occurrence of a given word. Intuitively, for each word, the prior density encodes the distribution of its potential ‘meanings’. These prior densities are conceptually similar to the Gaussian embeddings of Vilnis and McCallum (2014). Interestingly, unlike the Gaussian embeddings, we can also obtain context-specific densities: they encode uncertainty about the sense of a word given its context and correspond to the approximate posterior distributions within our model. The context-dependent densities have many potential applications: for example, we show that they can be directly used in the lexical substitution task. We describe an effective estimation method based on the variational autoencoding framework. We demonstrate the effectiveness of our embedding technique on a range of standard benchmarks.

pdf bib
AMR Parsing as Graph Prediction with Latent Alignment
Chunchuan Lyu | Ivan Titov
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Abstract meaning representations (AMRs) are broad-coverage sentence-level semantic representations. AMRs represent sentences as rooted labeled directed acyclic graphs. AMR parsing is challenging partly due to the lack of annotated alignments between nodes in the graphs and words in the corresponding sentences. We introduce a neural parser which treats alignments as latent variables within a joint probabilistic model of concepts, relations and alignments. As exact inference requires marginalizing over alignments and is infeasible, we use the variational autoencoding framework and a continuous relaxation of the discrete alignments. We show that joint modeling is preferable to using a pipeline of align and parse. The parser achieves the best reported results on the standard benchmark (74.4% on LDC2016E25).

pdf bib
Context-Aware Neural Machine Translation Learns Anaphora Resolution
Elena Voita | Pavel Serdyukov | Rico Sennrich | Ivan Titov
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Standard machine translation systems process sentences in isolation and hence ignore extra-sentential information, even though extended context can both prevent mistakes in ambiguous cases and improve translation coherence. We introduce a context-aware neural machine translation model designed in such a way that the flow of information from the extended context to the translation model can be controlled and analyzed. We experiment with an English-Russian subtitles dataset, and observe that much of what is captured by our model deals with improving pronoun translation. We measure correspondences between induced attention distributions and coreference relations and observe that the model implicitly captures anaphora. This is consistent with gains for sentences where pronouns need to be gendered in translation. Besides improvements in anaphoric cases, the model also improves in overall BLEU, both over its context-agnostic version (+0.7) and over simple concatenation of the context and source sentences (+0.6).

pdf bib
Improving Entity Linking by Modeling Latent Relations between Mentions
Phong Le | Ivan Titov
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Entity linking involves aligning textual mentions of named entities to their corresponding entries in a knowledge base. Entity linking systems often exploit relations between textual mentions in a document (e.g., coreference) to decide if the linking decisions are compatible. Unlike previous approaches, which relied on supervised systems or heuristics to predict these relations, we treat relations as latent variables in our neural entity-linking model. We induce the relations without any supervision while optimizing the entity-linking system in an end-to-end fashion. Our multi-relational model achieves the best reported scores on the standard benchmark (AIDA-CoNLL) and substantially outperforms its relation-agnostic version. Its training also converges much faster, suggesting that the injected structural bias helps to explain regularities in the training data.

2017

pdf bib
Optimizing Differentiable Relaxations of Coreference Evaluation Metrics
Phong Le | Ivan Titov
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Coreference evaluation metrics are hard to optimize directly as they are non-differentiable functions, not easily decomposable into elementary decisions. Consequently, most approaches optimize objectives only indirectly related to the end goal, resulting in suboptimal performance. Instead, we propose a differentiable relaxation that lends itself to gradient-based optimisation, thus bypassing the need for reinforcement learning or heuristic modification of cross-entropy. We show that by modifying the training objective of a competitive neural coreference system, we obtain a substantial gain in performance. This suggests that our approach can be regarded as a viable alternative to using reinforcement learning or more computationally expensive imitation learning.

pdf bib
A Simple and Accurate Syntax-Agnostic Neural Model for Dependency-based Semantic Role Labeling
Diego Marcheggiani | Anton Frolov | Ivan Titov
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

We introduce a simple and accurate neural model for dependency-based semantic role labeling. Our model predicts predicate-argument dependencies relying on states of a bidirectional LSTM encoder. The semantic role labeler achieves competitive performance on English, even without any kind of syntactic information and only using local inference. However, when automatically predicted part-of-speech tags are provided as input, it substantially outperforms all previous local models and approaches the best reported results on the English CoNLL-2009 dataset. We also consider Chinese, Czech and Spanish where our approach also achieves competitive results. Syntactic parsers are unreliable on out-of-domain data, so standard (i.e., syntactically-informed) SRL models are hindered when tested in this setting. Our syntax-agnostic model appears more robust, resulting in the best reported results on standard out-of-domain test sets.

pdf bib
Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction
Ashutosh Modi | Ivan Titov | Vera Demberg | Asad Sayeed | Manfred Pinkal
Transactions of the Association for Computational Linguistics, Volume 5

Recent research in psycholinguistics has provided increasing evidence that humans predict upcoming content. Prediction also affects perception and might be a key to robustness in human language processing. In this paper, we investigate the factors that affect human prediction by building a computational model that can predict upcoming discourse referents based on linguistic knowledge alone vs. linguistic knowledge jointly with common-sense knowledge in the form of scripts. We find that script knowledge significantly improves model estimates of human predictions. In a second study, we test the highly controversial hypothesis that predictability influences referring expression type but do not find evidence for such an effect.

pdf bib
Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling
Diego Marcheggiani | Ivan Titov
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Semantic role labeling (SRL) is the task of identifying the predicate-argument structure of a sentence. It is typically regarded as an important step in the standard NLP pipeline. As the semantic representations are closely related to syntactic ones, we exploit syntactic information in our model. We propose a version of graph convolutional networks (GCNs), a recent class of neural networks operating on graphs, suited to model syntactic dependency graphs. GCNs over syntactic dependency trees are used as sentence encoders, producing latent feature representations of words in a sentence. We observe that GCN layers are complementary to LSTM ones: when we stack both GCN and LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009) both for Chinese and English.

pdf bib
Graph Convolutional Encoders for Syntax-aware Neural Machine Translation
Jasmijn Bastings | Ivan Titov | Wilker Aziz | Diego Marcheggiani | Khalil Sima’an
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We present a simple and effective approach to incorporating syntactic structure into neural attention-based encoder-decoder models for machine translation. We rely on graph-convolutional networks (GCNs), a recent class of neural networks developed for modeling graph-structured data. Our GCNs use predicted syntactic dependency trees of source sentences to produce representations of words (i.e. hidden states of the encoder) that are sensitive to their syntactic neighborhoods. GCNs take word representations as input and produce word representations as output, so they can easily be incorporated as layers into standard encoders (e.g., on top of bidirectional RNNs or convolutional neural networks). We evaluate their effectiveness with English-German and English-Czech translation experiments for different types of encoders and observe substantial improvements over their syntax-agnostic versions in all the considered setups.

pdf bib
Semantic Role Labeling
Diego Marcheggiani | Michael Roth | Ivan Titov | Benjamin Van Durme
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

This tutorial describes semantic role labeling (SRL), the task of mapping text to shallow semantic representations of eventualities and their participants. The tutorial introduces the SRL task and discusses recent research directions related to the task. The audience of this tutorial will learn about the linguistic background and motivation for semantic roles, and also about a range of computational models for this task, from early approaches to the current state-of-the-art. We will further discuss recently proposed variations to the traditional SRL task, including topics such as semantic proto-role labeling. We also cover techniques for reducing required annotation effort, such as methods exploiting unlabeled corpora (semi-supervised and unsupervised techniques), model adaptation across languages and domains, and methods for crowdsourcing semantic role annotation (e.g., question-answer driven SRL). We consider methods based on different machine learning paradigms, including neural networks, generative Bayesian models, graph-based algorithms and bootstrapping-style techniques. Beyond sentence-level SRL, we discuss work that involves semantic roles in discourse. In particular, we cover data sets and models related to the task of identifying implicit roles and linking them to discourse antecedents. We introduce different approaches to this task from the literature, including models based on coreference resolution, centering, and selectional preferences. We also review how new insights gained through them can be useful for the traditional SRL task.

2016

pdf bib
Bilingual Learning of Multi-sense Embeddings with Discrete Autoencoders
Simon Šuster | Ivan Titov | Gertjan van Noord
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Adapting to All Domains at Once: Rewarding Domain Invariance in SMT
Hoang Cuong | Khalil Sima’an | Ivan Titov
Transactions of the Association for Computational Linguistics, Volume 4

Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced subdomains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains. We observe consistent improvements over a baseline which does not explicitly reward domain invariance.

pdf bib
Discrete-State Variational Autoencoders for Joint Discovery and Factorization of Relations
Diego Marcheggiani | Ivan Titov
Transactions of the Association for Computational Linguistics, Volume 4

We present a method for unsupervised open-domain relation discovery. In contrast to previous (mostly generative and agglomerative clustering) approaches, our model relies on rich contextual features and makes minimal independence assumptions. The model is composed of two parts: a feature-rich relation extractor, which predicts a semantic relation between two entities, and a factorization model, which reconstructs arguments (i.e., the entities) relying on the predicted relation. The two components are estimated jointly so as to minimize errors in recovering arguments. We study factorization models inspired by previous work in relation factorization and selectional preference modeling. Our models substantially outperform the generative and agglomerative-clustering counterparts and achieve state-of-the-art performance.

pdf bib
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics
Claire Gardent | Raffaella Bernardi | Ivan Titov
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

2015

pdf bib
Unsupervised Induction of Semantic Roles within a Reconstruction-Error Minimization Framework
Ivan Titov | Ehsan Khoddam
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf bib
Improved Estimation of Entropy for Evaluation of Word Sense Induction
Linlin Li | Ivan Titov | Caroline Sporleder
Computational Linguistics, Volume 40, Issue 3 - September 2014

pdf bib
Cross-lingual Model Transfer Using Feature Representation Projection
Mikhail Kozhevnikov | Ivan Titov
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
A Hierarchical Bayesian Model for Unsupervised Induction of Script Knowledge
Lea Frermann | Ivan Titov | Manfred Pinkal
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Inducing Neural Models of Script Knowledge
Ashutosh Modi | Ivan Titov
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

2013

pdf bib
Predicting the Resolution of Referring Expressions from User Behavior
Nikos Engonopoulos | Martín Villalba | Ivan Titov | Alexander Koller
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Cross-lingual Transfer of Semantic Role Labeling Models
Mikhail Kozhevnikov | Ivan Titov
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations
Angeliki Lazaridou | Ivan Titov | Caroline Sporleder
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Multilingual Joint Parsing of Syntactic and Semantic Dependencies with a Latent Variable Model
James Henderson | Paola Merlo | Ivan Titov | Gabriele Musillo
Computational Linguistics, Volume 39, Issue 4 - December 2013

pdf bib
Semantic Role Labeling
Martha Palmer | Ivan Titov | Shumin Wu
NAACL HLT 2013 Tutorial Abstracts

pdf bib
Bootstrapping Semantic Role Labelers from Parallel Data
Mikhail Kozhevnikov | Ivan Titov
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2012

pdf bib
Crosslingual Induction of Semantic Roles
Ivan Titov | Alexandre Klementiev
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Unsupervised Induction of Frame-Semantic Representations
Ashutosh Modi | Ivan Titov | Alexandre Klementiev
Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure

pdf bib
A Bayesian Approach to Unsupervised Semantic Role Induction
Ivan Titov | Alexandre Klementiev
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Inducing Crosslingual Distributed Representations of Words
Alexandre Klementiev | Ivan Titov | Binod Bhattarai
Proceedings of COLING 2012

pdf bib
Semi-Supervised Semantic Role Labeling: Approaching from an Unsupervised Perspective
Ivan Titov | Alexandre Klementiev
Proceedings of COLING 2012

2011

pdf bib
Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
Ivan Titov
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
A Bayesian Model for Unsupervised Semantic Parsing
Ivan Titov | Alexandre Klementiev
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
Bootstrapping Semantic Analyzers from Non-Contradictory Texts
Ivan Titov | Mikhail Kozhevnikov
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure
Minwoo Jeong | Ivan Titov
Proceedings of the ACL 2010 Conference Short Papers

2009

pdf bib
A Latent Variable Model of Synchronous Syntactic-Semantic Parsing for Multiple Languages
Andrea Gesmundo | James Henderson | Paola Merlo | Ivan Titov
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

2008

pdf bib
A Joint Model of Text and Aspect Ratings for Sentiment Summarization
Ivan Titov | Ryan McDonald
Proceedings of ACL-08: HLT

pdf bib
A Latent Variable Model of Synchronous Parsing for Syntactic and Semantic Dependencies
James Henderson | Paola Merlo | Gabriele Musillo | Ivan Titov
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

pdf bib
A Latent Variable Model for Generative Dependency Parsing
Ivan Titov | James Henderson
Proceedings of the Tenth International Conference on Parsing Technologies

pdf bib
Constituent Parsing with Incremental Sigmoid Belief Networks
Ivan Titov | James Henderson
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

pdf bib
Fast and Robust Multilingual Dependency Parsing with a Generative Latent Variable Model
Ivan Titov | James Henderson
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib
Loss Minimization in Parse Reranking
Ivan Titov | James Henderson
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

pdf bib
Porting Statistical Parsers with Data-Defined Kernels
Ivan Titov | James Henderson
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

2005

pdf bib
Data-Defined Kernels for Parse Reranking Derived from Probabilistic Models
James Henderson | Ivan Titov
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)