Proceedings of the Fourth Workshop on Structured Prediction for NLP
Priyanka Agrawal

Zornitsa Kozareva

Julia Kreutzer

Gerasimos Lampouras

André Martins

Sujith Ravi

Andreas Vlachos
Syntax-driven Iterative Expansion Language Models for Controllable Text Generation
Noe Casas

José A. R. Fonollosa

Marta R. Costa-jussà
The dominant language modeling paradigm handles text as a sequence of discrete tokens. While that approach can capture the latent structure of the text, it is inherently constrained to sequential dynamics for text generation. We propose a new paradigm for introducing a syntactic inductive bias into neural text generation, where the dependency parse tree is used to drive the Transformer model to generate sentences iteratively. Our experiments show that this paradigm is effective at text generation, with quality between LSTMs and Transformers, and comparable diversity, requiring less than half their decoding steps, and its generation process allows direct control over the syntactic constructions of the generated text, enabling the induction of stylistic variations.
CopyNext: Explicit Span Copying and Alignment in Sequence to Sequence Models
Abhinav Singh

Patrick Xia

Guanghui Qin

Mahsa Yarmohammadi

Benjamin Van Durme
Copy mechanisms are employed in sequence to sequence (seq2seq) models to generate reproductions of words from the input to the output. These frameworks, operating at the lexical type level, fail to provide an explicit alignment that records where each token was copied from. Further, they require contiguous token sequences from the input (spans) to be copied individually. We present a model with an explicit token-level copy operation and extend it to copying entire spans. Our model provides hard alignments between spans in the input and output, allowing for non-traditional applications of seq2seq, like information extraction. We demonstrate the approach on Nested Named Entity Recognition, achieving near state-of-the-art accuracy with an order of magnitude increase in decoding speed.
Generating Synthetic Data for Task-Oriented Semantic Parsing with Hierarchical Representations
Ke Tran

Ming Tan
Modern conversational AI systems support natural language understanding for a wide variety of capabilities. While a majority of these tasks can be accomplished using a simple and flat representation of intents and slots, more sophisticated capabilities require complex hierarchical representations supported by semantic parsing. State-of-the-art semantic parsers are trained using supervised learning with data labeled according to a hierarchical schema, which might be costly to obtain or not readily available for a new domain. In this work, we explore the possibility of generating synthetic data for neural semantic parsing using a pretrained denoising sequence-to-sequence model (i.e., BART). Specifically, we first extract masked templates from the existing labeled utterances, and then fine-tune BART to generate synthetic utterances conditioned on the extracted templates. Finally, we use an auxiliary parser (AP) to filter the generated utterances; the AP guarantees the quality of the generated data. We show the potential of our approach when evaluating on the Facebook TOP dataset for the navigation domain.
Structured Prediction for Joint Class Cardinality and Entity Property Inference in Model-Complete Text Comprehension
Hendrik ter Horst

Philipp Cimiano
Model-complete text comprehension aims at interpreting a natural language text with respect to a semantic domain model describing the classes and their properties relevant for the domain in question. Solving this task can be approached as a structured prediction problem, consisting in inferring the most probable instance of the semantic model given the text. In this work, we focus on the challenging subproblem of cardinality prediction, which consists in predicting the number of distinct individuals of each class in the semantic model. We show that cardinality prediction can successfully be approached by modeling the overall task as a joint inference problem, predicting the number of individuals of certain classes while at the same time extracting their properties. We approach this task with probabilistic graphical models computing the maximum-a-posteriori instance of the semantic model. Our main contribution lies in the empirical investigation and analysis of different approximate inference strategies based on Gibbs sampling. We present and evaluate our models on the task of extracting key parameters from scientific full-text articles describing preclinical studies in the domain of spinal cord injury.
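The MAP-by-sampling idea in this abstract can be illustrated with a toy Gibbs-style sampler. This is a minimal sketch under invented assumptions: the two binary variables and the scoring function are made up for illustration and are far simpler than the paper's factor graphs over cardinalities and entity properties.

```python
import random

def score(state):
    # Toy joint score: reward agreement between the two variables,
    # standing in for a factor linking cardinality and property choices.
    return 2.0 if state["a"] == state["b"] else 0.5

def gibbs_map(variables, score, steps=200, seed=0):
    """Track the best-scoring state seen while resampling one
    variable at a time proportionally to the joint score."""
    rng = random.Random(seed)
    state = {v: rng.choice([0, 1]) for v in variables}
    best = dict(state)
    for _ in range(steps):
        v = rng.choice(variables)
        weights = []
        for val in (0, 1):
            state[v] = val
            weights.append(score(state))
        total = sum(weights)
        # Resample v from its conditional distribution.
        state[v] = 0 if rng.random() < weights[0] / total else 1
        if score(state) > score(best):
            best = dict(state)
    return best

best = gibbs_map(["a", "b"], score)
```

A real model would replace `score` with the product of factors in the graphical model; the sampler itself stays the same shape.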
Energy-based Neural Modelling for Large-Scale Multiple Domain Dialogue State Tracking
Anh Duong Trinh

Robert J. Ross

John D. Kelleher
Scaling up dialogue state tracking to multiple domains is challenging due to the growth in the number of variables being tracked. Furthermore, dialogue state tracking models do not yet explicitly make use of relationships between dialogue variables, such as slots across domains. We propose using energy-based structured prediction methods for the large-scale dialogue state tracking task on two multiple-domain dialogue datasets. Our results indicate that: (i) modelling variable dependencies yields better results; and (ii) the structured prediction output aligns with the dialogue slot-value constraint principles. This points to promising directions for improving state-of-the-art models by incorporating variable dependencies into their prediction process.
End-to-End Extraction of Structured Information from Business Documents with Pointer-Generator Networks
Clément Sage

Alex Aussem

Véronique Eglin

Haytham Elghazel

Jérémy Espinas
The predominant approaches for extracting key information from documents resort to classifiers predicting the information type of each word. However, the word-level ground truth used for learning is expensive to obtain since it is not naturally produced by the extraction task. In this paper, we discuss a new method for training extraction models directly from the textual value of information. The extracted information of a document is represented as a sequence of tokens in the XML language. We learn to output this representation with a pointer-generator network that alternately copies the document words carrying information and generates the XML tags delimiting the types of information. The ability of our end-to-end method to retrieve structured information is assessed on a large set of business documents. We show that it performs competitively with a standard word classifier without requiring costly word-level supervision.
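The copy-versus-generate mixture at the heart of a pointer-generator decoder can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the names `p_gen`, `attn`, and `vocab_dist`, and the example tokens, are all invented; a real model produces these quantities from a neural decoder.

```python
def pointer_generator_dist(vocab_dist, attn, src_tokens, p_gen):
    """Blend the decoder's vocabulary distribution with a copy
    distribution given by attention over source tokens.

    vocab_dist: dict token -> probability from the decoder softmax
    attn:       attention weights over source positions (sum to 1)
    src_tokens: source tokens aligned with attn
    p_gen:      scalar in [0, 1], probability of generating vs copying
    """
    # Generation part: scale the softmax by p_gen.
    out = {tok: p_gen * p for tok, p in vocab_dist.items()}
    # Copy part: route (1 - p_gen) mass through attention weights.
    for a, tok in zip(attn, src_tokens):
        out[tok] = out.get(tok, 0.0) + (1.0 - p_gen) * a
    return out

# Toy step: the decoder can emit an XML tag from the vocabulary or
# copy a document word such as an amount from the source.
dist = pointer_generator_dist(
    vocab_dist={"<total>": 0.7, "the": 0.3},
    attn=[0.9, 0.1],
    src_tokens=["42.50", "total"],
    p_gen=0.6,
)
```

The output is still a proper distribution: the generation and copy parts contribute `p_gen` and `1 - p_gen` of the total mass respectively.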
Layer-wise Guided Training for BERT: Learning Incrementally Refined Document Representations
Nikolaos Manginas

Ilias Chalkidis

Prodromos Malakasiotis
Although BERT is widely used by the NLP community, little is known about its inner workings. Several attempts have been made to shed light on certain aspects of BERT, often with contradicting conclusions. A much-raised concern focuses on BERT's over-parameterization and under-utilization issues. To this end, we propose a novel approach to fine-tune BERT in a structured manner. Specifically, we focus on Large-Scale Multi-label Text Classification (LMTC), where documents are assigned one or more labels from a large predefined set of hierarchically organized labels. Our approach guides specific BERT layers to predict labels from specific hierarchy levels. Experimenting with two LMTC datasets, we show that this structured fine-tuning approach not only yields better classification results but also leads to better parameter utilization.
Improving Joint Training of Inference Networks and Structured Prediction Energy Networks
Lifu Tu

Richard Yuanzhe Pang

Kevin Gimpel
Deep energy-based models are powerful, but pose challenges for learning and inference (Belanger and McCallum, 2016). Tu and Gimpel (2018) developed an efficient framework for energy-based models by training “inference networks” to approximate structured inference instead of using gradient descent. However, their alternating optimization approach suffers from instabilities during training, requiring additional loss terms and careful hyperparameter tuning. In this paper, we contribute several strategies to stabilize and improve this joint training of energy functions and inference networks for structured prediction. We design a compound objective to jointly train both cost-augmented and test-time inference networks along with the energy function. We propose joint parameterizations for the inference networks that encourage them to capture complementary functionality during learning. We empirically validate our strategies on two sequence labeling tasks, showing easier paths to strong performance than prior work, as well as further improvements with global energy terms.
Reading the Manual: Event Extraction as Definition Comprehension
Yunmo Chen

Tongfei Chen

Seth Ebner

Aaron Steven White

Benjamin Van Durme
We ask whether text understanding has progressed to where we may extract event information through incremental refinement of bleached statements derived from annotation manuals. Such a capability would allow for the trivial construction and extension of an extraction framework by intended endusers through declarations such as, “Some person was born in some location at some time.” We introduce an example of a model that employs such statements, with experiments illustrating we can extract events under closed ontologies and generalize to unseen event types simply by reading new definitions.
On the Discrepancy between Density Estimation and Sequence Generation
Jason Lee

Dustin Tran

Orhan Firat

Kyunghyun Cho
Many sequence-to-sequence generation tasks, including machine translation and text-to-speech, can be posed as estimating the density of the output y given the input x: p(y | x). Given this interpretation, it is natural to evaluate sequence-to-sequence models using conditional log-likelihood on a test set. However, the goal of sequence-to-sequence generation (or structured prediction) is to find the best output ŷ given an input x, and each task has its own downstream metric R that scores a model output by comparing against a set of references y*: R(ŷ, y* | x). While we hope that a model that excels in density estimation also performs well on the downstream metric, the exact correlation has not been studied for sequence generation tasks. In this paper, by comparing several density estimators on five machine translation tasks, we find that the correlation between rankings of models based on log-likelihood and BLEU varies significantly depending on the range of the model families being compared. First, log-likelihood is highly correlated with BLEU when we consider models within the same family (e.g. autoregressive models, or latent variable models with the same parameterization of the prior). However, we observe no correlation between rankings of models across different families: (1) among non-autoregressive latent variable models, a flexible prior distribution is better at density estimation but gives worse generation quality than a simple prior, and (2) autoregressive models offer the best translation performance overall, while latent variable models with a normalizing flow prior give the highest held-out log-likelihood across all datasets. Therefore, we recommend using a simple prior for the latent variable non-autoregressive model when fast generation speed is desired.
Log-Linear Reformulation of the Noisy Channel Model for Document-Level Neural Machine Translation
Sébastien Jean

Kyunghyun Cho
We seek to maximally use various data sources, such as parallel and monolingual data, to build an effective and efficient document-level translation system. In particular, we start by considering a noisy channel approach (CITATION) that combines a target-to-source translation model and a language model. By applying Bayes’ rule strategically, we reformulate this approach as a log-linear combination of translation, sentence-level and document-level language model probabilities. In addition to using static coefficients for each term, this formulation alternatively allows for the learning of dynamic per-token weights to more finely control the impact of the language models. Using either static or dynamic coefficients leads to improvements over a context-agnostic baseline and a context-aware concatenation model.
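The Bayes'-rule reformulation the abstract describes can be sketched generically. This is one plausible form under stated assumptions: the λ coefficients and the exact factorization into sentence-level and document-level language model terms are illustrative, not necessarily the paper's formulation.

```latex
% Noisy channel: choose the target y that maximizes p(y | x)
\hat{y} = \arg\max_{y} p(y \mid x)
        = \arg\max_{y} \frac{p(x \mid y)\, p(y)}{p(x)}
        = \arg\max_{y} \bigl[ \log p(x \mid y) + \log p(y) \bigr]

% A log-linear combination with document context d, where the
% lambda coefficients are either static scalars or learned
% per-token weights:
\hat{y} = \arg\max_{y} \bigl[ \lambda_{\mathrm{tm}} \log p(x \mid y)
        + \lambda_{\mathrm{sent}} \log p_{\mathrm{sent}}(y)
        + \lambda_{\mathrm{doc}} \log p_{\mathrm{doc}}(y \mid d) \bigr]
```

Dropping the denominator p(x) is valid because it does not depend on y; the coefficients then generalize the equal weighting that exact Bayes' rule would impose.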
Deeply Embedded Knowledge Representation & Reasoning For Natural Language Question Answering: A Practitioner’s Perspective
Arindam Mitra

Sanjay Narayana

Chitta Baral
Successful application of Knowledge Representation and Reasoning (KR) in Natural Language Understanding (NLU) is largely limited by the availability of a robust and general-purpose natural language parser. Even though several projects have been launched in pursuit of developing a universal meaning representation language, an accurate universal parser is far from reality. This has severely limited the application of KR in the field of NLP and also prevented a proper evaluation of KR-based NLU systems. Our goal is to build KR-based systems for Natural Language Understanding without relying on a parser. Towards this, we propose a method named Deeply Embedded Knowledge Representation & Reasoning (DeepEKR) where we replace the parser by a neural network, soften the symbolic representation so that a deterministic mapping exists between the parser neural network and the interpretable logical form, and finally replace the symbolic solver by an equivalent neural network, so the model can be trained end-to-end. We evaluate our method on the task of Qualitative Word Problem Solving on the two available datasets (QuaRTz and QuaRel). Our system matches the state-of-the-art accuracy on QuaRTz, outperforms the state-of-the-art on QuaRel, and substantially outperforms a traditional KR-based system. The results show that the bias introduced by a KR solution does not prevent it from doing a better job at the end task. Moreover, our method is interpretable due to the bias introduced by the KR approach.