Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12)
Goran Glavaš

Swapna Somasundaran

Martin Riedl

Eduard Hovy
Scientific Discovery as Link Prediction in Influence and Citation Graphs
Fan Luo

Marco A. Valenzuela-Escárcega

Gus Hahn-Powell

Mihai Surdeanu
We introduce a machine learning approach for the identification of “white spaces” in scientific knowledge. Our approach addresses this task as link prediction over a graph that contains over 2M influence statements such as “CTCF activates FOXA1”, which were automatically extracted using open-domain machine reading. We model this prediction task using graph-based features extracted from the above influence graph, as well as from a citation graph that captures scientific communities. We evaluated the proposed approach through backtesting. Although the data is heavily unbalanced (50 times more negative examples than positive ones), our approach predicts which influence links will be discovered in the “near future” with an F1 score of 27 points and a mean average precision of 68%.
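The link-prediction setup described above can be sketched with a toy influence graph; the entity names and the feature set below are illustrative, not the paper's exact graph or features:

```python
# Toy illustration of graph-based link prediction over an influence graph,
# assuming NetworkX; entities and features are made up for the sketch.
import networkx as nx

# "A activates B" becomes a directed edge A -> B.
G = nx.DiGraph()
G.add_edges_from([
    ("CTCF", "FOXA1"), ("CTCF", "MYC"),
    ("FOXA1", "ESR1"), ("MYC", "ESR1"),
])

def link_features(g, u, v):
    """Simple graph features for a candidate influence link (u, v)."""
    und = g.to_undirected()
    common = len(set(und[u]) & set(und[v]))        # shared neighbors
    try:
        dist = nx.shortest_path_length(und, u, v)  # undirected graph distance
    except nx.NetworkXNoPath:
        dist = -1                                  # disconnected pair
    return {"common_neighbors": common, "distance": dist,
            "out_deg_u": g.out_degree(u), "in_deg_v": g.in_degree(v)}

feats = link_features(G, "CTCF", "ESR1")
print(feats)
```

Feature vectors of this kind, computed for candidate pairs, could then be fed to any off-the-shelf classifier for the backtesting evaluation.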
Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph
Zheng Zhang

Pierre Zweigenbaum

Ruiqing Yin
Corpus2graph is an open-source, NLP-application-oriented tool that generates a word co-occurrence network from a large corpus. It not only contains different built-in methods to preprocess words, analyze sentences, extract word pairs and define edge weights, but also supports user-customized functions. By using parallelization techniques, it can generate a large word co-occurrence network of the whole English Wikipedia within hours. Thanks to its nodes-edges-weight three-level progressive calculation design, rebuilding networks with different configurations is even faster, as it does not need to start over from scratch. The tool also works with other graph libraries such as igraph, NetworkX and graph-tool, acting as a front end that provides data to speed up network generation.
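The kind of graph corpus2graph produces can be sketched directly in NetworkX; this does not use corpus2graph's own API, and the window size and two-sentence corpus are made up:

```python
# Minimal word co-occurrence network: count word pairs appearing within a
# sliding window, with edge weight = co-occurrence count. Illustrative only.
from collections import Counter
import networkx as nx

def cooccurrence_graph(sentences, window=2):
    """Build a weighted co-occurrence graph from tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + window + 1]:  # pairs within the window
                if w != v:
                    counts[tuple(sorted((w, v)))] += 1
    g = nx.Graph()
    g.add_weighted_edges_from((u, v, c) for (u, v), c in counts.items())
    return g

corpus = [["graphs", "model", "text"], ["graphs", "encode", "text"]]
G = cooccurrence_graph(corpus, window=2)
print(G["graphs"]["text"]["weight"])
```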
Multi-hop Inference for Sentence-level TextGraphs: How Challenging is Meaningfully Combining Information for Science Question Answering?
Peter Jansen
Question Answering for complex questions is often modelled as a graph construction or traversal task, where a solver must build or traverse a graph of facts that answer and explain a given question. This “multi-hop” inference has been shown to be extremely challenging, with few models able to aggregate more than two facts before being overwhelmed by “semantic drift”, or the tendency for long chains of facts to quickly drift off topic. This is a major barrier to current inference models, as even elementary science questions require an average of 4 to 6 facts to answer and explain. In this work we empirically characterize the difficulty of building or traversing a graph of sentences connected by lexical overlap, by evaluating chance sentence aggregation quality through 9,784 manually-annotated judgements across knowledge graphs built from three free-text corpora (including study guides and Simple Wikipedia). We demonstrate that semantic drift tends to be high and aggregation quality low, at between 0.04% and 3%, and highlight scenarios that maximize the likelihood of meaningfully combining information.
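A sentence graph connected by lexical overlap, of the kind studied above, can be sketched as follows; the three sentences, tokenizer, and overlap threshold are illustrative stand-ins, not the paper's corpora:

```python
# Toy sentence graph: nodes are sentences, edges connect sentences that share
# at least `min_shared` surface tokens. A bare split() stands in for real
# tokenization in this sketch.
import networkx as nx

sentences = [
    "plants need sunlight to grow",
    "sunlight provides energy for plants",
    "rocks are classified by mineral content",
]

def build_overlap_graph(sents, min_shared=1):
    g = nx.Graph()
    g.add_nodes_from(range(len(sents)))
    toks = [set(s.split()) for s in sents]
    for i in range(len(sents)):
        for j in range(i + 1, len(sents)):
            shared = toks[i] & toks[j]
            if len(shared) >= min_shared:
                g.add_edge(i, j, shared=sorted(shared))
    return g

G = build_overlap_graph(sentences)
print(G.edges(data=True))
```

A multi-hop solver would then aggregate facts by walking such edges, which is where semantic drift (hopping to a lexically similar but off-topic sentence) becomes a risk.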
Multi-Sentence Compression with Word Vertex-Labeled Graphs and Integer Linear Programming
Elvys Linhares Pontes

Stéphane Huet

Thiago Gouveia da Silva

Andréa Carneiro Linhares

Juan-Manuel Torres-Moreno
Multi-Sentence Compression (MSC) aims to generate a short sentence with key information from a cluster of closely related sentences. MSC enables summarization and question-answering systems to generate outputs combining fully formed sentences from one or several documents. This paper describes a new Integer Linear Programming method for MSC that uses a vertex-labeled graph to select different keywords, and novel 3-gram scores to generate more informative sentences while maintaining their grammaticality. Our system produces compressions of good quality and outperforms the state of the art in evaluations on news datasets. We conducted both automatic and manual evaluations to determine the informativeness and the grammaticality of compressions for each dataset. Additional tests, which take advantage of the fact that the length of compressions can be modulated, further improve ROUGE scores with shorter output sentences.
Large-scale spectral clustering using diffusion coordinates on landmark-based bipartite graphs
Khiem Pham

Guangliang Chen
Spectral clustering has received a lot of attention due to its ability to separate non-convex, non-intersecting manifolds, but its high computational complexity has significantly limited its applicability. Motivated by the document-term co-clustering framework of Dhillon (2001), we propose a landmark-based scalable spectral clustering approach in which we first use the selected landmark set and the given data to form a bipartite graph, and then run a diffusion process on it to obtain a family of diffusion coordinates for clustering. We show that our proposed algorithm can be implemented using very efficient operations on the affinity matrix between the given data and the selected landmarks, and is thus capable of handling large data. Finally, we demonstrate the excellent performance of our method by comparing it with state-of-the-art scalable algorithms on several benchmark data sets.
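The landmark-based idea can be sketched schematically in NumPy: form the data-to-landmark affinity, normalize it as a bipartite graph, and embed via its singular vectors. The kernel width, landmark choice, and thresholding step below are simplifications, not the paper's full diffusion algorithm:

```python
# Schematic landmark-based spectral embedding on two synthetic clusters.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)),   # cluster near the origin
               rng.normal(3, 0.1, (20, 2))])  # cluster near (3, 3)
landmarks = X[::10]                           # every 10th point as a landmark

# Gaussian affinity between all n points and the m landmarks (n x m).
d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
A = np.exp(-d2 / (2 * 0.5 ** 2))

# Degree-normalize the bipartite affinity, then embed via its SVD; this only
# ever touches the small n x m matrix, which is the source of the scalability.
D1 = A.sum(1, keepdims=True)
D2 = A.sum(0, keepdims=True)
A_norm = A / np.sqrt(D1) / np.sqrt(D2)
U, s, _ = np.linalg.svd(A_norm, full_matrices=False)
coords = U[:, 1]                              # an informative singular vector
labels = (coords > coords.mean()).astype(int)
print(labels)
```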
Efficient Graph-based Word Sense Induction by Distributional Inclusion Vector Embeddings
Haw-Shiuan Chang

Amol Agrawal

Ananya Ganesh

Anirudha Desai

Vinayak Mathur

Alfred Hough

Andrew McCallum
Word sense induction (WSI), which addresses polysemy by unsupervised discovery of multiple word senses, resolves ambiguities for downstream NLP tasks and also makes word representations more interpretable. This paper proposes an accurate and efficient graph-based method for WSI that builds a global non-negative vector embedding basis (which are interpretable like topics) and clusters the basis indexes in the ego network of each polysemous word. By adopting distributional inclusion vector embeddings as our basis formation model, we avoid the expensive step of nearest neighbor search that plagues other graph-based methods without sacrificing the quality of sense clusters. Experiments on three datasets show that our proposed method produces similar or better sense clusters and embeddings compared with previous state-of-the-art methods while being significantly more efficient.
Fusing Document, Collection and Label Graph-based Representations with Word Embeddings for Text Classification
Konstantinos Skianis

Fragkiskos Malliaros

Michalis Vazirgiannis
Contrary to the traditional Bag-of-Words approach, we consider the Graph-of-Words (GoW) model in which each document is represented by a graph that encodes relationships between the different terms. Based on this formulation, the importance of a term is determined by weighting the corresponding node in the document, collection and label graphs, using node centrality criteria. We also introduce novel graph-based weighting schemes by enriching graphs with word-embedding similarities, in order to reward or penalize semantic relationships. Our methods produce more discriminative feature weights for text categorization, outperforming existing frequency-based criteria.
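The graph-of-words idea can be sketched as follows: connect terms that co-occur within a sliding window, then score each term by node centrality instead of raw frequency. The window size, document, and choice of degree centrality are illustrative, not the paper's exact weighting schemes:

```python
# Minimal graph-of-words term weighting via node centrality.
import networkx as nx

doc = "graph methods encode term relationships better than raw term frequency".split()

def graph_of_words(tokens, window=3):
    """Connect each token to the following (window - 1) tokens."""
    g = nx.Graph()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if w != v:
                g.add_edge(w, v)
    return g

G = graph_of_words(doc)
weights = nx.degree_centrality(G)   # centrality-based term weights
top = max(weights, key=weights.get)
print(top)
```

Note how the repeated word "term" collapses into a single well-connected node, so centrality rewards it for appearing in many distinct contexts rather than merely for its count.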
Embedding Text in Hyperbolic Spaces
Bhuwan Dhingra

Christopher Shallue

Mohammad Norouzi

Andrew Dai

George Dahl
Natural language text exhibits hierarchical structure in a variety of respects. Ideally, we could incorporate our prior knowledge of this hierarchical structure into unsupervised learning algorithms that work on text data. Recent work by Nickel and Kiela (2017) proposed using hyperbolic instead of Euclidean embedding spaces to represent hierarchical data and demonstrated encouraging results when embedding graphs. In this work, we extend their method with a reparameterization technique that allows us to learn hyperbolic embeddings of arbitrarily parameterized objects. We apply this framework to learn word and sentence embeddings in hyperbolic space in an unsupervised manner from text corpora. The resulting embeddings seem to encode certain intuitive notions of hierarchy, such as word-context frequency and phrase constituency. However, the implicit continuous hierarchy in the learned hyperbolic space makes interrogating the model’s learned hierarchies more difficult than for models that learn explicit edges between items. The learned hyperbolic embeddings show improvements over Euclidean embeddings in some, but not all, downstream tasks, suggesting that hierarchical organization is more useful for some tasks than others.
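The reparameterization idea can be sketched as mapping an unconstrained vector into the open Poincaré ball by squashing its norm, so that standard gradient methods never leave the ball. This is purely illustrative; the paper's exact parameterization may differ:

```python
# Sketch: unconstrained vector -> point in the Poincaré ball, plus the
# standard hyperbolic distance in that model.
import numpy as np

def to_poincare_ball(v, eps=1e-7):
    """Map unconstrained v in R^d to a point with norm strictly below 1."""
    norm = np.linalg.norm(v)
    direction = v / (norm + eps)   # unit direction (eps guards v = 0)
    radius = np.tanh(norm)         # squash the norm into [0, 1)
    return radius * direction

def poincare_distance(x, y):
    """Hyperbolic distance between two points in the Poincaré ball."""
    num = np.sum((x - y) ** 2)
    den = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + 2 * num / den)

x = to_poincare_ball(np.array([0.5, 0.0]))
y = to_poincare_ball(np.array([5.0, 0.0]))
print(poincare_distance(x, y))
```

Because distances blow up near the boundary, points pushed outward by large unconstrained norms become exponentially far from the origin, which is what lets the ball encode tree-like hierarchy.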