Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics

Florian Kunneman, Uxoa Iñurrieta, John J. Camilleri, Mariona Coll Ardanuy (Editors)


Anthology ID:
E17-4
Month:
April
Year:
2017
Address:
Valencia, Spain
Venue:
EACL
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://www.aclweb.org/anthology/E17-4
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/E17-4.pdf

pdf bib
Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics
Florian Kunneman | Uxoa Iñurrieta | John J. Camilleri | Mariona Coll Ardanuy

pdf bib
Pragmatic descriptions of perceptual stimuli
Emiel van Miltenburg

This research proposal discusses pragmatic factors in image description, arguing that current automatic image description systems do not take these factors into account. I present a general model of the human image description process, and propose to study this process using corpus analysis, experiments, and computational modeling. This will lead to a better characterization of human image description behavior, providing a road map for future research in automatic image description, and the automatic description of perceptual stimuli in general.

pdf bib
Detecting spelling variants in non-standard texts
Fabian Barteld

Spelling variation in non-standard language, e.g. computer-mediated communication and historical texts, is usually treated as a deviation from a standard spelling, e.g. 2mr as an non-standard spelling for tomorrow. Consequently, in normalization – the standard approach of dealing with spelling variation – so-called non-standard words are mapped to their corresponding standard words. However, there is not always a corresponding standard word. This can be the case for single types (like emoticons in computer-mediated communication) or a complete language, e.g. texts from historical languages that did not develop to a standard variety. The approach presented in this thesis proposal deals with spelling variation in absence of reference to a standard. The task is to detect pairs of types that are variants of the same morphological word. An approach for spelling-variant detection is presented, where pairs of potential spelling variants are generated with Levenshtein distance and subsequently filtered by supervised machine learning. The approach is evaluated on historical Low German texts. Finally, further perspectives are discussed.

pdf bib
Replication issues in syntax-based aspect extraction for opinion mining
Edison Marrese-Taylor | Yutaka Matsuo

Reproducing experiments is an important instrument to validate previous work and build upon existing approaches. It has been tackled numerous times in different areas of science. In this paper, we introduce an empirical replicability study of three well-known algorithms for syntactic centric aspect-based opinion mining. We show that reproducing results continues to be a difficult endeavor, mainly due to the lack of details regarding preprocessing and parameter setting, as well as due to the absence of available implementations that clarify these details. We consider these are important threats to validity of the research on the field, specifically when compared to other problems in NLP where public datasets and code availability are critical validity components. We conclude by encouraging code-based research, which we think has a key role in helping researchers to understand the meaning of the state-of-the-art better and to generate continuous advances.

pdf bib
Discourse Relations and Conjoined VPs: Automated Sense Recognition
Valentina Pyatkin | Bonnie Webber

Sense classification of discourse relations is a sub-task of shallow discourse parsing. Discourse relations can occur both across sentences (inter-sentential) and within sentences (intra-sentential), and more than one discourse relation can hold between the same units. Using a newly available corpus of discourse-annotated intra-sentential conjoined verb phrases, we demonstrate a sequential classification pipeline for their multi-label sense classification. We assess the importance of each feature used in the classification, the feature scope, and what is lost in moving from gold standard manual parses to the output of an off-the-shelf parser.

pdf bib
Deception detection in Russian texts
Olga Litvinova | Pavel Seredin | Tatiana Litvinova | John Lyell

Humans are known to detect deception in speech randomly and it is therefore important to develop tools to enable them to detect deception. The problem of deception detection has been studied for a significant amount of time, however the last 10-15 years have seen methods of computational linguistics being employed. Texts are processed using different NLP tools and then classified as deceptive/truthful using machine learning methods. While most research has been performed for English, Slavic languages have never been a focus of detection deception studies. The paper deals with deception detection in Russian narratives. It employs a specially designed corpus of truthful and deceptive texts on the same topic from each respondent, N = 113. The texts were processed using Linguistic Inquiry and Word Count software that is used in most studies of text-based deception detection. The list of parameters computed using the software was expanded due to the designed users’ dictionaries. A variety of text classification methods was employed. The accuracy of the model was found to depend on the author’s gender and text type (deceptive/truthful).

pdf bib
A Computational Model of Human Preferences for Pronoun Resolution
Olga Seminck | Pascal Amsili

We present a cognitive computational model of pronoun resolution that reproduces the human interpretation preferences of the Subject Assignment Strategy and the Parallel Function Strategy. Our model relies on a probabilistic pronoun resolution system trained on corpus data. Factors influencing pronoun resolution are represented as features weighted by their relative importance. The importance the model gives to the preferences is in line with psycholinguistic studies. We demonstrate the cognitive plausibility of the model by running it on experimental items and simulating antecedent choice and reading times of human participants. Our model can be used as a new means to study pronoun resolution, because it captures the interaction of preferences.

pdf bib
Automatic Extraction of News Values from Headline Text
Alicja Piotrkowicz | Vania Dimitrova | Katja Markert

Headlines play a crucial role in attracting audiences’ attention to online artefacts (e.g. news articles, videos, blogs). The ability to carry out an automatic, large-scale analysis of headlines is critical to facilitate the selection and prioritisation of a large volume of digital content. In journalism studies news content has been extensively studied using manually annotated news values - factors used implicitly and explicitly when making decisions on the selection and prioritisation of news items. This paper presents the first attempt at a fully automatic extraction of news values from headline text. The news values extraction methods are applied on a large headlines corpus collected from The Guardian, and evaluated by comparing it with a manually annotated gold standard. A crowdsourcing survey indicates that news values affect people’s decisions to click on a headline, supporting the need for an automatic news values detection.

pdf bib
Assessing Convincingness of Arguments in Online Debates with Limited Number of Features
Lisa Andreevna Chalaguine | Claudia Schulz

We propose a new method in the field of argument analysis in social media to determining convincingness of arguments in online debates, following previous research by Habernal and Gurevych (2016). Rather than using argument specific feature values, we measure feature values relative to the average value in the debate, allowing us to determine argument convincingness with fewer features (between 5 and 35) than normally used for natural language processing tasks. We use a simple forward-feeding neural network for this task and achieve an accuracy of 0.77 which is comparable to the accuracy obtained using 64k features and a support vector machine by Habernal and Gurevych.

pdf bib
Zipf’s and Benford’s laws in Twitter hashtags
José Alberto Pérez Melián | J. Alberto Conejero | Cèsar Ferri Ramírez

Social networks have transformed communication dramatically in recent years through the rise of new platforms and the development of a new language of communication. This landscape requires new forms to describe and predict the behaviour of users in networks. This paper presents an analysis of the frequency distribution of hashtag popularity in Twitter conversations. Our objective is to determine if these frequency distribution follow some well-known frequency distribution that many real-life sets of numerical data satisfy. In particular, we study the similarity of frequency distribution of hashtag popularity with respect to Zipf’s law, an empirical law referring to the phenomenon that many types of data in social sciences can be approximated with a Zipfian distribution. Additionally, we also analyse Benford’s law, is a special case of Zipf’s law, a common pattern about the frequency distribution of leading digits. In order to compute correctly the frequency distribution of hashtag popularity, we need to correct many spelling errors that Twitter’s users introduce. For this purpose we introduce a new filter to correct hashtag mistake based on string distances. The experiments obtained employing datasets of Twitter streams generated under controlled conditions show that Benford’s law and Zipf’s law can be used to model hashtag frequency distribution.

pdf bib
A Multi-aspect Analysis of Automatic Essay Scoring for Brazilian Portuguese
Evelin Amorim | Adriano Veloso

Several methods for automatic essay scoring (AES) for English language have been proposed. However, multi-aspect AES systems for other languages are unusual. Therefore, we propose a multi-aspect AES system to apply on a dataset of Brazilian Portuguese essays, which human experts evaluated according to five aspects defined by Brazilian Government to the National Exam to High School Student (ENEM). These aspects are skills that student must master and every skill is assessed apart from each other. Besides the prediction of each aspect, the feature analysis also was performed for each aspect. The AES system proposed employs several features already employed by AES systems for English language. Our results show that predictions for some aspects performed well with the features we employed, while predictions for other aspects performed poorly. Also, it is possible to note the difference between the five aspects in the detailed feature analysis we performed. Besides these contributions, the eight millions of enrollments every year for ENEM raise some challenge issues for future directions in our research.

pdf bib
Literal or idiomatic? Identifying the reading of single occurrences of German multiword expressions using word embeddings
Rafael Ehren

Non-compositional multiword expressions (MWEs) still pose serious issues for a variety of natural language processing tasks and their ubiquity makes it impossible to get around methods which automatically identify these kind of MWEs. The method presented in this paper was inspired by Sporleder and Li (2009) and is able to discriminate between the literal and non-literal use of an MWE in an unsupervised way. It is based on the assumption that words in a text form cohesive units. If the cohesion of these units is weakened by an expression, it is classified as literal, and otherwise as idiomatic. While Sporleder an Li used Normalized Google Distance to modell semantic similarity, the present work examines the use of a variety of different word embeddings.

pdf bib
Evaluating the Reliability and Interaction of Recursively Used Feature Classes for Terminology Extraction
Anna Hätty | Michael Dorna | Sabine Schulte im Walde

Feature design and selection is a crucial aspect when treating terminology extraction as a machine learning classification problem. We designed feature classes which characterize different properties of terms based on distributions, and propose a new feature class for components of term candidates. By using random forests, we infer optimal features which are later used to build decision tree classifiers. We evaluate our method using the ACL RD-TEC dataset. We demonstrate the importance of the novel feature class for downgrading termhood which exploits properties of term components. Furthermore, our classification suggests that the identification of reliable term candidates should be performed successively, rather than just once.