Similarity-Based Alignment of Monolingual Corpora for Text Simplification Purposes

Sarah Albertsson, Evelina Rennes, Arne Jönsson


Abstract
Comparable or parallel corpora are beneficial for many NLP tasks. The automatic collection of corpora enables large-scale resources, even for less-resourced languages, which in turn can be useful for deducing rules and patterns for text rewriting algorithms, a subtask of automatic text simplification. We present two methods for the alignment of Swedish easy-to-read text segments to text segments from a reference corpus. The first method (M1) was originally developed for the task of text reuse detection, measuring sentence similarity by a modified version of a TF-IDF vector space model. A second method (M2), also accounting for part-of-speech tags, was developed, and the methods were compared. For evaluation, a crowdsourcing platform was built for human judgement data collection, and preliminary results showed that cosine similarity relates better to human ranks than the Dice coefficient. We also saw a tendency that including syntactic context to the TF-IDF vector space model is beneficial for this kind of paraphrase alignment task.
Anthology ID:
W16-4118
Volume:
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
Month:
December
Year:
2016
Address:
Osaka, Japan
Venues:
CL4LC | WS
SIG:
Publisher:
The COLING 2016 Organizing Committee
Note:
Pages:
154–163
Language:
URL:
https://www.aclweb.org/anthology/W16-4118
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/W16-4118.pdf