Amaru Cuba Gyllensten


2019

R-grams: Unsupervised Learning of Semantic Units in Natural Language
Amaru Cuba Gyllensten | Ariel Ekgren | Magnus Sahlgren
Proceedings of the 13th International Conference on Computational Semantics - Student Papers

This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding techniques. In contrast to previous work, which has primarily focused on subword units for machine translation, we are interested in the general properties of such segments above the word level. We call these segments r-grams, and discuss their properties and the effect they have on the token frequency distribution. The proposed approach is evaluated by demonstrating its viability in embedding techniques, both in monolingual and multilingual test settings. We also provide a number of qualitative examples of the proposed methodology, demonstrating its viability as a language-invariant segmentation procedure.

2016

The Gavagai Living Lexicon
Magnus Sahlgren | Amaru Cuba Gyllensten | Fredrik Espinoza | Ola Hamfors | Jussi Karlgren | Fredrik Olsson | Per Persson | Akshay Viswanathan | Anders Holst
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.