Marcos Garcia


2019

Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations.
Marcos Garcia | Marcos García Salido | Susana Sotelo | Estela Mosqueira | Margarita Alonso-Ramos
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

This paper presents a new multilingual corpus with semantic annotation of collocations in English, Portuguese, and Spanish. The whole resource contains 155k tokens and 1,526 collocations labeled in context. The annotated examples belong to three syntactic relations (adjective-noun, verb-object, and nominal compounds), and represent 58 lexical functions in the Meaning-Text Theory (e.g., Oper, Magn, Bon, etc.). Each collocation was annotated by three linguists and the final resource was revised by a team of experts. The resulting corpus can serve as a basis to evaluate different approaches for collocation identification, which in turn can be useful for different NLP tasks such as natural language understanding or natural language generation.

A Method to Automatically Identify Diachronic Variation in Collocations.
Marcos Garcia | Marcos García Salido
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

This paper introduces a novel method to track collocational variations in diachronic corpora that can identify several changes undergone by these phraseological combinations and propose alternative solutions found in later periods. The strategy consists of extracting syntactically related candidates of collocations and ranking them using statistical association measures. Then, starting from the first period of the corpus, the system tracks each combination over time, verifying different types of historical variation such as the loss of one or both lemmas, the disappearance of the collocation, or its diachronic frequency trend. Using a distributional semantics strategy, it also proposes other linguistic structures which convey similar meanings to those extinct collocations. A case study on historical corpora of Portuguese and Spanish shows that the system speeds up and facilitates the finding of some diachronic changes and phraseological shifts that are harder to identify without using automated methods.
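
The tracking step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the period labels, the placeholder lemmas, the counts, and the function name `track` are all invented for illustration, and the distributional search for alternative expressions is left out.

```python
# Hypothetical per-period data: every label, lemma and count below is invented.
periods = ["period_1", "period_2", "period_3"]

# Frequency of each candidate collocation in each period.
colloc_freq = {
    ("verb_a", "noun_x"): [12, 5, 0],
    ("verb_b", "noun_y"): [3, 8, 15],
}

# Frequency of each lemma on its own in each period.
lemma_freq = {
    "verb_a": [900, 950, 1000],
    "noun_x": [40, 10, 0],
    "verb_b": [700, 720, 800],
    "noun_y": [50, 90, 140],
}


def track(pair, colloc_freq, lemma_freq):
    """Compare the first and last period for one candidate collocation."""
    counts = colloc_freq[pair]
    first, last = counts[0], counts[-1]

    # Coarse frequency trend between the first and the last period.
    if last > first:
        trend = "rising"
    elif last < first:
        trend = "falling"
    else:
        trend = "stable"

    if last > 0:
        return {"status": "attested", "trend": trend, "lost_lemmas": []}

    # The combination is gone: check whether its lemmas vanished as well.
    lost = [lemma for lemma in pair if lemma_freq[lemma][-1] == 0]
    return {"status": "disappeared", "trend": trend, "lost_lemmas": lost}


for pair in colloc_freq:
    print(pair, track(pair, colloc_freq, lemma_freq))
```

Comparing only the first and last period keeps the sketch short; a fuller version would examine every intermediate period.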

Unsupervised Compositional Translation of Multiword Expressions
Pablo Gamallo | Marcos Garcia
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

This article describes a dependency-based strategy that uses compositional distributional semantics and cross-lingual word embeddings to translate multiword expressions (MWEs). Our unsupervised approach performs translation as a process of word contextualization by taking into account lexico-syntactic contexts and selectional preferences. This strategy is suited to translate phraseological combinations and phrases whose constituent words are lexically restricted by each other. Several experiments on adjective-noun and verb-object compounds show that mutual contextualization (co-compositionality) clearly outperforms other compositional methods. The paper also contributes a new freely available dataset of English-Spanish MWEs used to validate the proposed compositional strategy.

A comparison of statistical association measures for identifying dependency-based collocations in various languages.
Marcos Garcia | Marcos García Salido | Margarita Alonso-Ramos
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

This paper presents an exploration of different statistical association measures to automatically identify collocations from corpora in English, Portuguese, and Spanish. To evaluate the impact of the association metrics we manually annotated corpora with three different syntactic patterns of collocations (adjective-noun, verb-object and nominal compounds). We took advantage of the PARSEME 1.1 Shared Task corpora by selecting a subset of 155k tokens in these three languages, in which we annotated 1,526 collocations with the corresponding Lexical Functions according to the Meaning-Text Theory. Using the resulting gold standard, we have carried out a comparison between frequency data and several well-known association measures, both symmetric and asymmetric. The results show that the combination of dependency triples with raw frequency information is as powerful as the best association measures in most syntactic patterns and languages. Furthermore, and despite the asymmetric behaviour of collocations, directional approaches perform worse than the symmetric ones in the extraction of these phraseological combinations.
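
To make the comparison in this abstract concrete, the sketch below ranks dependency triples by raw frequency and by pointwise mutual information (PMI), one common symmetric association measure. It is an illustration under stated assumptions, not the paper's experimental setup: the toy triples and the function name `rank_candidates` are invented, and the abstract does not say which measures were tested.

```python
import math
from collections import Counter

# Toy (head, relation, dependent) triples, invented for illustration only.
triples = [
    ("pay", "obj", "attention"),
    ("pay", "obj", "attention"),
    ("pay", "obj", "bill"),
    ("take", "obj", "bill"),
    ("take", "obj", "walk"),
    ("take", "obj", "walk"),
    ("take", "obj", "walk"),
]


def rank_candidates(triples):
    """Return raw frequency and PMI for every distinct dependency triple."""
    pair_freq = Counter(triples)
    head_freq = Counter(head for head, _, _ in triples)
    dep_freq = Counter(dep for _, _, dep in triples)
    total = len(triples)

    scores = {}
    for (head, rel, dep), freq in pair_freq.items():
        p_pair = freq / total
        p_head = head_freq[head] / total
        p_dep = dep_freq[dep] / total
        # PMI: log ratio of the observed to the expected co-occurrence rate.
        pmi = math.log2(p_pair / (p_head * p_dep))
        scores[(head, rel, dep)] = {"freq": freq, "pmi": pmi}
    return scores


scores = rank_candidates(triples)
by_freq = sorted(scores, key=lambda t: scores[t]["freq"], reverse=True)
by_pmi = sorted(scores, key=lambda t: scores[t]["pmi"], reverse=True)

print("Top by frequency:", by_freq[0])
print("Top by PMI:", by_pmi[0])
```

On this toy data the two rankings disagree about the top candidate, which is the kind of difference a gold standard lets one evaluate.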

2017

A rule-based system for cross-lingual parsing of Romance languages with Universal Dependencies
Marcos Garcia | Pablo Gamallo
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This article describes MetaRomance, a rule-based cross-lingual parser for Romance languages submitted to the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. The system is an almost delexicalized parser which does not need training data to analyze Romance languages. It contains linguistically motivated rules based on PoS-tag patterns. The rules included in MetaRomance were developed in about 12 hours by one expert with no prior knowledge of Universal Dependencies, and can be easily extended using a transparent formalism. In this paper we compare the performance of MetaRomance with other supervised systems participating in the competition, paying special attention to the parsing of different treebanks of the same language. We also compare our system with a delexicalized parser for Romance languages, and take advantage of the harmonized annotation of Universal Dependencies to propose a language ranking based on the syntactic distance each variety has from Romance languages.

A Web Interface for Diachronic Semantic Search in Spanish
Pablo Gamallo | Iván Rodríguez-Torres | Marcos Garcia
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

This article describes a semantic system which is based on distributional models obtained from a chronologically structured language resource, namely Google Books Syntactic Ngrams. The models were created using dependency-based contexts and a strategy for reducing the vector space, which consists in selecting the most informative and relevant word contexts. The system allows linguists to analyze meaning change of Spanish words in the written language across time.

Using bilingual word-embeddings for multilingual collocation extraction
Marcos Garcia | Marcos García-Salido | Margarita Alonso-Ramos
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

This paper presents a new strategy for multilingual collocation extraction which takes advantage of parallel corpora to learn bilingual word-embeddings. Monolingual collocation candidates are retrieved using Universal Dependencies, while the distributional models are then applied to search for equivalents of the elements of each collocation in the target languages. The proposed method extracts not only collocation equivalents with direct translation between languages, but also other cases where the collocations in the two languages are not literal translations of each other. Several experiments in English, Spanish, and Portuguese, evaluating collocations with three syntactic patterns, show that our approach can effectively extract large sets of bilingual equivalents with an average precision of about 90%. Moreover, preliminary results on comparable corpora suggest that the distributional models can be applied for identifying new bilingual collocations in different domains.

Towards Syntactic Iberian Polarity Classification
David Vilares | Marcos Garcia | Miguel A. Alonso | Carlos Gómez-Rodríguez
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Lexicon-based methods using syntactic rules for polarity classification rely on parsers that are dependent on the language and on treebank guidelines. Thus, rules are also dependent and require adaptation, especially in multilingual scenarios. We tackle this challenge in the context of the Iberian Peninsula, releasing the first symbolic syntax-based Iberian system with rules shared across five official languages: Basque, Catalan, Galician, Portuguese and Spanish. The model is made available.

2016

Incorporating Lexico-semantic Heuristics into Coreference Resolution Sieves for Named Entity Recognition at Document-level
Marcos Garcia
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper explores the incorporation of lexico-semantic heuristics into a deterministic Coreference Resolution (CR) system for classifying named entities at document-level. The most precise sieves of a CR tool are enriched with both a set of heuristics for merging named entities labeled with different classes and also with some constraints that avoid the incorrect merging of similar mentions. Several tests show that this strategy improves both NER labeling and CR. The CR tool can be applied in combination with any system for named entity recognition using the CoNLL format, and brings benefits to text analytics tasks such as Information Extraction. Experiments were carried out in Spanish, using three different NER tools.

2014

Citius: A Naive-Bayes Strategy for Sentiment Analysis on English Tweets
Pablo Gamallo | Marcos Garcia
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

An Entity-Centric Coreference Resolution System for Person Entities with Rich Linguistic Information
Marcos Garcia | Pablo Gamallo
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Multilingual corpora with coreferential annotation of person entities
Marcos Garcia | Pablo Gamallo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents three corpora with coreferential annotation of person entities for Portuguese, Galician and Spanish. They contain coreference links between several types of pronouns (including elliptical, possessive, indefinite, demonstrative, relative and personal clitic and non-clitic pronouns) and nominal phrases (including proper nouns). Some statistics have been computed, showing distributional aspects of coreference both in journalistic and in encyclopedic texts. Furthermore, the paper shows the importance of coreference resolution for a task such as Information Extraction, by evaluating the output of an Open Information Extraction system on the annotated corpora. The corpora are freely distributed in two formats, (i) the SemEval-2010 format and (ii) the format of the brat rapid annotation tool, so they can be enlarged and improved collaboratively.

2012

Dependency-Based Open Information Extraction
Pablo Gamallo | Marcos Garcia | Santiago Fernández-Lanza
Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP

2011

Evaluating Various Linguistic Features on Semantic Relation Extraction
Marcos Garcia | Pablo Gamallo
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Dependency-Based Text Compression for Semantic Relation Extraction
Marcos Garcia | Pablo Gamallo
Proceedings of the RANLP 2011 Workshop on Information Extraction and Knowledge Acquisition