Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts
- Anthology ID:
- Barcelona, Spain (Online)
- International Committee for Computational Linguistics
This is an introductory tutorial to UCCA (Universal Conceptual Cognitive Annotation), a cross-linguistically applicable framework for semantic representation, with corpora annotated in English, German and French, and ongoing annotation in Russian and Hebrew. UCCA builds on extensive typological work and supports rapid annotation. The tutorial will provide a detailed introduction to the UCCA annotation guidelines, design philosophy and the available resources; and a comparison to other meaning representations. It will also survey the existing parsing work, including the findings of three recent shared tasks, in SemEval and CoNLL, that addressed UCCA parsing. Finally, the tutorial will present recent applications and extensions to the scheme, demonstrating its value for natural language processing in a range of languages and domains.
Embeddings have been one of the most important topics of interest in NLP for the past decade. Representing knowledge through a low-dimensional vector which is easily integrable in modern machine learning models has played a central role in the development of the field. Embedding techniques initially focused on words but the attention soon started to shift to other forms. This tutorial will provide a high-level synthesis of the main embedding techniques in NLP, in the broad sense. We will start by conventional word embeddings (e.g., Word2Vec and GloVe) and then move to other types of embeddings, such as sense-specific and graph alternatives. We will finalize with an overview of the trending contextualized representations (e.g., ELMo and BERT) and explain their potential and impact in NLP.
The advent of neural machine translation (NMT) has opened up exciting research in building multilingual translation systems i.e. translation models that can handle more than one language pair. Many advances have been made which have enabled (1) improving translation for low-resource languages via transfer learning from high resource languages; and (2) building compact translation models spanning multiple languages. In this tutorial, we will cover the latest advances in NMT approaches that leverage multilingualism, especially to enhance low-resource translation. In particular, we will focus on the following topics: modeling parameter sharing for multi-way models, massively multilingual models, training protocols, language divergence, transfer learning, zero-shot/zero-resource learning, pivoting, multilingual pre-training and multi-source translation.
Detecting and grounding false and misleading claims on the web has grown to form a substantial sub-field of NLP. The sub-field addresses problems at multiple different levels of misinformation detection: identifying check-worthy claims; tracking claims and rumors; rumor collection and annotation; grounding claims against knowledge bases; using stance to verify claims; and applying style analysis to detect deception. This half-day tutorial presents the theory behind each of these steps as well as the state-of-the-art solutions.
Question answering, natural language inference and commonsense reasoning are increasingly popular as general NLP system benchmarks, driving both modeling and dataset work. Only for question answering we already have over 100 datasets, with over 40 published after 2018. However, most new datasets get “solved” soon after publication, and this is largely due not to the verbal reasoning capabilities of our models, but to annotation artifacts and shallow cues in the data that they can exploit. This tutorial aims to (1) provide an up-to-date guide to the recent datasets, (2) survey the old and new methodological issues with dataset construction, and (3) outline the existing proposals for overcoming them. The target audience is the NLP practitioners who are lost in dozens of the recent datasets, and would like to know what these datasets are actually measuring. Our overview of the problems with the current datasets and the latest tips and tricks for overcoming them will also be useful to the researchers working on future benchmarks.
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting all types of errors in written text. Although most research has focused on correcting errors in the context of English as a Second Language (ESL), GEC can also be applied to other languages and native text. The main application of a GEC system is thus to assist humans with their writing. Academic and commercial interest in GEC has grown significantly since the Helping Our Own (HOO) and Conference on Natural Language Learning (CoNLL) shared tasks in 2011-14, and a record-breaking 24 teams took part in the recent Building Educational Applications (BEA) shared task. Given this interest, and the recent shift towards neural approaches, we believe the time is right to offer a tutorial on GEC for researchers who may be new to the field or who are interested in the current state of the art and future challenges. With this in mind, the main goal of this tutorial is not only to bring attendees up to speed with GEC in general, but also examine the development of neural-based GEC systems.
This tutorial will focus on NLP for endangered languages documentation and revitalization. First, we will acquaint the attendees with the process and the challenges of language documentation, showing how the needs of the language communities and the documentary linguists map to specific NLP tasks. We will then present the state-of-the-art in NLP applied in this particularly challenging setting (extremely low-resource datasets, noisy transcriptions, limited annotations, non-standard orthographies). In doing so, we will also analyze the challenges of working in this domain and expand on both the capabilities and the limitations of current NLP approaches. Our ultimate goal is to motivate more NLP practitioners to work towards this very important direction, and also provide them with the tools and understanding of the limitations/challenges, both of which are needed in order to have an impact.