James Fiumara


2020

A Progress Report on Activities at the Linguistic Data Consortium Benefitting the LREC Community
Christopher Cieri | James Fiumara | Stephanie Strassel | Jonathan Wright | Denise DiPersio | Mark Liberman
Proceedings of the 12th Language Resources and Evaluation Conference

This latest in a series of Linguistic Data Consortium (LDC) progress reports to the LREC community does not describe any single language resource, evaluation campaign or technology but sketches the activities, since the last report, of a data center devoted to supporting the work of LREC attendees among other research communities. Specifically, we describe 96 new corpora released in 2018-2020 to date, a new technology evaluation campaign, ongoing activities to support multiple common task human language technology programs, and innovations to advance the methodology of language data collection and annotation.

Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"
James Fiumara | Christopher Cieri | Mark Liberman | Chris Callison-Burch
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"

LanguageARC: Developing Language Resources Through Citizen Linguistics
James Fiumara | Christopher Cieri | Jonathan Wright | Mark Liberman
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"

This paper introduces the citizen science platform LanguageARC, developed within the NIEUW (Novel Incentives and Workflows) project supported by the National Science Foundation under Grant No. 1730377. LanguageARC is a community-oriented online platform bringing together researchers and “citizen linguists” with the shared goal of contributing to linguistic research and language technology development. Like other citizen science platforms and projects, LanguageARC harnesses the efforts of volunteers who are motivated by the incentives of contributing to science, learning and discovery, and belonging to a community dedicated to social improvement. Citizen linguists contribute language data and judgments by participating in research tasks such as classifying regional accents from audio clips, recording audio of picture descriptions, and answering personality questionnaires to create baseline data for NLP research into autism and neurodegenerative conditions. Researchers can create projects on LanguageARC without writing any code or HTML by using our Project Builder Toolkit.

LanguageARC - a tutorial
Christopher Cieri | James Fiumara
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"

LanguageARC is a portal that offers citizen linguists opportunities to contribute to language-related research. It also provides researchers with infrastructure for easily creating data collection and annotation tasks on the portal and potentially connecting with contributors. This document describes LanguageARC’s main features and operation for researchers interested in creating new projects and/or using the resulting data.

2018

Introducing NIEUW: Novel Incentives and Workflows for Eliciting Linguistic Data
Christopher Cieri | James Fiumara | Mark Liberman | Chris Callison-Burch | Jonathan Wright
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2012

Creating HAVIC: Heterogeneous Audio Visual Internet Collection
Stephanie Strassel | Amanda Morris | Jonathan Fiscus | Christopher Caruso | Haejoong Lee | Paul Over | James Fiumara | Barbara Shaw | Brian Antonishek | Martial Michel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The Linguistic Data Consortium and the National Institute of Standards and Technology are collaborating to create a large, heterogeneous, annotated multimodal corpus to support research in multimodal event detection and related technologies. The HAVIC (Heterogeneous Audio Visual Internet Collection) Corpus will ultimately consist of several thousand hours of unconstrained user-generated multimedia content. HAVIC has been designed with an eye toward providing increased challenges for both acoustic and video processing technologies, focusing on the multi-dimensional variation inherent in user-generated multimedia content. To date, the HAVIC corpus has been used to support the NIST 2010 and 2011 TRECVID Multimedia Event Detection (MED) Evaluations. Portions of the corpus are expected to be released in LDC's catalog in the coming year, with the remaining segments being published over time after their use in the ongoing MED evaluations.