Fei Xia


2020

Improving Chinese Word Segmentation with Wordhood Memory Networks
Yuanhe Tian | Yan Song | Fei Xia | Tong Zhang | Yonggang Wang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Contextual features always play an important role in Chinese word segmentation (CWS). Wordhood information, being one of the contextual features, has proven useful in many conventional character-based segmenters. However, this feature receives less attention in recent neural models, and it is also challenging to design a framework that can properly integrate wordhood information from different wordhood measures into existing neural frameworks. In this paper, we therefore propose a neural framework, WMSeg, which uses memory networks to incorporate wordhood information with several popular encoder-decoder combinations for CWS. Experimental results on five benchmark datasets indicate that the memory mechanism successfully models wordhood information for neural segmenters and helps WMSeg achieve state-of-the-art performance on all those datasets. Further experiments and analyses also demonstrate the robustness of our proposed framework with respect to different wordhood measures and the efficiency of wordhood information in cross-domain experiments.

Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-way Attentions of Auto-analyzed Knowledge
Yuanhe Tian | Yan Song | Xiang Ao | Fei Xia | Xiaojun Quan | Tong Zhang | Yonggang Wang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are important fundamental tasks for Chinese language processing, and joint learning of them is an effective one-step solution for both tasks. Previous studies of joint CWS and POS tagging mainly follow the character-based tagging paradigm, introducing contextual information such as n-gram features or sentential representations from recurrent neural models. However, in many cases, joint tagging needs not only modeling of context features but also knowledge attached to them (e.g., syntactic relations among words); existing research has made limited efforts to meet such needs. In this paper, we propose a neural model named TwASP for joint CWS and POS tagging following the character-based sequence labeling paradigm, where a two-way attention mechanism is used to incorporate both context features and their corresponding syntactic knowledge for each input character. In particular, we use existing language processing toolkits to obtain the auto-analyzed syntactic knowledge for the context, and the proposed attention module can learn and benefit from it even though its quality may not be perfect. Our experiments illustrate the effectiveness of the two-way attentions for joint CWS and POS tagging, where state-of-the-art performance is achieved on five benchmark datasets.

Improving Constituency Parsing with Span Attention
Yuanhe Tian | Yan Song | Fei Xia | Tong Zhang
Findings of the Association for Computational Linguistics: EMNLP 2020

Constituency parsing is a fundamental and important task for natural language understanding, where a good representation of contextual information can help this task. N-grams, a conventional type of feature for contextual information, have been demonstrated to be useful in many tasks, and thus could also be beneficial for constituency parsing if they are appropriately modeled. In this paper, we propose span attention for neural chart-based constituency parsing to leverage n-gram information. Considering that current chart-based parsers with a Transformer-based encoder represent spans by subtraction of the hidden states at the span boundaries, which may cause information loss especially for long spans, we incorporate n-grams into span representations by weighting them according to their contributions to the parsing process. Moreover, we propose categorical span attention to further enhance the model by weighting n-grams within different length categories, and thus benefit long-sentence parsing. Experimental results on three widely used benchmark datasets demonstrate the effectiveness of our approach in parsing Arabic, Chinese, and English, where state-of-the-art performance is obtained by our approach on all of them.

Summarizing Medical Conversations via Identifying Important Utterances
Yan Song | Yuanhe Tian | Nan Wang | Fei Xia
Proceedings of the 28th International Conference on Computational Linguistics

Summarization is an important natural language processing (NLP) task in identifying key information from text. For conversations, the summarization systems need to extract salient contents from spontaneous utterances by multiple speakers. In a special task-oriented scenario, namely medical conversations between patients and doctors, the symptoms, diagnoses, and treatments could be highly important because the nature of such conversation is to find a medical solution to the problem proposed by the patients. Current online medical platforms provide millions of publicly available conversations between real patients and doctors, where the patients propose their medical problems and the registered doctors offer diagnosis and treatment; in most cases, such a conversation could be too long and its key information hard to locate. Therefore, summaries of the patients’ problems and the doctors’ treatments in the conversations can be highly useful, in terms of helping other patients with similar problems have a precise reference for potential medical solutions. In this paper, we focus on medical conversation summarization, using a dataset of medical conversations and corresponding summaries which were crawled from a well-known online healthcare service provider in China. We propose a hierarchical encoder-tagger model (HET) to generate summaries by identifying important utterances (with respect to problem proposing and solving) in the conversations. For the particular dataset used in this study, we show that high-quality summaries can be generated by extracting two types of utterances, namely, problem statements and treatment recommendations. Experimental results demonstrate that HET outperforms strong baselines and models from previous studies, and adding conversation-related features can further improve system performance.

Joint Chinese Word Segmentation and Part-of-speech Tagging via Multi-channel Attention of Character N-grams
Yuanhe Tian | Yan Song | Fei Xia
Proceedings of the 28th International Conference on Computational Linguistics

Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks for Chinese language processing. Previous studies have demonstrated that jointly performing them can be an effective one-step solution to both tasks and this joint task can benefit from a good modeling of contextual features such as n-grams. However, their work on modeling such contextual features is limited to concatenating the features or their embeddings directly with the input embeddings without distinguishing whether the contextual features are important for the joint task in the specific context. Therefore, their models for the joint task could be misled by unimportant contextual information. In this paper, we propose a character-based neural model for the joint task enhanced by multi-channel attention of n-grams. In the attention module, n-gram features are categorized into different groups according to several criteria, and n-grams in each group are weighted and distinguished according to their importance for the joint task in the specific context. To categorize n-grams, we try two criteria in this study, i.e., n-gram frequency and length, so that n-grams having different capabilities of carrying contextual information are discriminatively learned by our proposed attention module. Experimental results on five benchmark datasets for CWS and POS tagging demonstrate that our approach outperforms strong baseline models and achieves state-of-the-art performance on all five datasets.

Supertagging Combinatory Categorial Grammar with Attentive Graph Convolutional Networks
Yuanhe Tian | Yan Song | Fei Xia
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Supertagging is conventionally regarded as an important task for combinatory categorial grammar (CCG) parsing, where effective modeling of contextual information is highly important to this task. However, existing studies have made limited efforts to leverage contextual features except for applying powerful encoders (e.g., bi-LSTM). In this paper, we propose attentive graph convolutional networks to enhance neural CCG supertagging through a novel solution of leveraging contextual information. Specifically, we build the graph from chunks (n-grams) extracted from a lexicon and apply attention over the graph, so that different word pairs from the contexts within and across chunks are weighted in the model and facilitate the supertagging accordingly. The experiments performed on the CCGbank demonstrate that our approach outperforms all previous studies in terms of both supertagging and parsing. Further analyses illustrate the effectiveness of each component in our approach to discriminatively learn from word pairs to enhance CCG supertagging.

Studying Challenges in Medical Conversation with Structured Annotation
Nan Wang | Yan Song | Fei Xia
Proceedings of the First Workshop on Natural Language Processing for Medical Conversations

Medical conversation is a central part of medical care. Yet, the current state and quality of medical conversation are far from perfect. Therefore, a substantial amount of research has been done to obtain a better understanding of medical conversation and to address its practical challenges and dilemmas. In line with this stream of research, we have developed a multi-layer structure annotation scheme to analyze medical conversation, and are using the scheme to construct a corpus of naturally occurring medical conversation in a Chinese pediatric primary care setting. Some of the preliminary findings are reported regarding 1) how a medical conversation starts, 2) where communication problems tend to occur, and 3) how physicians close a conversation. Challenges and opportunities for research on medical conversation with NLP techniques will be discussed.

2019

ChiMed: A Chinese Medical Corpus for Question Answering
Yuanhe Tian | Weicheng Ma | Fei Xia | Yan Song
Proceedings of the 18th BioNLP Workshop and Shared Task

Question answering (QA) is a challenging task in natural language processing (NLP), especially when it is applied to specific domains. While models trained in the general domain can be adapted to a new target domain, their performance often degrades significantly due to domain mismatch. Alternatively, one can require a large amount of domain-specific QA data, but such data are rare, especially for the medical domain. In this study, we first collect a large-scale Chinese medical QA corpus called ChiMed; second we annotate a small fraction of the corpus to check the quality of the answers; third, we extract two datasets from the corpus and use them for the relevancy prediction task and the adoption prediction task. Several benchmark models are applied to the datasets, producing good results for both tasks.

WTMED at MEDIQA 2019: A Hybrid Approach to Biomedical Natural Language Inference
Zhaofeng Wu | Yan Song | Sicong Huang | Yuanhe Tian | Fei Xia
Proceedings of the 18th BioNLP Workshop and Shared Task

Natural language inference (NLI) is challenging, especially when it is applied to technical domains such as biomedical settings. In this paper, we propose a hybrid approach to biomedical NLI where different types of information are exploited for this task. Our base model includes a pre-trained text encoder as the core component, and a syntax encoder and a feature encoder to capture syntactic and domain-specific information. Then we combine the output of different base models to form more powerful ensemble models. Finally, we design two conflict resolution strategies when the test data contain multiple (premise, hypothesis) pairs with the same premise. We train our models on the MedNLI dataset, yielding the best performance on the test set of the MEDIQA 2019 Task 1.

2018

PDF-to-Text Reanalysis for Linguistic Data Mining
Michael Wayne Goodman | Ryan Georgi | Fei Xia
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Constructing a Chinese Medical Conversation Corpus Annotated with Conversational Structures and Actions
Nan Wang | Yan Song | Fei Xia
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Coding Structures and Actions with the COSTA Scheme in Medical Conversations
Nan Wang | Yan Song | Fei Xia
Proceedings of the BioNLP 2018 workshop

This paper describes the COSTA scheme for coding structures and actions in conversation. Informed by Conversation Analysis, the scheme introduces an innovative method for marking multi-layer structural organization of conversation and a structure-informed taxonomy of actions. In addition, we create a corpus of naturally occurring medical conversations, containing 318 video-recorded and manually transcribed pediatric consultations. Based on the annotated corpus, we investigate 1) treatment decision-making process in medical conversations, and 2) effects of physician-caregiver communication behaviors on antibiotic over-prescribing. Although the COSTA annotation scheme is developed based on data from the task-specific domain of pediatric consultations, it can be easily extended to apply to more general domains and other languages.

2017

Learning Word Representations with Regularization from Prior Knowledge
Yan Song | Chia-Jung Lee | Fei Xia
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Conventional word embeddings are trained with specific criteria (e.g., based on language modeling or co-occurrence) inside a single information source, disregarding the opportunity for further calibration using external knowledge. This paper presents a unified framework that leverages pre-learned or external priors, in the form of a regularizer, for enhancing conventional language model-based embedding learning. We consider two types of regularizers. The first type is derived from topic distribution by running LDA on unlabeled data. The second type is based on dictionaries that are created with human annotation efforts. To effectively learn with the regularizers, we propose a novel data structure, trajectory softmax, in this paper. The resulting embeddings are evaluated by word similarity and sentiment classification. Experimental results show that our learning framework with regularization from prior knowledge improves embedding quality across multiple datasets, compared to a diverse collection of baseline methods.

STREAMLInED Challenges: Aligning Research Interests with Shared Tasks
Gina-Anne Levow | Emily M. Bender | Patrick Littell | Kristen Howell | Shobhana Chelliah | Joshua Crowgey | Dan Garrette | Jeff Good | Sharon Hargus | David Inman | Michael Maxwell | Michael Tjalve | Fei Xia
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

Inferring Case Systems from IGT: Enriching the Enrichment
Kristen Howell | Emily M. Bender | Michel Lockwood | Fei Xia | Olga Zamaraeva
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

Computational Support for Finding Word Classes: A Case Study of Abui
Olga Zamaraeva | František Kratochvíl | Emily M. Bender | Fei Xia | Kristen Howell
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

CROWD-IN-THE-LOOP: A Hybrid Approach for Annotating Semantic Roles
Chenguang Wang | Alan Akbik | Laura Chiticariu | Yunyao Li | Fei Xia | Anbang Xu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Crowdsourcing has proven to be an effective method for generating labeled data for a range of NLP tasks. However, multiple recent attempts of using crowdsourcing to generate gold-labeled training data for semantic role labeling (SRL) reported only modest results, indicating that SRL is perhaps too difficult a task to be effectively crowdsourced. In this paper, we postulate that while producing SRL annotation does require expert involvement in general, a large subset of SRL labeling tasks is in fact appropriate for the crowd. We present a novel workflow in which we employ a classifier to identify difficult annotation tasks and route each task either to experts or crowd workers according to their difficulties. Our experimental evaluation shows that the proposed approach reduces the workload for experts by over two-thirds, and thus significantly reduces the cost of producing SRL annotation at little loss in quality.

2016

Annotating and Detecting Medical Events in Clinical Notes
Prescott Klassen | Fei Xia | Meliha Yetisgen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Early detection and treatment of diseases that onset after a patient is admitted to a hospital, such as pneumonia, is critical to improving healthcare and reducing its costs. Previous studies (Tepper et al., 2013) showed that change-of-state events in clinical notes could be important cues for phenotype detection. In this paper, we extend the annotation schema proposed by Klassen et al. (2014) to mark change-of-state events, diagnosis events, coordination, and negation. After completing the annotation, we build NLP systems to automatically identify named entities and medical events, which yield F-scores of 94.7% and 91.8%, respectively.

A Web-framework for ODIN Annotation
Ryan Georgi | Michael Wayne Goodman | Fei Xia
Proceedings of ACL-2016 System Demonstrations

2015

Enriching Interlinear Text using Automatically Constructed Annotators
Ryan Georgi | Fei Xia | William Lewis
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2014

Unsupervised Dependency Parsing with Transferring Distribution via Parallel Guidance and Entropy Regularization
Xuezhe Ma | Fei Xia
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Learning Grammar Specifications from IGT: A Case Study of Chintang
Emily M. Bender | Joshua Crowgey | Michael Wayne Goodman | Fei Xia
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages

Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Tutorial Abstracts
Qun Liu | Fei Xia
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Tutorial Abstracts

Enriching ODIN
Fei Xia | William Lewis | Michael Wayne Goodman | Joshua Crowgey | Emily M. Bender
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe the expansion of the ODIN resource, a database containing many thousands of instances of Interlinear Glossed Text (IGT) for over a thousand languages harvested from scholarly linguistic papers posted to the Web. A database containing a large number of instances of IGT, which are effectively richly annotated and heuristically aligned bitexts, provides a unique resource for bootstrapping NLP tools for resource-poor languages. To make the data in ODIN more readily consumable by tool developers and NLP researchers, we propose a new XML format for IGT, called Xigt. We call the updated release ODIN-II.

Modern Chinese Helps Archaic Chinese Processing: Finding and Exploiting the Shared Properties
Yan Song | Fei Xia
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Languages change over time and ancient languages have been studied in linguistics and other related fields. A main challenge in this research area is the lack of empirical data; for instance, ancient spoken languages often leave little trace of their linguistic properties. From the perspective of natural language processing (NLP), while the NLP community has created dozens of annotated corpora, very few of them are on ancient languages. As an effort toward bridging the gap, we have created a word segmented and POS tagged corpus for Archaic Chinese using articles from Huainanzi, a book written during China’s Western Han Dynasty (206 BC-9 AD). We then compare this corpus with the Chinese Penn Treebank (CTB), a well-known corpus for Modern Chinese, and report several interesting differences and similarities between the two corpora. Finally, we demonstrate that the CTB can be used to improve the performance of word segmenters and POS taggers for Archaic Chinese, but only through features that have similar behaviors in the two corpora.

Annotating Clinical Events in Text Snippets for Phenotype Detection
Prescott Klassen | Fei Xia | Lucy Vanderwende | Meliha Yetisgen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Early detection and treatment of diseases that onset after a patient is admitted to a hospital, such as pneumonia, is critical to improving healthcare and reducing its costs. NLP systems that analyze the narrative data embedded in clinical artifacts such as x-ray reports can help support early detection. In this paper, we consider the importance of identifying the change of state for events, in particular clinical events that measure and compare the multiple states of a patient’s health across time. We propose a schema for event annotation comprising five fields and create preliminary annotation guidelines for annotators to apply the schema. We then train annotators, measure their performance, and finalize our guidelines. With the complete guidelines, we then annotate a corpus of snippets extracted from chest x-ray reports in order to integrate the annotations as a new source of features for classification tasks.

2013

A Common Case of Jekyll and Hyde: The Synergistic Effect of Using Divided Source Training Data for Feature Augmentation
Yan Song | Fei Xia
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Annotating Change of State for Clinical Events
Lucy Vanderwende | Fei Xia | Meliha Yetisgen-Yildiz
Workshop on Events: Definition, Detection, Coreference, and Representation

Towards Creating Precision Grammars from Interlinear Glossed Text: Inferring Large-Scale Typological Properties
Emily M. Bender | Michael Wayne Goodman | Joshua Crowgey | Fei Xia
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Enhanced and Portable Dependency Projection Algorithms Using Interlinear Glossed Text
Ryan Georgi | Fei Xia | William D. Lewis
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data
Xuezhe Ma | Fei Xia
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

Measuring the Divergence of Dependency Structures Cross-Linguistically to Improve Syntactic Projection Algorithms
Ryan Georgi | Fei Xia | William Lewis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Syntactic parses can provide valuable information for many NLP tasks, such as machine translation, semantic analysis, etc. However, most of the world's languages do not have large amounts of syntactically annotated corpora available for building parsers. Syntactic projection techniques attempt to address this issue by using parallel corpora between resource-poor and resource-rich languages, bootstrapping the resource-poor language with the syntactic analysis of the resource-rich language. In this paper, we investigate the possibility of using small, parallel, annotated corpora to automatically detect divergent structural patterns between two languages. These patterns can then be used to improve structural projection algorithms, allowing for better performing NLP tools for resource-poor languages, in particular those that may not have large amounts of annotated data necessary for traditional, fully-supervised methods. While this detection process is not exhaustive, we demonstrate that important instances of divergence are picked up with minimal prior knowledge of a given language pair.

Using a Goodness Measurement for Domain Adaptation: A Case Study on Chinese Word Segmentation
Yan Song | Fei Xia
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Domain adaptation is an important topic for natural language processing. There has been extensive research on the topic and various methods have been explored, including training data selection, model combination, and semi-supervised learning. In this study, we propose to use a goodness measure, namely, description length gain (DLG), for domain adaptation for Chinese word segmentation. We demonstrate that DLG can help domain adaptation in two ways: as additional features for supervised segmenters to improve system performance, and also as a similarity measure for selecting training data to better match a test set. We evaluated our systems on the Chinese Penn Treebank version 7.0, which has 1.2 million words from five different genres, and the Chinese Word Segmentation Bakeoff-3 data.

Statistical Section Segmentation in Free-Text Clinical Records
Michael Tepper | Daniel Capurro | Fei Xia | Lucy Vanderwende | Meliha Yetisgen-Yildiz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Automatically segmenting and classifying clinical free text into sections is an important first step to automatic information retrieval, information extraction and data mining tasks, as it helps to ground the significance of the text within. In this work we describe our approach to automatic section segmentation of clinical records such as hospital discharge summaries and radiology reports, along with section classification into pre-defined section categories. We apply machine learning to the problems of section segmentation and section classification, comparing a joint (one-step) and a pipeline (two-step) approach. We demonstrate that our systems perform well when tested on three data sets, two for hospital discharge summaries and one for radiology reports. We then show the usefulness of section information by incorporating it in the task of extracting comorbidities from discharge summaries.

Effect of Genre Variation and Prediction of System Performance
Dong Wang | Fei Xia
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Domain adaptation is an important task in order for NLP systems to work well in real applications. There has been extensive research on this topic. In this paper, we address two issues that are related to domain adaptation. The first question is how much genre variation will affect NLP systems' performance. We investigate the effect of genre variation on the performance of three NLP tools, namely, word segmenter, POS tagger, and parser. We choose the Chinese Penn Treebank (CTB) as our corpus. The second question is how one can estimate NLP systems' performance when gold standard on the test data does not exist. To answer the question, we extend the prediction model in (Ravi et al., 2008) to provide prediction for word segmentation and POS tagging as well. Our experiments show that the predicted scores are close to the real scores when tested on the CTB data.

Proceedings of the Sixth Linguistic Annotation Workshop
Nancy Ide | Fei Xia
Proceedings of the Sixth Linguistic Annotation Workshop

Creating a Tree Adjoining Grammar from a Multilayer Treebank
Rajesh Bhatt | Owen Rambow | Fei Xia
Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11)

Improving Dependency Parsing with Interlinear Glossed Text and Syntactic Projection
Ryan Georgi | Fei Xia | William Lewis
Proceedings of COLING 2012: Posters

Entropy-based Training Data Selection for Domain Adaptation
Yan Song | Prescott Klassen | Fei Xia | Chunyu Kit
Proceedings of COLING 2012: Posters

2011

Email Formality in the Workplace: A Case Study on the Enron Corpus
Kelly Peterson | Matt Hohensee | Fei Xia
Proceedings of the Workshop on Language in Social Media (LSM 2011)

Linguistic Phenomena, Analyses, and Representations: Understanding Conversion between Treebanks
Rajesh Bhatt | Owen Rambow | Fei Xia
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

Preliminary Experiments with Amazon’s Mechanical Turk for Annotating Medical Named Entities
Meliha Yetisgen-Yildiz | Imre Solti | Fei Xia | Scott Halgrim
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

Extracting Medication Information from Discharge Summaries
Scott Halgrim | Fei Xia | Imre Solti | Eithon Cadag | Özlem Uzuner
Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents

Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground
Fei Xia | William Lewis | Lori Levin
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground

Comparing Language Similarity across Genetic and Typologically-Based Groupings
Ryan Georgi | Fei Xia | William Lewis
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

A comparison of unsupervised methods for Part-of-Speech Tagging in Chinese
Alex Cheng | Fei Xia | Jianfeng Gao
Coling 2010: Posters

Empty Categories in a Hindi Treebank
Archna Bhatia | Rajesh Bhatt | Bhuvana Narasimhan | Martha Palmer | Owen Rambow | Dipti Misra Sharma | Michael Tepper | Ashwini Vaidya | Fei Xia
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We are in the process of creating a multi-representational and multi-layered treebank for Hindi/Urdu (Palmer et al., 2009), which has three main layers: dependency structure, predicate-argument structure (PropBank), and phrase structure. This paper discusses an important issue in treebank design which is often neglected: the use of empty categories (ECs). All three levels of representation make use of ECs. We make a high-level distinction between two types of ECs, trace and silent, on the basis of whether they are postulated to mark displacement or not. Each type is further refined into several subtypes based on the underlying linguistic phenomena which the ECs are introduced to handle. This paper discusses the stages at which we add ECs to the Hindi/Urdu treebank and why. We investigate methodically the different types of ECs and their role in our syntactic and semantic representations. We also examine our decisions whether or not to coindex each type of ECs with other elements in the representation.

The Problems of Language Identification within Hugely Multilingual Data Sets
Fei Xia | Carrie Lewis | William D. Lewis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

As the data for more and more languages is finding its way into digital form, with an increasing amount of this data being posted to the Web, it has become possible to collect language data from the Web and create large multilingual resources, covering hundreds or even thousands of languages. ODIN, the Online Database of INterlinear text (Lewis, 2006), is such a resource. It currently consists of nearly 200,000 data points for over 1,000 languages, the data for which was harvested from linguistic documents on the Web. We identify a number of issues with language identification for such broad-coverage resources including the lack of training data, ambiguous language names, incomplete language code sets, and incorrect uses of language names and codes. After providing a short overview of existing language code sets maintained by the linguistic community, we discuss what linguists and the linguistic community can do to make the process of language identification easier.

2009

Applying NLP Technologies to the Collection and Enrichment of Language Data on the Web to Aid Linguistic Research
Fei Xia | William Lewis
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu
Rajesh Bhatt | Bhuvana Narasimhan | Martha Palmer | Owen Rambow | Dipti Sharma | Fei Xia
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

Language ID in the Context of Harvesting Language Data off the Web
Fei Xia | William Lewis | Hoifung Poon
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Parsing, Projecting & Prototypes: Repurposing Linguistic Data on the Web
William Lewis | Fei Xia
Proceedings of the Demonstrations Session at EACL 2009

2008

A Hybrid Approach to the Induction of Underlying Morphology
Michael Tepper | Fei Xia
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Repurposing Theoretical Linguistic Data for Tool Development and Search
Fei Xia | William D. Lewis
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Automatically Identifying Computationally Relevant Typological Features
William D. Lewis | Fei Xia
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

Finding parallel texts on the web using cross-language information retrieval
Achim Ruopp | Fei Xia
Proceedings of the 2nd workshop on Cross Lingual Information Access (CLIA) Addressing the Information Need of Multilingual Societies

Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics
Martha Palmer | Chris Brew | Fei Xia
Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics

Building a Flexible, Collaborative, Intensive Master’s Program in Computational Linguistics
Emily M. Bender | Fei Xia | Erik Bansleben
Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics

The Evolution of a Statistical NLP Course
Fei Xia
Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics

2007

Multilingual Structural Projection across Interlinear Text
Fei Xia | William Lewis
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

2006

Features, Bagging, and System Combination for the Chinese POS Tagging Task
Fei Xia | Lap Cheung
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing

2004

Improving a Statistical MT System with Automatically Learned Rewrite Patterns
Fei Xia | Michael McCord
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

A Phrase-based Unigram Model for Statistical Machine Translation
Christoph Tillmann | Fei Xia
Companion Volume of the Proceedings of HLT-NAACL 2003 - Short Papers

TIPS: A Translingual Information Processing System
Yaser Al-Onaizan | Radu Florian | Martin Franz | Hany Hassan | Young-Suk Lee | J. Scott McCarley | Kishore Papineni | Salim Roukos | Jeffrey Sorensen | Christoph Tillmann | Todd Ward | Fei Xia
Companion Volume of the Proceedings of HLT-NAACL 2003 - Demonstrations

2001

Converting Dependency Structures to Phrase Structures
Fei Xia | Martha Palmer
Proceedings of the First International Conference on Human Language Technology Research

2000

Developing Guidelines and Ensuring Consistency for Chinese Text Annotation
Fei Xia | Martha Palmer | Nianwen Xue | Mary Ellen Okurowski | John Kovarik | Fu-Dong Chiou | Shizhe Huang | Tony Kroch | Mitch Marcus
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Comparing Lexicalized Treebank Grammars Extracted from Chinese, Korean, and English Corpora
Fei Xia | Chunghye Han | Martha Palmer | Aravind Joshi
Second Chinese Language Processing Workshop

A Uniform Method of Grammar Extraction and Its Applications
Fei Xia | Martha Palmer | Aravind Joshi
2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

Some Experiments on Indicators of Parsing Complexity for Lexicalized Grammars
Anoop Sarkar | Fei Xia | Aravind Joshi
Proceedings of the COLING-2000 Workshop on Efficiency In Large-Scale Parsing Systems

A Corpus-based evaluation of syntactic locality in TAGs
Fei Xia | Tonia Bleam
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)

Comparing and integrating Tree Adjoining Grammars
Fei Xia | Martha Palmer
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)

1998

Consistent grammar development using partial-tree descriptions for Lexicalized Tree-Adjoining Grammars
Fei Xia | Martha Palmer | K. Vijay-Shanker | Joseph Rosenzweig
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)

1997

Maintaining the Forest and Burning out the Underbrush in XTAG
Christine Doran | Beth Hockey | Philip Hopely | Joseph Rosenzweig | Anoop Sarkar | B. Srinivas | Fei Xia
Computational Environments for Grammar Development and Linguistic Engineering

A Comparison of Head Transducers and Transfer for a Limited Domain Translation Application
Hiyan Alshawi | Adam L. Buchsbaum | Fei Xia
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics
