Charibeth Cheng


2020

pdf bib
Localization of Fake News Detection via Multitask Transfer Learning
Jan Christian Blaise Cruz | Julianne Agatha Tan | Charibeth Cheng
Proceedings of the 12th Language Resources and Evaluation Conference

The use of the internet as a fast medium of spreading fake news reinforces the need for computational tools that combat it. Techniques that train fake news classifiers exist, but they all assume an abundance of resources including large labeled datasets and expert-curated corpora, which low-resource languages may not have. In this work, we make two main contributions: First, we alleviate resource scarcity by constructing the first expertly-curated benchmark dataset for fake news detection in Filipino, which we call “Fake News Filipino.” Second, we benchmark Transfer Learning (TL) techniques and show that they can be used to train robust fake news classifiers from little data, achieving 91% accuracy on our fake news dataset, reducing the error by 14% compared to established few-shot baselines. Furthermore, lifting ideas from multitask learning, we show that augmenting transformer-based transfer techniques with auxiliary language modeling losses improves their performance by adapting to writing style. Using this, we improve TL performance by 4-6%, achieving an accuracy of 96% on our best model. Lastly, we show that our method generalizes well to different types of news articles, including political news, entertainment news, and opinion articles.

2019

pdf bib
Annotation Process for the Dialog Act Classification of a Taglish E-commerce Q&A Corpus
Jared Rivera | Jan Caleb Oliver Pensica | Jolene Valenzuela | Alfonso Secuya | Charibeth Cheng
Proceedings of the Second Workshop on Economics and Natural Language Processing

With conversational agents or chatbots making up in quantity of replies rather than quality, the need to identify user intent has become a main concern to improve these agents. Dialog act (DA) classification tackles this concern, and while existing studies have already addressed DA classification in general contexts, no training corpora in the context of e-commerce is available to the public. This research addressed the said insufficiency by building a text-based corpus of 7,265 posts from the question and answer section of products on Lazada Philippines. The SWBD-DAMSL tagset for DA classification was modified to 28 tags fitting the categories applicable to e-commerce conversations. The posts were annotated manually by three (3) human annotators and preprocessing techniques decreased the vocabulary size from 6,340 to 1,134. After analysis, the corpus was composed dominantly of single-label posts, with 34% of the corpus having multiple intent tags. The annotated corpus allowed insights toward the structure of posts created with single to multiple intents.

2018

pdf bib
Modeling Personality Traits of Filipino Twitter Users
Edward Tighe | Charibeth Cheng
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

Recent studies in the field of text-based personality recognition experiment with different languages, feature extraction techniques, and machine learning algorithms to create better and more accurate models; however, little focus is placed on exploring the language use of a group of individuals defined by nationality. Individuals of the same nationality share certain practices and communicate certain ideas that can become embedded into their natural language. Many nationals are also not limited to speaking just one language, such as how Filipinos speak Filipino and English, the two national languages of the Philippines. The addition of several regional/indigenous languages, along with the commonness of code-switching, allow for a Filipino to have a rich vocabulary. This presents an opportunity to create a text-based personality model based on how Filipinos speak, regardless of the language they use. To do so, data was collected from 250 Filipino Twitter users. Different combinations of data processing techniques were experimented upon to create personality models for each of the Big Five. The results for both regression and classification show that Conscientiousness is consistently the easiest trait to model, followed by Extraversion. Classification models for Agreeableness and Neuroticism had subpar performances, but performed better than those of Openness. An analysis on personality trait score representation showed that classifying extreme outliers generally produce better results for all traits except for Neuroticism and Openness.

2009

pdf bib
Philippine Language Resources: Trends and Directions
Rachel Edita Roxas | Charibeth Cheng | Nathalie Rose Lim
Proceedings of the 7th Workshop on Asian Language Resources (ALR7)

2008

pdf bib
Natural Language Database Interface for the Community Based Monitoring System
Krissanne Kaye Garcia | Ma. Angelica Lumain | Jose Antonio Wong | Jhovee Gerard Yap | Charibeth Cheng
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation