Ankit Srivastava

Also published as: Ankit Kumar Srivastava, Ankit Kumar, Ankit K. Srivastava


2020

pdf bib
Noisy Text Data: Achilles’ Heel of BERT
Ankit Kumar | Piyush Makhija | Anuj Gupta
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Owing to the phenomenal success of BERT on various NLP tasks and benchmark datasets, industry practitioners are actively experimenting with fine-tuning BERT to build NLP applications for solving industry use cases. For most datasets that are used by practitioners to build industrial NLP applications, it is hard to guarantee absence of any noise in the data. While BERT has performed exceedingly well for transferring the learnings from one use case to another, it remains unclear how BERT performs when fine-tuned on noisy text. In this work, we explore the sensitivity of BERT to noise in the data. We work with most commonly occurring noise (spelling mistakes, typos) and show that this results in significant degradation in the performance of BERT. We present experimental results to show that BERT’s performance on fundamental NLP tasks like sentiment analysis and textual similarity drops significantly in the presence of (simulated) noise on benchmark datasets viz. IMDB Movie Review, STS-B, SST-2. Further, we identify shortcomings in the existing BERT pipeline that are responsible for this drop in performance. Our findings suggest that practitioners need to be vary of presence of noise in their datasets while fine-tuning BERT to solve industry use cases.

pdf bib
hinglishNorm - A Corpus of Hindi-English Code Mixed Sentences for Text Normalization
Piyush Makhija | Ankit Kumar | Anuj Gupta
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

We present hinglishNorm - a human annotated corpus of Hindi-English code-mixed sentences for text normalization task. Each sentence in the corpus is aligned to its corresponding human annotated normalized form. To the best of our knowledge, there is no corpus of Hindi-English code-mixed sentences for text normalization task that is publicly available. Our work is the first attempt in this direction. The corpus contains 13494 segments annotated for text normalization. Further, we present baseline normalization results on this corpus. We obtain a Word Error Rate (WER) of 15.55, BiLingual Evaluation Understudy (BLEU) score of 71.2, and Metric for Evaluation of Translation with Explicit ORdering (METEOR) score of 0.50.

2017

pdf bib
DFKI-DKT at SemEval-2017 Task 8: Rumour Detection and Classification using Cascading Heuristics
Ankit Srivastava | Georg Rehm | Julian Moreno Schneider
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We describe our submissions for SemEval-2017 Task 8, Determining Rumour Veracity and Support for Rumours. The Digital Curation Technologies (DKT) team at the German Research Center for Artificial Intelligence (DFKI) participated in two subtasks: Subtask A (determining the stance of a message) and Subtask B (determining veracity of a message, closed variant). In both cases, our implementation consisted of a Multivariate Logistic Regression (Maximum Entropy) classifier coupled with hand-written patterns and rules (heuristics) applied in a post-process cascading fashion. We provide a detailed analysis of the system performance and report on variants of our systems that were not part of the official submission.

pdf bib
Event Detection and Semantic Storytelling: Generating a Travelogue from a large Collection of Personal Letters
Georg Rehm | Julian Moreno Schneider | Peter Bourgonje | Ankit Srivastava | Jan Nehring | Armin Berger | Luca König | Sören Räuchle | Jens Gerth
Proceedings of the Events and Stories in the News Workshop

We present an approach at identifying a specific class of events, movement action events (MAEs), in a data set that consists of ca. 2,800 personal letters exchanged by the German architect Erich Mendelsohn and his wife, Luise. A backend system uses these and other semantic analysis results as input for an authoring environment that digital curators can use to produce new pieces of digital content. In our example case, the human expert will receive recommendations from the system with the goal of putting together a travelogue, i.e., a description of the trips and journeys undertaken by the couple. We describe the components and architecture and also apply the system to news data.

pdf bib
Semantic Storytelling, Cross-lingual Event Detection and other Semantic Services for a Newsroom Content Curation Dashboard
Julian Moreno-Schneider | Ankit Srivastava | Peter Bourgonje | David Wabnitz | Georg Rehm
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

We present a prototypical content curation dashboard, to be used in the newsroom, and several of its underlying semantic content analysis components (such as named entity recognition, entity linking, summarisation and temporal expression analysis). The idea is to enable journalists (a) to process incoming content (agency reports, twitter feeds, reports, blog posts, social media etc.) and (b) to create new articles more easily and more efficiently. The prototype system also allows the automatic annotation of events in incoming content for the purpose of supporting journalists in identifying important, relevant or meaningful events and also to adapt the content currently in production accordingly in a semi-automatic way. One of our long-term goals is to support journalists building up entire storylines with automatic means. In the present prototype they are generated in a backend service using clustering methods that operate on the extracted events.

pdf bib
Multi-source Neural Automatic Post-Editing: FBK’s participation in the WMT 2017 APE shared task
Rajen Chatterjee | M. Amin Farajian | Matteo Negri | Marco Turchi | Ankit Srivastava | Santanu Pal
Proceedings of the Second Conference on Machine Translation

2016

pdf bib
DFKI’s system for WMT16 IT-domain task, including analysis of systematic errors
Eleftherios Avramidis | Aljoscha Burchardt | Vivien Macketanz | Ankit Srivastava
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
A Recurrent Neural Network Architecture for De-identifying Clinical Records
Shweta | Ankit Kumar | Asif Ekbal | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 13th International Conference on Natural Language Processing

2011

pdf bib
Preliminary Experiments on Using Users’ Post-Editions to Enhance a SMT System Oracle-based Training for Phrase-based Statistical Machine Translation
Ankit Srivastava | Yanjun Ma | Andy Way
Proceedings of the 15th Annual conference of the European Association for Machine Translation

2010

pdf bib
MATREX: The DCU MT System for WMT 2010
Sergio Penkale | Rejwanul Haque | Sandipan Dandapat | Pratyush Banerjee | Ankit K. Srivastava | Jinhua Du | Pavel Pecina | Sudip Kumar Naskar | Mikel L. Forcada | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

pdf bib
English-Hindi Transliteration Using Context-Informed PB-SMT: the DCU System for NEWS 2009
Rejwanul Haque | Sandipan Dandapat | Ankit Kumar Srivastava | Sudip Kumar Naskar | Andy Way
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)