TeamDL at SemEval-2018 Task 8: Cybersecurity Text Analysis using Convolutional Neural Network and Conditional Random Fields

Manikandan R, Krishna Madgula, Snehanshu Saha


Abstract
In this work we present our participation in subtasks 1 and 2 of SemEval-2018 Task 8. We developed a Convolutional Neural Network system for malware sentence classification (subtask 1) and a Conditional Random Fields system for malware token label prediction (subtask 2). We experimented with a couple of word embedding strategies and feature sets and achieved competitive performance across the two subtasks. For subtask 1, we experimented with two categories of word embeddings, native and task-specific, using the Word2vec and GloVe algorithms. (1) Native embeddings: all words, including unknown ones (which are randomly initialized), use embeddings from the original Word2vec/GloVe models. (2) Task-specific embeddings: the embeddings are generated by training the Word2vec/GloVe algorithms on sentences from MalwareTextDB. We found that GloVe outperforms the rest of the embeddings for subtask 1. For subtask 2, we used n-grams of size 6; previous and next tokens and labels; features giving disjunctions of words anywhere to the left or right; word shape features; lemmas of the current, previous, and next words; word-tag pair features; POS tags; and prefixes and suffixes.
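To make the subtask 2 feature list more concrete, the following is a minimal sketch of how a few of the named token-level features (word shape, prefix and suffix, previous and next tokens) might be computed for one token. It is an illustration only, not the authors' implementation: the function names `word_shape` and `token_features`, the affix length of 3, the `<BOS>`/`<EOS>` boundary markers, and the example sentence are all assumptions. Lemma, POS, n-gram, and label features are omitted because they need an external tagger or the CRF toolkit itself.

```python
import re


def word_shape(token: str) -> str:
    """Map a token to a coarse shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens: list[str], i: int, affix_len: int = 3) -> dict[str, str]:
    """Build a feature dict for tokens[i] (hypothetical feature names).

    affix_len is an assumed setting; the abstract does not state
    which prefix/suffix lengths were used.
    """
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "shape": word_shape(word),
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        # Neighbouring tokens, with assumed sentence-boundary markers.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


if __name__ == "__main__":
    sentence = ["The", "Trojan", "drops", "payload.exe"]  # made-up example
    for idx in range(len(sentence)):
        print(token_features(sentence, idx))
```

A list of such dictionaries, one per token in a sentence, is the kind of input that common CRF toolkits accept for sequence labelling.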
Anthology ID:
S18-1140
Volume:
Proceedings of The 12th International Workshop on Semantic Evaluation
Month:
June
Year:
2018
Address:
New Orleans, Louisiana
Venue:
*SEMEVAL
SIGs:
SIGLEX | SIGSEM
Publisher:
Association for Computational Linguistics
Pages:
868–873
URL:
https://www.aclweb.org/anthology/S18-1140
DOI:
10.18653/v1/S18-1140
PDF:
http://aclanthology.lst.uni-saarland.de/S18-1140.pdf