ISSN: 2182-2069 (printed) / ISSN: 2182-2077 (online)
Integrating Hybrid Neural Networks and Domain-Specific Embeddings for Detecting Hate Content in Code Mixed Social Media Comments
Hate speech is any communication intended to irritate, intimidate, disrupt, or incite anger in an individual or a group, typically targeting characteristics such as religion, ethnicity, appearance, or sexual orientation. Members of multilingual communities often converse in several regional languages at once, and the resulting text is known as code-mixed data because it combines multiple languages. This research addresses the detection of hate speech in code-mixed Malayalam-English (Manglish) text. We created a dataset of social media comments written in Manglish, collected from platforms such as YouTube and Facebook. Before turning to word embeddings, we developed a stopword list designed specifically for Manglish, which to our knowledge had not been done previously. This custom stopword list significantly improved our data preprocessing. We then evaluated several word embedding techniques and used GloVe to train a domain-specific word embedding model (DSG) for Malayalam-English code-mixed data. This embedding model was crucial in improving the overall performance of our model. In addition, we conducted a comprehensive set of experiments with machine learning classifiers such as logistic regression, SVM, and XGBoost, as well as deep learning models such as the Convolutional Neural Network (CNN) and bidirectional Long Short-Term Memory (BiLSTM). Based on these experiments, we propose a novel hybrid deep learning model that uses the domain-specific word embeddings. This approach proved effective on our dataset, achieving 96.4% accuracy in detecting hate speech in Manglish comments.
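The preprocessing and embedding steps summarized above can be illustrated with a minimal sketch. It assumes a tiny, hypothetical stopword set and whitespace tokenization (neither is the authors' actual list or tokenizer), and it computes only the distance-weighted word co-occurrence counts that GloVe-style training takes as input, not the embeddings themselves. The names `MANGLISH_STOPWORDS`, `preprocess`, and `cooccurrence_counts` are introduced here for illustration.

```python
from collections import Counter

# Hypothetical stopwords for illustration only; this is NOT the authors' list.
MANGLISH_STOPWORDS = {"aanu", "oru", "ee", "the", "is", "and"}


def preprocess(comment, stopwords=MANGLISH_STOPWORDS):
    """Lowercase, split on whitespace, strip edge punctuation, drop stopwords."""
    tokens = (tok.strip(".,!?\"'()") for tok in comment.lower().split())
    return [tok for tok in tokens if tok and tok not in stopwords]


def cooccurrence_counts(token_lists, window=2):
    """Symmetric co-occurrence counts within `window` tokens.

    Each pair is weighted by 1/distance, a common convention in
    GloVe-style co-occurrence counting (an assumption here, since the
    source does not state the settings used for DSG).
    """
    counts = Counter()
    for tokens in token_lists:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


if __name__ == "__main__":
    comments = ["Ee video oru waste aanu!", "Video super aanu"]
    cleaned = [preprocess(c) for c in comments]
    for pair, value in sorted(cooccurrence_counts(cleaned).items()):
        print(pair, value)
```

In a full pipeline, counts like these would be accumulated over the whole comment corpus and passed to a GloVe trainer to produce the domain-specific vectors.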