loading page

Lexicon - pointed hybrid N-gram Features Extraction Model (LeNFEM) for Sentence Level Sentiment Analysis.
  • James Mutinda,
  • Ronald Mwangi,
  • George Okeyo
James Mutinda

Corresponding Author:jsmtinda@yahoo.com

Author Profile
Ronald Mwangi
Author Profile
George Okeyo
Author Profile

Abstract

Sentiment analysis of social media posts and texts can provide information and knowledge that is applicable in social settings, business intelligence, evaluation of citizens’ opinions in governance and mood triggered devices in Internet of Things. Feature extraction and selection is a key determinant of accuracy and computational cost of machine learning models for such analysis. Most feature extraction and selection techniques utilize bag of words such as N-grams and frequency-based algorithms especially Term Frequency-Inverse document frequency (TF-IDF). However, these approaches suffer shortcomings such as; they do not consider relationships between words, they ignore words’ characteristics and they suffer high feature dimensionality. In this paper we propose and evaluate an approach that utilizes a fixed hybrid N-gram window for feature extraction and Minimum Redundancy Maximum Relevance feature selection for sentence level sentiment analysis. The approach improves the existing feature extraction techniques specifically the N-gram by generating a tri-gram vector from words, Part of speech tags and word semantic orientation. The N-gram vector is extracted by employing a static 3-gram window identified by a lexicon where a sentiment word appears in a sentence. A blend of the words, POS tags and the sentiment orientations of the 3N-gram are used to build the feature vector. The optimal features from the vector are then selected using Minimum Redundancy Maximum Relevance (MR2) algorithm. Experiments were carried out with a publicly available yelp tweets dataset to evaluate the performance of four supervised machine learning classifiers (Naïve Bayes, K-Nearest Neighbor, Decision Tree and Support Vector Machines) when augmented with the proposed model. The results showed that the proposed model had the highest accuracy (86.85%), recall (86.85%) and precision (86.96%).
18 Sep 2020Submitted to Engineering Reports
18 Sep 2020Submission Checks Completed
18 Sep 2020Assigned to Editor
22 Sep 2020Reviewer(s) Assigned
27 Oct 2020Editorial Decision: Reject & Resub
29 Nov 20201st Revision Received
30 Nov 2020Submission Checks Completed
30 Nov 2020Assigned to Editor
30 Nov 2020Reviewer(s) Assigned
04 Jan 2021Editorial Decision: Revise Minor
15 Jan 20212nd Revision Received
18 Jan 2021Submission Checks Completed
18 Jan 2021Assigned to Editor
20 Jan 2021Editorial Decision: Accept
07 Feb 2021Published in Engineering Reports. 10.1002/eng2.12374