Paul Ntim Yeboah

and 4 more

Cyber-attacks on industrial applications, specifically phishing and web attacks, are the most common data breach vectors and, as such, have attracted significant research attention. To mitigate these types of attacks, many countermeasures based on machine learning (ML) have been proposed. Although ML-based countermeasures are reported to yield satisfactory detection performance on phishing and web attacks, building them often requires massive amounts of manually labelled email and web request data. Manual labelling, however, is laborious and error-prone, and it may not be proactive in detecting novel phishing schemes and web attacks, since attacks need to be identified and annotated prior to training. To cope with the evolution of web attacks and phishing emails, methods that exploit the vast volumes of unlabelled email and web request data should be adopted. This study therefore proposes self-supervised learning (SSL) based on computer vision (CV) and natural language processing (NLP) techniques to pre-train models on unlabelled email texts and HTTP web requests for the detection of phishing emails and web attacks. By leveraging NLP and CV SSL methods, we pre-train models to learn the structural and contextual representations of unlabelled email texts and HTTP requests. By ensembling the features extracted from the pre-trained models, we obtain robust representations of email text and HTTP request data for the effective detection of phishing emails and web attacks. Experimental results on an imbalanced dataset show that the combined self-supervised pre-trained models outperform existing works with respect to accuracy, precision, recall, and F1-score for phishing email and web attack detection.
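The feature-ensembling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the features are combined, so this sketch assumes simple concatenation of L2-normalised vectors. The two encoders (`structural_features`, `contextual_features`) are hypothetical stand-ins for the SSL pre-trained CV and NLP models and are not the authors' models; all names and dimensions are illustrative assumptions.

```python
import hashlib
import math
from typing import Callable, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def structural_features(text: str, dim: int = 16) -> Vector:
    """Hypothetical stand-in for a CV-style pre-trained encoder:
    a fixed-size histogram over the raw bytes of the input."""
    v = [0.0] * dim
    for b in text.encode("utf-8"):
        v[b % dim] += 1.0
    return v


def contextual_features(text: str, dim: int = 16) -> Vector:
    """Hypothetical stand-in for an NLP-style pre-trained encoder:
    a hashed bag of whitespace-separated tokens."""
    v = [0.0] * dim
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        v[h % dim] += 1.0
    return v


def ensemble_features(text: str, encoders: Sequence[Callable[[str], Vector]]) -> Vector:
    """Assumed fusion rule: normalise each encoder's output, then concatenate."""
    fused: Vector = []
    for enc in encoders:
        fused.extend(l2_normalize(enc(text)))
    return fused


ENCODERS = [structural_features, contextual_features]
fused = ensemble_features("GET /index.php?id=1 HTTP/1.1", ENCODERS)
print(len(fused))  # 32: two 16-dimensional views concatenated
```

In a real pipeline the fused vector would be passed to a downstream classifier trained on a small labelled set. Normalising each view before concatenation is one possible design choice that keeps either encoder from dominating purely because of its output scale.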
