A new parallel data geometry analysis algorithm to select training data for support vector machine
  • Yunfeng Shi,
  • Shu Lv,
  • Kaibo Shi
Yunfeng Shi
University of Electronic Science and Technology of China
Shu Lv
University of Electronic Science and Technology of China

Corresponding Author: lvshu@uestc.edu.cn

Kaibo Shi
Chengdu University

Abstract

Support vector machine (SVM) is one of the most powerful techniques in machine learning and has attracted wide attention because of its remarkable performance. However, on classification problems with large-scale datasets, the high complexity of the SVM model makes training inefficient or even impractical. Exploiting the sparsity of SVM in the sample space, this paper presents a new parallel data geometry analysis (PDGA) algorithm that reduces the SVM training set and thereby improves training efficiency. PDGA uses the Mahalanobis distance to measure the distance from each sample to its centroid and, on this basis, defines a hyperellipsoid spatial density that helps remove dense redundant data. To reduce the training set further, a cosine angle distance analysis method is proposed to determine whether samples are redundant, ensuring that valuable data are not removed. Unlike previous data geometry analysis methods, PDGA is implemented in parallel, which leads to substantial savings in computational cost. Experimental results on an artificial dataset and 6 real datasets show that the algorithm adapts to different sample distributions, significantly reduces training time and memory requirements without sacrificing classification accuracy, and performs significantly better than the 4 other competing algorithms.
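
The two geometric quantities named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the definition of hyperellipsoid spatial density, the redundancy thresholds, or the parallelization scheme, so none of those are shown. The function names `mahalanobis_to_centroid` and `cosine_of_angle` are hypothetical, and the use of the sample covariance with a pseudo-inverse is an assumption made here for numerical safety.

```python
import numpy as np


def mahalanobis_to_centroid(X):
    """Mahalanobis distance from each row of X to the centroid of X.

    Assumption: the covariance is estimated from X itself (sample
    covariance); the paper may estimate it differently.
    """
    X = np.asarray(X, dtype=float)
    centroid = X.mean(axis=0)
    # atleast_2d keeps the single-feature case working.
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # The pseudo-inverse guards against a singular covariance matrix.
    cov_inv = np.linalg.pinv(cov)
    diff = X - centroid
    # Squared distance per row: diff_i^T @ cov_inv @ diff_i.
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))


def cosine_of_angle(u, v):
    """Cosine of the angle between two nonzero vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


if __name__ == "__main__":
    # Toy data: the four corners of a square, all equally far from the centroid.
    X = [[0, 0], [2, 0], [0, 2], [2, 2]]
    print(mahalanobis_to_centroid(X))
    print(cosine_of_angle([1, 0], [0, 1]))
```

In a data-reduction setting, distances like these would typically be computed per class, and each class could be processed independently, which is one plausible place for parallelism; how PDGA actually partitions the work is not stated in the abstract.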