A new parallel data geometry analysis algorithm to select training data
for support vector machine
Abstract
The support vector machine (SVM) is one of the most powerful machine learning techniques and has attracted wide attention because of its remarkable performance. However, when classifying large-scale datasets, the high computational complexity of the SVM model leads to low efficiency or even makes training impractical. Exploiting the sparsity of SVM solutions in the sample space, this paper presents a new parallel data geometry analysis (PDGA) algorithm that reduces the SVM training set and thereby improves training efficiency. PDGA introduces the Mahalanobis distance to measure the distance from each sample to its class centroid and, on this basis, defines a hyperellipsoid spatial density that helps remove densely distributed redundant data. To further reduce the training set, a cosine angle distance analysis method is proposed to determine whether samples are redundant, ensuring that valuable data are not removed. Unlike previous data geometry analysis methods, the PDGA algorithm is implemented in parallel, which leads to substantial savings in computational cost. Experimental results on an artificial dataset and six real-world datasets show that the algorithm adapts to different sample distributions, significantly reduces training time and memory requirements without sacrificing classification accuracy, and performs significantly better than four competing algorithms.
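The sketch below illustrates, under stated assumptions, the two geometric measures named in the abstract: a per-class Mahalanobis distance to the class centroid (used as a proxy for hyperellipsoid spatial density) and a cosine angle test for flagging redundant samples. The function names, the keep_ratio quantile rule, and the angle threshold are illustrative choices, not the authors' implementation, whose exact formulas are given in the paper itself.

```python
# Illustrative sketch (not the authors' code): Mahalanobis distance to the class
# centroid, a simple density-based filter built on it, and a cosine-angle
# redundancy test between a sample's direction and the opposite class centroid.
import numpy as np

def mahalanobis_distances(X):
    """Distance of each sample in X to the class centroid under the class covariance."""
    centroid = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.pinv(cov)              # pseudo-inverse for numerical stability
    diff = X - centroid
    # diag(diff @ inv_cov @ diff.T) computed row-wise
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))

def density_filter(X, keep_ratio=0.7):
    """Drop the samples closest to the centroid (the densest hyperellipsoid shells),
    keeping the fraction `keep_ratio` lying farthest out (assumed heuristic)."""
    d = mahalanobis_distances(X)
    threshold = np.quantile(d, 1.0 - keep_ratio)
    return X[d >= threshold]

def cosine_boundary_mask(X, other_centroid, angle_threshold_deg=30.0):
    """Keep samples whose direction from their own centroid points toward the other
    class centroid within `angle_threshold_deg` degrees (assumed redundancy rule);
    the rest are treated as redundant and removed."""
    centroid = X.mean(axis=0)
    to_other = other_centroid - centroid
    vectors = X - centroid
    cos = vectors @ to_other / (np.linalg.norm(vectors, axis=1)
                                * np.linalg.norm(to_other) + 1e-12)
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angles <= angle_threshold_deg       # True = likely near the decision boundary
```

In a parallel setting, these per-class filters could be applied independently to partitions of each class (for example with Python's multiprocessing.Pool), which is one way to realize the parallelism the abstract claims; the paper's actual partitioning and scheduling scheme is not reproduced here.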