Developer Guide for Intel® Data Analytics Acceleration Library 2019 Update 1
The library provides kNN classification based on multidimensional binary search tree (K-D tree, where D means the dimension and K means the number of dimensions in the feature space). For more details, see [James2013, Patwary2016].
Given n feature vectors x1= (x11,…,x1p), …, xn= (xn1,…,xnp) of size p and a vector of class labels y = (y1,…,yn), where yi ∈ {0, 1, ..., C - 1} and C is the number of classes, describes the class to which the feature vector xi belongs, the problem is to build a kNN classifier.
Given a positive integer parameter k and a test observation x0, the kNN classifier does the following:
Intel DAAL version of the kNN algorithm uses the PANDA algorithm [Patwary2016] that uses a space-partitioning data structure known as K-D tree. Each non-leaf node of a tree contains the identifier of a feature along which to split the feature space and an appropriate feature value (a cut-point) that defines the splitting hyperplane to partition the feature space into two parts. Each leaf node of the tree has an associated subset (a bucket) of elements of the training data set. Feature vectors from any bucket belong to the region of the space defined by tree nodes on the path from the root node to the respective leaf.
For each non-leaf node, the process of building a K-D tree involves the choice of the feature (that is, dimension in the feature space) and the value for this feature (a cut-point) to split the feature space. This procedure starts with the entire feature space for the root node of the tree, and for every next level of the tree deals with ever smaller part of the feature space.
The PANDA algorithm constructs the K-D tree by choosing the dimension with the maximum variance for splitting [Patwary2016]. Therefore, for each new non-leaf node of the tree, the algorithm computes the variance of values that belong to the respective region of the space for each of the features and chooses the feature with the largest variance. Due to high computational cost of this operation, PANDA uses a subset of feature values to compute the variance.
PANDA uses a sampling heuristic to estimate the data distribution for the chosen feature and chooses the median estimate as the cut-point.
PANDA generates new K-D tree levels until the number of feature vectors in a leaf node gets less or equal to a predefined threshold. Once the threshold is reached, PANDA stops growing the tree and associates the feature vectors with the bucket of the respective leaf node.