Developer Guide for Intel® Data Analytics Acceleration Library 2019 Update 1

Details

Given n feature vectors $X = \{x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\}$ of dimension $p$ and a vector of class labels $y = (y_1, \ldots, y_n)$, where $y_i \in \{0, 1, \ldots, C-1\}$ describes the class to which the feature vector $x_i$ belongs and $C$ is the number of classes, the problem is to build a decision forest classifier.

Training Stage

The decision forest classifier follows the algorithmic framework of decision forest training, using Gini impurity as the impurity metric. It is calculated as follows:

$$I_{Gini}(D) = 1 - \sum_{i=0}^{C-1} p_i^2,$$

where $p_i$ is the fraction of observations in the subset $D$ that belong to the $i$-th class.
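As a minimal illustration of this formula (a standalone C++ sketch, not the DAAL API; the function name and types are hypothetical), the Gini impurity of a subset of observations can be computed from their class labels:

```cpp
#include <cstddef>
#include <vector>

// Gini impurity of a subset D of observations with class labels in
// {0, ..., nClasses - 1}: I_Gini(D) = 1 - sum_i p_i^2, where p_i is
// the fraction of observations in D that belong to the i-th class.
double giniImpurity(const std::vector<int>& labels, std::size_t nClasses) {
    if (labels.empty()) return 0.0;
    std::vector<std::size_t> counts(nClasses, 0);
    for (int y : labels) ++counts[y];
    double sumSq = 0.0;
    for (std::size_t c = 0; c < nClasses; ++c) {
        const double p = static_cast<double>(counts[c]) / labels.size();
        sumSq += p * p;
    }
    return 1.0 - sumSq;
}
```

A pure subset ($p_i = 1$ for one class) yields impurity 0, and impurity is maximal when the classes are evenly mixed, which is why splits are chosen to decrease it.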

Prediction Stage

Given a decision forest classifier and vectors $x_1, \ldots, x_r$, the problem is to calculate the labels for those vectors. To solve the problem, for each given query vector $x_i$ the algorithm finds the leaf node in each tree of the forest that gives the classification response of that tree. The forest chooses the label $y$ for which the majority of trees in the forest vote.
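A minimal sketch of the majority-vote step in C++ (not the DAAL API; the per-tree response representation is an assumption made for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-tree response: the class label returned by the leaf
// node that the query vector reaches in one tree of the forest.
using TreeResponse = int;

// Majority vote across the trees: the predicted label is the class
// chosen by the largest number of trees.
int predictByMajorityVote(const std::vector<TreeResponse>& perTreeLabels,
                          std::size_t nClasses) {
    std::vector<std::size_t> votes(nClasses, 0);
    for (int label : perTreeLabels) ++votes[label];
    return static_cast<int>(
        std::max_element(votes.begin(), votes.end()) - votes.begin());
}
```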

Out-of-bag Error

The decision forest classifier follows the algorithmic framework for calculating the decision forest out-of-bag (OOB) error. The out-of-bag predictions of all trees are aggregated, and the OOB error of the decision forest is calculated as follows:

$$\hat{y}_i = \operatorname*{argmax}_{k \in \{0, \ldots, C-1\}} \sum_{b : x_i \in \overline{D_b}} I\left(\hat{y}_i^{(b)} = k\right), \qquad OOB = \frac{1}{|D'|} \sum_{x_i \in D'} I\left(y_i \neq \hat{y}_i\right),$$

where $\hat{y}_i^{(b)}$ is the prediction of the $b$-th tree for the vector $x_i$, $\overline{D_b}$ is the set of observations that are out-of-bag for the $b$-th tree, and $D'$ is the set of all out-of-bag observations.
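The following C++ sketch illustrates this aggregation under an assumed vote-table representation (not the DAAL API): oobVotes[i][k] holds the number of votes for class k that vector $x_i$ received from trees for which it was out-of-bag.

```cpp
#include <cstddef>
#include <vector>

// OOB error: for each vector, take the majority class among its
// out-of-bag votes and compare with the true label; vectors that
// appeared in every bootstrap sample (no OOB votes) are skipped.
double oobError(const std::vector<std::vector<std::size_t>>& oobVotes,
                const std::vector<int>& trueLabels) {
    std::size_t nOob = 0, nWrong = 0;
    for (std::size_t i = 0; i < trueLabels.size(); ++i) {
        std::size_t best = 0, total = 0;
        for (std::size_t k = 0; k < oobVotes[i].size(); ++k) {
            total += oobVotes[i][k];
            if (oobVotes[i][k] > oobVotes[i][best]) best = k;
        }
        if (total == 0) continue;  // x_i was in every bootstrap sample
        ++nOob;
        if (static_cast<int>(best) != trueLabels[i]) ++nWrong;
    }
    return nOob ? static_cast<double>(nWrong) / nOob : 0.0;
}
```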

Variable Importance

The library computes the Mean Decrease Impurity (MDI) importance measure, also known as the Gini importance or Mean Decrease Gini, by using the Gini index as the impurity metric.
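A sketch of how MDI importance can be accumulated, assuming a hypothetical flattened representation of the forest's split nodes (this is an illustration of the general MDI technique, not DAAL's internal data structures):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one split node: the feature it tests and
// the weighted impurity decrease it achieves, i.e.
// (n_node/n)*I(node) - (n_left/n)*I(left) - (n_right/n)*I(right),
// with I the Gini impurity and n the total number of observations.
struct SplitRecord {
    std::size_t feature;
    double weightedImpurityDecrease;
};

// MDI importance of each feature: the impurity decreases of all split
// nodes that use the feature, summed per tree and averaged over trees.
std::vector<double> mdiImportance(
        const std::vector<std::vector<SplitRecord>>& trees,
        std::size_t nFeatures) {
    std::vector<double> importance(nFeatures, 0.0);
    if (trees.empty()) return importance;
    for (const auto& tree : trees)
        for (const SplitRecord& s : tree)
            importance[s.feature] += s.weightedImpurityDecrease;
    for (double& v : importance) v /= trees.size();
    return importance;
}
```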