Developer Guide for Intel® Data Analytics Acceleration Library 2019 Update 5
Given a set X = {x_1 = (x_11, ..., x_1p), ..., x_n = (x_n1, ..., x_np)} of n p-dimensional feature vectors and a set Y = {y_1, ..., y_n} of corresponding responses, the problem is to build a decision forest classification or regression model.
The library uses the following algorithmic framework for the training stage. Let S = (X, Y) be the set of observations. Given positive integer parameters, such as the number of trees B, the bootstrap parameter N = f*n, where f is the fraction of observations used for training one tree, and the number of features per node m, the algorithm does the following for b = 1, ..., B: it selects randomly with replacement a set D_b of N vectors from the set S (the bootstrap set) and trains a decision tree T_b on D_b, as sketched below.
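The bootstrap step can be pictured with a short sketch. The following Python snippet is illustrative only (it is not the library's API); the function name bootstrap_indices is a hypothetical placeholder:

```python
import numpy as np

def bootstrap_indices(n, f, B, rng):
    """Draw, for each of the B trees, N = f*n observation indices
    randomly with replacement (the bootstrap set D_b)."""
    N = int(f * n)
    return [rng.integers(0, n, size=N) for _ in range(B)]

rng = np.random.default_rng(42)
sets = bootstrap_indices(n=1000, f=1.0, B=5, rng=rng)
print([len(s) for s in sets])  # five bootstrap sets, each of size N = 1000
```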
A decision tree T is trained using the training set D of size N. Each node t in the tree corresponds to a subset D_t of the training set D, with the root node corresponding to D itself. Each internal node t represents a binary test (split) dividing its subset X_t into two subsets X_{t_L} and X_{t_R}, corresponding to its children t_L and t_R.
The metric for measuring the best split is called impurity, i(t). It generally reflects the homogeneity of responses within the subset D_t in the node t.
For the detailed definition of i(t) metrics, see the description of a specific algorithm.
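As an illustration only (the exact metrics are defined per algorithm), typical impurity choices are the Gini index for classification and the variance for regression. The sketch below implements both; it is not the library's implementation:

```python
import numpy as np

def gini_impurity(labels):
    """Gini index, a common classification impurity: i(t) = 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def variance_impurity(responses):
    """Variance, a common regression impurity (mean squared deviation)."""
    return float(np.var(responses))

print(gini_impurity(np.array([0, 0, 1, 1])))    # 0.5: maximally mixed node
print(variance_impurity(np.array([1.0, 3.0])))  # 1.0
```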
Let the impurity decrease in the node t be

∆i(t) = i(t) − (|D_{t_L}| / |D_t|) · i(t_L) − (|D_{t_R}| / |D_t|) · i(t_R).
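Reusing gini_impurity from the sketch above, the impurity decrease can be computed directly from this formula; again, this is an illustration, not the library's implementation:

```python
import numpy as np

def impurity_decrease(i, left, right):
    """∆i(t) = i(t) - |D_tL|/|D_t| * i(t_L) - |D_tR|/|D_t| * i(t_R)."""
    parent = np.concatenate([left, right])
    n = len(parent)
    return (i(parent)
            - len(left) / n * i(left)
            - len(right) / n * i(right))

# A perfectly separating split of a mixed two-class node decreases
# the Gini impurity from 0.5 to 0:
print(impurity_decrease(gini_impurity, np.array([0, 0]), np.array([1, 1])))  # 0.5
```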
The library supports the following termination criteria of decision forest training: the maximal tree depth, the impurity threshold (a node is split only if its impurity exceeds the threshold), and the minimal number of observations in a leaf node.
Training of a decision tree follows the recursive procedure below, applied to each terminal node t (a sketch follows this list):
1. Stop if the termination criteria are met.
2. Choose randomly m feature indices j_1, ..., j_m.
3. For each j in {j_1, ..., j_m}, find the best split s_{j,t} that partitions the subset D_t and maximizes the impurity decrease ∆i(t).
4. Take the best split s_t among all s_{j,t} and split the node accordingly.
5. Apply the procedure recursively to t_L and t_R.
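A minimal sketch of this recursive procedure, assuming Gini impurity, nonnegative integer class labels, and a brute-force threshold search over the m randomly chosen features (illustrative only; the helper names gini and grow are hypothetical):

```python
import numpy as np

def gini(y):
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return 1.0 - float(np.sum(p ** 2))

def grow(X, y, m, rng, depth=0, max_depth=10, min_obs=2):
    # 1. Stop if the termination criteria are met.
    if depth >= max_depth or len(y) < min_obs or gini(y) == 0.0:
        return {"leaf": True, "votes": np.bincount(y)}
    # 2. Choose randomly m feature indices.
    feats = rng.choice(X.shape[1], size=min(m, X.shape[1]), replace=False)
    # 3.-4. Find the split s_t maximizing the impurity decrease ∆i(t).
    best = (None, None, 0.0)  # (feature, threshold, decrease)
    for j in feats:
        for thr in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= thr
            dec = (gini(y) - left.mean() * gini(y[left])
                   - (~left).mean() * gini(y[~left]))
            if dec > best[2]:
                best = (j, thr, dec)
    if best[0] is None:  # no split improves the impurity
        return {"leaf": True, "votes": np.bincount(y)}
    j, thr, _ = best
    left = X[:, j] <= thr
    # 5. Apply the procedure recursively to t_L and t_R.
    return {"leaf": False, "feature": j, "threshold": thr,
            "left":  grow(X[left],  y[left],  m, rng, depth + 1, max_depth, min_obs),
            "right": grow(X[~left], y[~left], m, rng, depth + 1, max_depth, min_obs)}
```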
To create a bootstrap set and choose feature indices in a performant way, the training algorithm requires a source of random numbers capable of producing sequences of random numbers in parallel. See Algorithms > Analysis > Engines for the description of the respective techniques.
Initialization of the engine in the decision forest depends on the parallelization technique used (Family, Leapfrog, or SkipAhead).
The state of the engine is updated once the training of the decision forest model is completed. The library provides support for retrieving an instance of the engine with the updated state, which can be used in other computations. The update of the state is engine-specific and depends on the parallelization technique used, as defined earlier:
Family: the updated state is the set of states that represent individual engines in the family.
Leapfrog: the updated state is the state of the subsequence with the rightmost position on the sequence. The example below demonstrates the idea for the case of two subsequences ('x' and 'o') of the random number sequence:

x o x o x o x o ...

The two subsequences are interleaved, and the updated state corresponds to the last consumed element, whichever subsequence it belongs to.
SkipAhead: the updated state is the state of the independent sequence with the rightmost position on the sequence. The example below demonstrates the idea for the case of two subsequences ('x' and 'o') of the random number sequence:

x x x x o o o o

The two subsequences occupy disjoint contiguous blocks, and the updated state corresponds to the block lying furthest to the right.
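The two partitioning techniques can be pictured with plain array slicing. The snippet below is only a schematic illustration of how Leapfrog interleaves subsequences while SkipAhead assigns disjoint contiguous blocks; it does not use the library's engine classes:

```python
import numpy as np

seq = np.arange(16)  # stand-in for one underlying random number sequence

# Leapfrog: each of the two engines takes every 2nd element ('x o x o ...').
leap_x, leap_o = seq[0::2], seq[1::2]

# SkipAhead: the second engine skips ahead by a fixed block ('x x ... o o ...').
skip_x, skip_o = seq[:8], seq[8:]

print(leap_x, leap_o)  # interleaved subsequences
print(skip_x, skip_o)  # disjoint contiguous blocks
# In both cases, the updated state after training corresponds to the
# subsequence whose last consumed element lies furthest to the right.
```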
Given a decision forest classifier and vectors x_1, ..., x_r, the problem is to calculate the responses for those vectors. To solve the problem for each given query vector x_i, the algorithm finds, in each tree of the forest, the leaf node that determines the response of that tree. The response of the forest is based on an aggregation of the responses from all trees in the forest. For the detailed definition, see the description of a specific algorithm.
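Continuing the training sketch above (the tree dictionaries produced by the hypothetical grow helper), prediction walks each tree to a leaf and aggregates the per-tree responses. Majority voting is shown here as the classification aggregation, as an illustration only:

```python
import numpy as np

def predict_one(tree, x):
    """Walk one tree from the root to the leaf that x falls into."""
    while not tree["leaf"]:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return int(np.argmax(tree["votes"]))

def forest_predict(trees, x):
    """Aggregate the responses of all trees by majority vote
    (regression forests would average the per-tree responses instead)."""
    responses = [predict_one(t, x) for t in trees]
    return int(np.argmax(np.bincount(responses)))
```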
Decision forests can produce additional characteristics, such as an estimate of the generalization error and an importance measure (relative decisive power) of each of the p features (variables).
The estimate of the generalization error based on the training data can be obtained as follows: predict the response for each observation using only the trees for which that observation is out-of-bag (that is, not included in the tree's bootstrap set), and average the resulting prediction errors over all such observations. This out-of-bag (OOB) error requires no separate test set.
For the detailed definition, see the description of a specific algorithm.
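Under the same illustrative setup (the trees, their bootstrap index sets, and forest_predict from the sketches above), the OOB estimate might be computed as follows; this is a sketch, not the library's implementation:

```python
import numpy as np

def oob_error(trees, bootstrap_sets, X, y):
    """Predict each x_i only with trees whose bootstrap set does not
    contain i, then average the per-vector prediction errors."""
    wrong, used = 0, 0
    for i in range(len(y)):
        oob_trees = [t for t, idx in zip(trees, bootstrap_sets)
                     if i not in set(idx)]
        if not oob_trees:
            continue  # x_i appeared in every bootstrap set
        used += 1
        wrong += forest_predict(oob_trees, X[i]) != y[i]
    return wrong / max(used, 1)
```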
There are two main types of variable importance measures:
Mean Decrease Impurity importance (MDI).
Importance of the j-th variable for predicting Y is the sum of weighted impurity decreases p(t)∆i(s_t, t) over all nodes t that use x_j, averaged over all B trees in the forest:

MDI(j) = (1/B) Σ_{b=1}^{B} Σ_{t ∈ T_b : v(s_t) = j} p(t) ∆i(s_t, t),

where p(t) = |D_t| / |D| is the fraction of observations reaching node t in the tree T_b, and v(s_t) is the index of the variable used in split s_t.
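Given per-node records (v(s_t), p(t), ∆i(s_t, t)) collected during training, MDI reduces to a weighted sum per feature. The following sketch assumes such records are available as plain tuples (they are not part of the library's public model format):

```python
import numpy as np

def mdi_importance(forest_nodes, p_features, B):
    """MDI(j) = (1/B) * sum over all trees T_b and all nodes t with
    v(s_t) = j of p(t) * ∆i(s_t, t)."""
    mdi = np.zeros(p_features)
    for nodes in forest_nodes:          # one list of records per tree T_b
        for v, p_t, delta_i in nodes:   # internal nodes of T_b
            mdi[v] += p_t * delta_i
    return mdi / B

# Two toy trees over p = 3 features, records as (v, p(t), delta_i):
trees = [[(0, 1.0, 0.30), (1, 0.5, 0.10)],
         [(0, 1.0, 0.20), (2, 0.4, 0.05)]]
print(mdi_importance(trees, p_features=3, B=2))  # [0.25 0.05 0.01]
```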
Mean Decrease Accuracy (MDA).
Importance of the j-th variable for predicting Y is the average increase in the OOB error over all trees in the forest when the values of the j-th variable are randomly permuted in the OOB set. For that reason, this latter measure is also known as permutation importance.
In more detail, the library calculates MDA importance as follows:
Let E_b be the OOB error calculated for the tree T_b on its out-of-bag dataset. Let E_{b,j} be the OOB error calculated for T_b on the same out-of-bag dataset with the values of the j-th variable randomly permuted. Then the MDA importance of the j-th variable is the average OOB error increase over all trees:

MDA(j) = (1/B) Σ_{b=1}^{B} (E_{b,j} − E_b).
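Putting the pieces together, a schematic permutation-importance computation, reusing predict_one and the bootstrap index sets from the sketches above (illustrative only, not the library's implementation):

```python
import numpy as np

def mda_importance(trees, bootstrap_sets, X, y, rng):
    """For each tree T_b: compute its OOB error E_b, then E_{b,j} on the
    OOB set with column j permuted, and average the increases E_{b,j} - E_b."""
    n, p = X.shape
    mda = np.zeros(p)
    for tree, idx in zip(trees, bootstrap_sets):
        oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag rows of T_b
        if oob.size == 0:
            continue
        e_b = np.mean([predict_one(tree, X[i]) != y[i] for i in oob])
        for j in range(p):
            Xp = X[oob].copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # permute the j-th variable
            e_bj = np.mean([predict_one(tree, row) != t
                            for row, t in zip(Xp, y[oob])])
            mda[j] += e_bj - e_b                  # OOB error increase
    return mda / len(trees)
```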