Developer Guide for Intel® Data Analytics Acceleration Library 2019 Update 5

Details

Given a set X = {x_1 = (x_11, ..., x_1p), ..., x_n = (x_n1, ..., x_np)} of n p-dimensional feature vectors and n responses Y = {y_1, ..., y_n}, the problem is to build a decision forest classification or regression model.

Training Stage

The library uses the following algorithmic framework for the training stage. Let S = (X, Y) be the set of observations. Given positive integer parameters, such as the number of trees B, the bootstrap parameter N = f*n, where f is the fraction of observations used for the training of one tree, and the number of features per node m, the algorithm does the following for b = 1, ..., B:

- Selects randomly with replacement the set D of N vectors from the set S. The set D is called a bootstrap set.
- Trains a decision tree T_b on the set D using the parameter m.

Decision tree T is trained using the training set D of size N. Each node t in the tree corresponds to the subset D_t of the training set D, with the root node being D itself. Each internal node t represents a binary test (split) dividing its subset X_t into two subsets X_{t_L} and X_{t_R}, corresponding to its children t_L and t_R.
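
For illustration only (this is not the library's API; makeBootstrapSet is a hypothetical helper), the bootstrap step can be sketched in C++ as sampling N = f*n observation indices with replacement:

#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: draw a bootstrap set of size N = f * n by sampling
// observation indices with replacement, as in the training framework above.
std::vector<std::size_t> makeBootstrapSet(std::size_t n, double f, std::mt19937 &rng)
{
    const std::size_t N = static_cast<std::size_t>(f * n);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);

    std::vector<std::size_t> indices;
    indices.reserve(N);
    for (std::size_t i = 0; i < N; ++i)
        indices.push_back(pick(rng));   // sampling with replacement
    return indices;
}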

The metric for measuring the best split is called impurity, i(t). It generally reflects the homogeneity of responses within the subset D_t in the node t.

For the detailed definition of i(t) metrics, see the description of a specific algorithm.

Let the impurity decrease in the node t be

Δi(t) = i(t) - (|D_{t_L}| / |D_t|) i(t_L) - (|D_{t_R}| / |D_t|) i(t_R)
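
To make the definition concrete, the following sketch evaluates the impurity decrease of a candidate split, assuming the Gini index as an example impurity i(t) for classification (an illustration, not the library's implementation):

#include <cstddef>
#include <unordered_map>
#include <vector>

// Example impurity i(t): Gini index of the class labels in a node.
double giniImpurity(const std::vector<int> &labels)
{
    if (labels.empty()) return 0.0;
    std::unordered_map<int, std::size_t> counts;
    for (int y : labels) ++counts[y];

    double impurity = 1.0;
    for (const auto &kv : counts)
    {
        const double p = static_cast<double>(kv.second) / labels.size();
        impurity -= p * p;
    }
    return impurity;
}

// Impurity decrease of a split of node t into children t_L and t_R:
// Δi(t) = i(t) - (|D_tL|/|D_t|) i(t_L) - (|D_tR|/|D_t|) i(t_R)
double impurityDecrease(const std::vector<int> &parent,
                        const std::vector<int> &left,
                        const std::vector<int> &right)
{
    const double n = static_cast<double>(parent.size());
    return giniImpurity(parent)
         - (left.size()  / n) * giniImpurity(left)
         - (right.size() / n) * giniImpurity(right);
}
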
The library supports the following termination criteria of decision forest training:

- Maximal tree depth: node t is not processed if its depth in the tree reaches the predefined value.
- Minimal number of observations in the leaf node: node t is not processed if the number of observations |D_t| is smaller than the predefined value.
- Impurity threshold: node t is not processed if its impurity is smaller than the predefined threshold.

Decision tree follows the recursive procedure below, applied to each terminal node t (see the sketch after the list):

- Stop if the termination criteria are met.
- Choose randomly without replacement m feature indices J ⊂ {0, 1, ..., p-1}.
- For each j in J, find the best split s_{j,t} that partitions the subset D_t and maximizes the impurity decrease Δi(t).
- Take the best split s_t that maximizes the impurity decrease Δi among all s_{j,t} splits, split the node into t_L and t_R, and apply this procedure to them recursively.

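The recursion can be outlined as follows. This is a structural sketch only: the helpers chooseFeatureSubset, findBestSplit, partition, and impurity are hypothetical and left undefined, and the termination parameters stand in for the criteria listed above:

#include <cstddef>
#include <vector>

struct Node;

struct Split { std::size_t feature; double threshold; double decrease; };

struct Node
{
    std::vector<std::size_t> observations;  // indices forming D_t
    std::size_t depth = 0;
    Node *left = nullptr, *right = nullptr;
};

// Hypothetical helpers, assumed to exist for this sketch:
std::vector<std::size_t> chooseFeatureSubset(std::size_t p, std::size_t m); // m indices without replacement
Split findBestSplit(const Node &t, const std::vector<std::size_t> &features);
void partition(Node &t, const Split &s, Node &left, Node &right);
double impurity(const Node &t);

void growTree(Node &t, std::size_t p, std::size_t m,
              std::size_t maxDepth, std::size_t minObservations, double impurityThreshold)
{
    // Stop if a termination criterion is met.
    if (t.depth >= maxDepth || t.observations.size() < minObservations
        || impurity(t) < impurityThreshold)
        return;

    // Choose m feature indices without replacement and find the best split among them.
    const Split best = findBestSplit(t, chooseFeatureSubset(p, m));
    if (best.decrease <= 0.0) return;  // no split improves the impurity

    // Split D_t into D_tL and D_tR and recurse into the children.
    t.left = new Node; t.right = new Node;
    t.left->depth = t.right->depth = t.depth + 1;
    partition(t, best, *t.left, *t.right);
    growTree(*t.left,  p, m, maxDepth, minObservations, impurityThreshold);
    growTree(*t.right, p, m, maxDepth, minObservations, impurityThreshold);
}
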
To create a bootstrap set and choose feature indices in a performant way, the training algorithm requires a source of random numbers capable of producing sequences of random numbers in parallel. See Algorithms > Analysis > Engines for the description of the respective techniques.
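
As a simplified stand-in for such parallel streams (the library's engines use dedicated techniques such as leapfrog and skip-ahead, described under Engines; the per-tree seeding below is only an illustration), each tree can be given its own generator:

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Give every tree its own generator so bootstrap sets and feature subsets
// can be drawn in parallel. Seeding per tree is a simple stand-in for the
// engine parallelization techniques (family, leapfrog, skip-ahead).
std::vector<std::mt19937> makeTreeEngines(std::size_t nTrees, std::uint32_t seed)
{
    std::vector<std::mt19937> engines;
    engines.reserve(nTrees);
    for (std::size_t b = 0; b < nTrees; ++b)
        engines.emplace_back(seed + static_cast<std::uint32_t>(b)); // one stream per tree
    return engines;
}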

Initialization of the engine in the decision forest is based on the scheme below:

[Scheme: initialization of the engine for decision forest training]
The state of the engine is updated once the training of the decision forest model is completed. The library provides support to retrieve the instance of the engine with the updated state that can be used in other computations. The update of the state is engine-specific and depends on the parallelization technique used as defined earlier:

- Family: the updated state is the set of states that represent individual engines in the family.
- Leapfrog: the updated state is the state of the sequence with the rightmost position on the sequence.
- SkipAhead: the updated state is the state of the sequence with the rightmost position on the sequence.

Prediction Stage

Given a decision forest classifier and vectors x_1, ..., x_r, the problem is to calculate the responses for those vectors. To solve the problem for each given query vector x_i, the algorithm finds the leaf node in each tree of the forest that gives the response by that tree. The response of the forest is based on an aggregation of responses from all trees in the forest. For the detailed definition, see the description of a specific algorithm.
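
For example, for classification the aggregation may be a majority vote over the tree responses, as in this sketch (illustrative only; predictTree is a hypothetical per-tree evaluator):

#include <cstddef>
#include <unordered_map>
#include <vector>

struct Tree { /* trained tree structure omitted in this sketch */ };

// Hypothetical per-tree evaluator: follows splits from the root to a leaf
// and returns the leaf response for the query vector x.
int predictTree(const Tree &tree, const std::vector<double> &x);

// Forest response for classification: majority vote over the B tree responses.
int predictForest(const std::vector<Tree> &forest, const std::vector<double> &x)
{
    std::unordered_map<int, std::size_t> votes;
    for (const Tree &t : forest)
        ++votes[predictTree(t, x)];

    int response = 0;
    std::size_t best = 0;
    for (const auto &kv : votes)
        if (kv.second > best) { response = kv.first; best = kv.second; }
    return response;
}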

Additional Characteristics Calculated by the Decision Forest

Decision forests can produce additional characteristics, such as an estimate of generalization error and an importance measure (relative decisive power) of each of p features (variables).

Out-of-bag Error

The estimate of the generalization error based on the training data can be obtained and calculated as follows:

- For each tree T_b, the observations that were not included in its bootstrap set form the out-of-bag (OOB) set of that tree.
- For each vector x_i in the dataset X, predict its response using only the trees for which x_i belongs to the OOB set.
- Calculate the OOB error of the decision forest as the aggregated error of these predictions.

For the detailed definition, see the description of a specific algorithm.
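
As a sketch of the idea for regression (illustrative only; the mean squared error is assumed as the error measure, and predictTree and the per-tree bootstrap index sets are hypothetical):

#include <cstddef>
#include <set>
#include <vector>

struct Tree { /* trained tree structure omitted in this sketch */ };

// Hypothetical per-tree evaluator (regression response of one tree).
double predictTree(const Tree &tree, const std::vector<double> &x);

// OOB error: every observation is predicted only by the trees that did not
// see it during training, then the squared errors are averaged.
double oobError(const std::vector<Tree> &forest,
                const std::vector<std::set<std::size_t>> &bootstrapSets, // indices used by each tree
                const std::vector<std::vector<double>> &X,
                const std::vector<double> &Y)
{
    double sumSq = 0.0;
    std::size_t counted = 0;

    for (std::size_t i = 0; i < X.size(); ++i)
    {
        double sum = 0.0;
        std::size_t nOob = 0;
        for (std::size_t b = 0; b < forest.size(); ++b)
            if (bootstrapSets[b].count(i) == 0)   // x_i is out-of-bag for tree b
            {
                sum += predictTree(forest[b], X[i]);
                ++nOob;
            }
        if (nOob == 0) continue;                  // x_i was in every bootstrap set

        const double err = sum / nOob - Y[i];
        sumSq += err * err;
        ++counted;
    }
    return counted ? sumSq / counted : 0.0;
}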

Variable Importance

There are two main types of variable importance measures:

- Mean Decrease Impurity (MDI) importance: the sum of the weighted impurity decreases Δi(t) over all nodes t in which the given variable is used for splitting, averaged over all trees in the forest.
- Mean Decrease Accuracy (MDA) importance: the average increase in the OOB prediction error of a tree when the values of the given variable are randomly permuted in its OOB set; a sketch of this permutation scheme follows the list.
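
A minimal sketch of the MDA permutation scheme (illustrative only; treeError is a hypothetical helper that evaluates one tree's error on the given rows, and each tree is assumed to store its OOB row indices):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct Tree { std::vector<std::size_t> oob; /* tree structure omitted */ };

// Hypothetical helper: error of one tree on the given rows of (X, Y).
double treeError(const Tree &tree,
                 const std::vector<std::vector<double>> &X,
                 const std::vector<double> &Y,
                 const std::vector<std::size_t> &rows);

// MDA importance of feature j: how much each tree's OOB error grows when
// the values of feature j are randomly permuted, averaged over the trees.
double mdaImportance(const std::vector<Tree> &forest, std::size_t j,
                     std::vector<std::vector<double>> X,   // local copy: permuted in place
                     const std::vector<double> &Y,
                     std::mt19937 &rng)
{
    double total = 0.0;
    for (const Tree &t : forest)
    {
        const double base = treeError(t, X, Y, t.oob);

        // Permute the values of feature j across this tree's OOB rows.
        std::vector<double> original;
        for (std::size_t r : t.oob) original.push_back(X[r][j]);
        std::vector<double> permuted = original;
        std::shuffle(permuted.begin(), permuted.end(), rng);
        for (std::size_t k = 0; k < t.oob.size(); ++k) X[t.oob[k]][j] = permuted[k];

        total += treeError(t, X, Y, t.oob) - base;   // error increase from permutation

        // Restore the column before moving to the next tree.
        for (std::size_t k = 0; k < t.oob.size(); ++k) X[t.oob[k]][j] = original[k];
    }
    return forest.empty() ? 0.0 : total / forest.size();
}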