Developer Guide for Intel® Data Analytics Acceleration Library 2018

Details

Given a set X = { x_1 = (x_11, ..., x_1p), ..., x_n = (x_n1, ..., x_np) } of n p-dimensional feature vectors and n responses Y = { y_1, ..., y_n }, the problem is to build a decision forest classification or regression model.

Training Stage

The library uses the following algorithmic framework for the training stage. Let S = (X, Y) be the set of observations. Given positive integer parameters such as the number of trees B, the bootstrap parameter N = f*n, where f is the fraction of observations used for the training of one tree, and the number of features per node m, the algorithm does the following for b = 1, ..., B:

- Selects randomly with replacement the set D_b of N vectors from the set S. The set D_b is called a bootstrap set.
- Trains a decision tree T_b on D_b using the parameter m for each node.

Decision tree T is trained using the training set D of size N. Each node t in the tree corresponds to the subset D_t of the training set D, with the root node being D itself. Its internal nodes t represent a binary test (split) dividing their subset X_t in two subsets, X_tL and X_tR, corresponding to their children t_L and t_R.
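
As an illustration of this framework (a minimal sketch, not the DAAL API), the following Python code draws a bootstrap set of N = f*n observations for each of the B trees; the tree learner train_decision_tree and its handling of the per-node feature count m are hypothetical placeholders.

import numpy as np

def train_forest(X, y, n_trees, f, m, train_decision_tree, seed=0):
    # For b = 1, ..., B: draw a bootstrap set of N = f*n observations
    # with replacement and fit one tree on it.
    # train_decision_tree(X, y, m) is a hypothetical tree learner that
    # takes the number of features per node, m.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    N = int(f * n)                                 # bootstrap parameter N = f*n
    forest = []
    for b in range(n_trees):                       # b = 1, ..., B
        idx = rng.choice(n, size=N, replace=True)  # sample with replacement
        forest.append(train_decision_tree(X[idx], y[idx], m))
    return forest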

The metric for measuring the best split is called impurity, i(t). It generally reflects the homogeneity of responses within the subset D_t in the node t.

For the detailed definition of i(t) metrics, see the description of a specific algorithm.

Let the impurity decrease in the node t be

Δi(t) = i(t) − |D_tL|/|D_t| · i(t_L) − |D_tR|/|D_t| · i(t_R),

where |D_t|, |D_tL|, and |D_tR| are the numbers of observations in the node t and its children t_L and t_R, respectively.
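
For example, taking Gini impurity as one possible choice of i(t) (the metric actually used is defined by the specific algorithm), the decrease can be computed as in this minimal Python sketch:

import numpy as np

def gini(labels):
    # Gini impurity, one possible choice of i(t) for classification.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(y_t, y_left, y_right):
    # Delta i(t) = i(t) - |D_tL|/|D_t| * i(t_L) - |D_tR|/|D_t| * i(t_R)
    n = len(y_t)
    return (gini(y_t)
            - len(y_left) / n * gini(y_left)
            - len(y_right) / n * gini(y_right))

# A split separating two pure classes gives the maximal decrease:
y = np.array([0, 0, 0, 1, 1, 1])
print(impurity_decrease(y, y[:3], y[3:]))  # 0.5
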
The library supports the following termination criteria of decision forest training:

- Maximal tree depth: node splitting stops once the tree reaches the specified depth.
- Minimal number of observations in a leaf node: a node is not split if the resulting children would contain fewer observations than the specified minimum.
- Impurity threshold: a node is not split if its impurity is below the specified threshold.

Decision tree growth follows the recursive procedure below, applied in each terminal node t (a minimal sketch of a single step appears after this list):

- Stop if the termination criteria are met.
- Choose randomly without replacement m feature indices J ⊆ {1, ..., p}.
- For each j ∈ J, find the best split s_j,t that partitions the subset D_t and maximizes the impurity decrease Δi(t).
- Take the best split s_t among all s_j,t, split the node, and apply the procedure recursively to t_L and t_R.
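
A minimal Python sketch of one such step, with a hypothetical impurity callable standing in for i(t), might look as follows:

import numpy as np

def best_split(X_t, y_t, m, impurity, rng):
    # Draw m feature indices without replacement and return the split
    # (feature j, threshold s) with the largest impurity decrease.
    n, p = X_t.shape
    features = rng.choice(p, size=min(m, p), replace=False)
    i_t = impurity(y_t)
    best_j, best_s, best_delta = None, None, 0.0
    for j in features:
        for s in np.unique(X_t[:, j]):             # candidate thresholds
            left = X_t[:, j] <= s
            if not left.any() or left.all():       # skip degenerate splits
                continue
            delta = (i_t
                     - left.mean() * impurity(y_t[left])
                     - (~left).mean() * impurity(y_t[~left]))
            if delta > best_delta:
                best_j, best_s, best_delta = j, s, delta
    return best_j, best_s, best_delta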

Prediction Stage

Given a decision forest classifier and query vectors x_1, ..., x_r, the problem is to calculate the responses for those vectors. To solve the problem for each given query vector x_i, the algorithm finds the leaf node in each tree of the forest that x_i falls into; that leaf gives the response of the tree. The response of the forest is based on an aggregation of the responses from all trees in the forest. For the detailed definition, see the description of a specific algorithm.
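
As a sketch of this aggregation (majority vote for classification, averaging for regression are common choices; the actual rule is defined by the specific algorithm), assuming each tree exposes a hypothetical predict method:

import numpy as np

def predict_forest(forest, X_query, classification=True):
    # Each tree maps a query vector to a leaf and returns that leaf's
    # response; the forest aggregates the per-tree responses.
    responses = np.stack([tree.predict(X_query) for tree in forest])  # (B, r)
    if classification:
        return np.array([np.bincount(col.astype(int)).argmax()
                         for col in responses.T])  # majority vote per vector
    return responses.mean(axis=0)                  # regression: average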

Additional Characteristics Calculated by the Decision Forest

Decision forests can produce additional characteristics, such as an estimate of the generalization error and an importance measure (relative decisive power) of each of the p features (variables).

Out-of-bag Error

The estimate of the generalization error based on the training data can be obtained as follows (a sketch appears after this list):

- For each tree T_b, the vectors that were not included in its bootstrap set form the out-of-bag (OOB) set of that tree.
- For each vector x_i in the dataset X, predict its response using only the trees that have x_i in their OOB set.
- Aggregate the error of these predictions over all such vectors to obtain the out-of-bag error of the forest.
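
A minimal Python sketch of this estimate for classification, assuming hypothetical per-tree predict methods and the bootstrap index sets recorded during training:

import numpy as np

def oob_error(forest, bootstrap_indices, X, y):
    # Predict each x_i using only the trees whose bootstrap set did not
    # contain i, then average the misclassification rate.
    errors, counted = 0, 0
    for i in range(len(y)):
        votes = [int(tree.predict(X[i:i + 1])[0])
                 for tree, idx in zip(forest, bootstrap_indices)
                 if i not in idx]                 # x_i is OOB for this tree
        if votes:                                 # skip vectors never OOB
            y_hat = np.bincount(votes).argmax()
            errors += int(y_hat != y[i])
            counted += 1
    return errors / counted if counted else float("nan")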

For the detailed definition, see the description of a specific algorithm.

Variable Importance

There are two main types of variable importance measures (a sketch of the second follows this list):

- Mean Decrease Impurity (MDI) importance: the sum of the impurity decreases Δi(t) over all nodes in which the variable is used for splitting, averaged over all trees in the forest.
- Mean Decrease Accuracy (MDA) importance, also known as permutation importance: the average decrease in the model's accuracy when the values of the variable are randomly permuted.
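
A minimal Python sketch of the permutation (MDA) measure, assuming a fitted model's predict function; the number of repetitions and the accuracy metric are illustrative choices, not the exact computation the library performs:

import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    # Permute one feature at a time and measure how much the model's
    # accuracy drops; a large drop means the feature carries signal.
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])             # destroy feature j's signal
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances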