Developer Guide for Intel® Data Analytics Acceleration Library 2019 Update 2

Details

Given the set X = {x_1 = (x_11, …, x_1p), …, x_n = (x_n1, …, x_np)} of n p-dimensional feature vectors and a positive integer k, the problem is to find a set C = {c_1, …, c_k} of k p-dimensional vectors that minimize the objective function (overall error)

$$\Phi_X(C) = \sum_{i=1}^{n} d^2(x_i, C),$$

where d^2(x_i, C) is the distance from x_i to the closest center in C, such as the Euclidean distance.

The vectors c_1, …, c_k are called centroids. To start computations, the algorithm requires initial values of the centroids.
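The overall error defined above can be made concrete with a small sketch in plain Python (illustrative only; the names `d2_closest` and `goal_function` are chosen here for the example and are not part of the DAAL API):

```python
def d2_closest(x, centroids):
    """Squared Euclidean distance from x to its closest center in C."""
    return min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids)

def goal_function(X, centroids):
    """Overall error: the sum over all feature vectors of the squared
    distance to the nearest centroid."""
    return sum(d2_closest(x, centroids) for x in X)

# Two well-separated 2-D points with centroids placed exactly on them
# give an overall error of 0.
X = [(0.0, 0.0), (10.0, 10.0)]
C = [(0.0, 0.0), (10.0, 10.0)]
print(goal_function(X, C))  # 0.0
```

Minimizing this quantity over all choices of C is what the clustering algorithm does iteratively.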

Centroid Initialization

Centroid initialization can be done using one of the following methods:

- deterministic choice of the first k feature vectors as initial centroids
- random choice of k feature vectors as initial centroids
- k-means++ initialization
- parallel k-means++ initialization
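As an illustration, two simple ways to obtain initial centroids (taking the first k feature vectors deterministically, or sampling k of them at random) might be sketched in plain Python as follows; the function names are assumptions for this example, not DAAL API:

```python
import random

def init_first_k(X, k):
    # Deterministic: use the first k feature vectors as initial centroids.
    return [list(x) for x in X[:k]]

def init_random(X, k, seed=0):
    # Random: sample k distinct feature vectors as initial centroids.
    rng = random.Random(seed)
    return [list(x) for x in rng.sample(X, k)]

X = [(0.0,), (1.0,), (2.0,), (3.0,)]
print(init_first_k(X, 2))  # [[0.0], [1.0]]
```

The choice of initialization can strongly affect both the final overall error and the number of iterations needed to converge.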

Computation

Computation of the goal function includes computation of the Euclidean distance between vectors ||x_j − m_i||. The algorithm uses the following modification of the Euclidean distance between feature vectors a and b: d(a, b) = d_1(a, b) + d_2(a, b), where d_1 is computed for continuous features as

$$d_1(a, b) = \sqrt{\sum_{i=1}^{p_1} (a_i - b_i)^2}$$

and d_2 is computed for binary categorical features as

$$d_2(a, b) = \gamma \sqrt{\sum_{i=1}^{p_2} (a_i - b_i)^2}$$

In these equations, γ weighs the impact of binary categorical features on the clustering, p_1 is the number of continuous features, and p_2 is the number of binary categorical features. Note that the algorithm does not support non-binary categorical features.
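A minimal sketch of this modified distance, assuming for illustration that the continuous features occupy the first p_1 positions of each vector and the binary categorical features occupy the remaining p_2 positions (the function name and feature layout are assumptions, not the DAAL API):

```python
import math

def modified_distance(a, b, p1, gamma):
    # d(a, b) = d1(a, b) + d2(a, b): Euclidean distance over the first p1
    # (continuous) features plus a gamma-weighted Euclidean distance over
    # the remaining (binary categorical) features.
    d1 = math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(p1)))
    d2 = gamma * math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(p1, len(a))))
    return d1 + d2

# One continuous feature differing by 1 and one binary feature differing
# by 1, with gamma = 0.5: d = 1.0 + 0.5 = 1.5
print(modified_distance((1.0, 0.0, 1), (0.0, 0.0, 0), p1=2, gamma=0.5))  # 1.5
```

Setting γ = 0 reduces d to the ordinary Euclidean distance over the continuous features alone.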

The K-Means clustering algorithm computes centroids using Lloyd's method [Lloyd82]. For each feature vector x_1, …, x_n, you can also compute the index of the cluster that contains the feature vector.
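A compact sketch of Lloyd's method (plain Python, illustrative rather than the DAAL implementation) alternates an assignment step and a centroid-update step, and also returns the cluster index for each feature vector:

```python
def lloyd_kmeans(X, centroids, n_iter=10):
    """Lloyd's method: repeatedly assign each vector to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    k, p = len(centroids), len(X[0])
    assignments = [0] * len(X)
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for each vector.
        for j, x in enumerate(X):
            assignments[j] = min(
                range(k),
                key=lambda i: sum((x[d] - centroids[i][d]) ** 2 for d in range(p)),
            )
        # Update step: move each centroid to the mean of its cluster.
        for i in range(k):
            members = [X[j] for j in range(len(X)) if assignments[j] == i]
            if members:  # clusters left empty are handled separately
                centroids[i] = [sum(m[d] for m in members) / len(members)
                                for d in range(p)]
    return centroids, assignments

X = [(0.0,), (1.0,), (9.0,), (10.0,)]
C, labels = lloyd_kmeans(X, [[0.0], [10.0]])
print(C, labels)  # [[0.5], [9.5]] [0, 0, 1, 1]
```

Each iteration never increases the overall goal function, which is why the method converges to a (local) minimum.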

If no vectors are assigned to some cluster on a particular iteration, that iteration produces an empty cluster. This may occur because of bad centroid initialization or the structure of the dataset. In this case, the algorithm uses the following strategy to replace the empty cluster centers and decrease the value of the overall goal function: