Developer Guide for Intel® Data Analytics Acceleration Library 2018

Details

Given the set $X = \{ x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np}) \}$ of $n$ $p$-dimensional feature vectors and a positive integer $k$, the problem is to find a set $C = \{ c_1, \ldots, c_k \}$ of $k$ $p$-dimensional vectors that minimize the objective function (overall error)

$$\Phi_X(C) = \sum_{i=1}^{n} d^2(x_i, C),$$

where $d(x_i, C)$ is the distance from $x_i$ to the closest center in $C$, such as the Euclidean distance.
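To make the objective concrete, here is a minimal NumPy sketch (an illustration, not Intel DAAL's API; the function name is hypothetical) that evaluates the overall error for a given set of centroids, assuming the Euclidean distance:

```python
import numpy as np

def kmeans_objective(X, C):
    """Overall error: the sum over all x_i of the squared Euclidean
    distance from x_i to the closest centroid in C.
    X is an (n, p) array of feature vectors; C is a (k, p) array of centroids."""
    # Pairwise squared distances, shape (n, k)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # For each x_i, take the squared distance to its closest center, then sum
    return d2.min(axis=1).sum()
```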

The vectors $c_1, \ldots, c_k$ are called centroids. To start computations, the algorithm requires initial values of the centroids.

Centroid Initialization

Centroid initialization can be done using one of these methods:

- Deterministic: choose the first $k$ feature vectors from the data set $X$.
- Random: choose $k$ feature vectors from the data set $X$ at random; a sketch of this method follows the list.
- K-Means++, which selects centers with probability proportional to their contribution to the overall error.
- Parallel K-Means++, a parallel variant of the K-Means++ method.
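For illustration, a minimal NumPy sketch of the random initialization method (not the library's API; the function name is hypothetical):

```python
import numpy as np

def init_random(X, k, seed=None):
    """Random initialization: pick k distinct feature vectors from X
    as the starting centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=k, replace=False)
    return X[idx].copy()
```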

Computation

Computation of the objective function includes computation of the Euclidean distance between vectors $\|x_j - m_i\|$. The algorithm uses the following modification of the Euclidean distance between feature vectors $a$ and $b$: $d(a, b) = d_1(a, b) + d_2(a, b)$, where $d_1$ is computed for continuous features as

$$d_1(a, b) = \sqrt{\sum_{i=1}^{p_1} (a_i - b_i)^2}$$

and $d_2$ is computed for binary categorical features as

$$d_2(a, b) = \gamma \sqrt{\sum_{i=1}^{p_2} (a_i - b_i)^2}.$$

In these equations, $\gamma$ weighs the impact of binary categorical features on the clustering, $p_1$ is the number of continuous features, and $p_2$ is the number of binary categorical features. Note that the algorithm does not support non-binary categorical features.
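For illustration, a NumPy sketch of this modified distance, assuming the formulas above and that the continuous and binary categorical parts of each feature vector are supplied as separate arrays (function and parameter names are hypothetical):

```python
import numpy as np

def modified_distance(a_cont, b_cont, a_cat, b_cat, gamma):
    """d(a, b) = d1(a, b) + d2(a, b): Euclidean distance over the
    continuous features plus the gamma-weighted Euclidean distance
    over the binary categorical features."""
    d1 = np.sqrt(np.sum((a_cont - b_cont) ** 2))        # continuous part
    d2 = gamma * np.sqrt(np.sum((a_cat - b_cat) ** 2))  # binary categorical part
    return d1 + d2
```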

The K-Means clustering algorithm computes centroids using Lloyd's method [Lloyd82]. For each feature vector $x_1, \ldots, x_n$, you can also compute the index of the cluster that contains the feature vector.
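The following sketch of Lloyd's method is an illustration under the same assumptions (NumPy arrays, squared Euclidean distance), not the library's implementation:

```python
import numpy as np

def lloyd_kmeans(X, C0, n_iterations=100):
    """Lloyd's method: alternate between assigning each feature vector
    to its closest centroid and recomputing each centroid as the mean
    of its assigned vectors. Returns the final centroids and the
    cluster index of each feature vector."""
    C = np.array(C0, dtype=float)
    for _ in range(n_iterations):
        # Assignment step: index of the closest centroid for each x_i
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        assignments = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        for j in range(C.shape[0]):
            members = X[assignments == j]
            if members.size:  # keep the old centroid if the cluster is empty
                C[j] = members.mean(axis=0)
    return C, assignments
```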