Given the set $X = \{ x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np}) \}$ of $n$ $p$-dimensional feature vectors and a positive integer $k$, the problem is to find a set $C = \{ c_1, \ldots, c_k \}$ of $k$ $p$-dimensional vectors that minimize the objective function (overall error)

$$\Phi_X(C) = \sum_{x_i \in X} d^2(x_i, C),$$

where $d^2(x_i, C)$ is the squared distance from $x_i$ to the closest center in $C$.
The vectors $c_1, \ldots, c_k$ are called centroids. To start computations, the algorithm requires initial values of the centroids.
Centroid initialization can be done with one of the following methods:

- Choice of the first $k$ feature vectors from the data set $X$.
- Random choice of $k$ feature vectors from the data set using the following simple random sampling draw-by-draw algorithm. The algorithm:
  1. Chooses one of the feature vectors $x_i$ from $X$ with equal probability.
  2. Excludes $x_i$ from $X$ and adds it to the current set of centers.
  3. Resumes from step 1 until the set of centers reaches the desired size $k$.
- K-Means++ algorithm [Arthur2007], which selects centers with probability proportional to their contribution to the overall error $\Phi_X(C)$ according to the following scheme (see the sketch after this list):
  1. Chooses one of the feature vectors $x_i$ from $X$ with equal probability.
  2. Excludes $x_i$ from $X$ and adds it to the current set of centers $C$.
  3. For each feature vector $x_i$ in $X$ calculates its minimal distance $d(x_i, C)$ from the current set of centers $C$.
  4. Chooses one of the feature vectors $x_i$ from $X$ with the probability $\dfrac{d^2(x_i, C)}{\Phi_X(C)}$.
  5. Resumes from step 2 until the set of centers $C$ reaches the desired size $k$.
- Parallel K-Means++ algorithm [Bahmani2012], which does the following (see the second sketch after this list):
  1. Chooses one of the feature vectors $x_i$ from $X$ with equal probability.
  2. Excludes $x_i$ from $X$ and adds it to the current set of centers $C$.
  3. Repeats nRounds times:
     a. For each feature vector $x_i$ from $X$ calculates its minimal distance $d(x_i, C)$ from the current set of centers $C$.
     b. Chooses $L = \text{oversamplingFactor} \cdot k$ feature vectors $x_i$ from $X$ with the probability $\dfrac{L \cdot d^2(x_i, C)}{\Phi_X(C)}$.
     c. Excludes the $x_i$ vectors chosen in the previous step from $X$ and adds them to the current set of centers $C$.
  4. For each $c_i \in C$ sets $w_i$ to the rating of $c_i$: the number of points in $X$ closer to $c_i$ than to any other point in $C$.
  5. Applies the K-Means++ algorithm with weights $w_i$ to the points in $C$, which means that the following probability is used in step 4 of that algorithm: $\dfrac{w_i \, d^2(x_i, C)}{\sum_t w_t \, d^2(x_t, C)}$.
  The algorithm parameters define the number of candidates $L$ selected in each round and the number of rounds:
  - Choose oversamplingFactor to make $L = O(k)$.
  - Choose nRounds as $O(\log \Phi_X(C))$, where $\Phi_X(C)$ is the estimate of the objective function when the first center is chosen. [Bahmani2012] recommends setting nRounds to a constant value not greater than 8.
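For illustration, the K-Means++ selection scheme above can be sketched in NumPy as follows. This is an illustrative sketch, not the library's implementation; the function name, array layout, and the optional weights argument (used for the weighted variant in the last step of parallel K-Means++) are assumptions.

```python
import numpy as np

def kmeans_plus_plus_init(X, k, weights=None, rng=None):
    """Pick k initial centers from X (an n x p array) with the K-Means++ scheme.

    Optional per-point weights w_i give the weighted variant applied to the
    candidate set in the final step of parallel K-Means++.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)

    # Step 1: choose the first center with equal probability
    # (proportional to w_i in the weighted variant).
    centers = [X[rng.choice(n, p=w / w.sum())]]

    while len(centers) < k:
        # Step 3: minimal squared distance d^2(x_i, C) to the current centers.
        d2 = np.min(
            ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1),
            axis=1,
        )
        # Step 4: sample the next center with probability
        # w_i * d^2(x_i, C) / sum_t w_t * d^2(x_t, C).
        # Points already chosen have d2 == 0 and are never re-drawn,
        # which matches excluding them from X in step 2.
        p = w * d2
        centers.append(X[rng.choice(n, p=p / p.sum())])

    return np.asarray(centers)
```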
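The parallel K-Means++ rounds can be sketched in the same style. Here each point is sampled independently with probability $\min(1, L \cdot d^2(x_i, C)/\Phi_X(C))$, which yields about $L$ candidates per round in expectation; this reading of the selection step, and all names, are assumptions of the sketch. It reuses kmeans_plus_plus_init from the previous sketch for the final weighted reduction.

```python
import numpy as np

def parallel_kmeans_plus_plus_init(X, k, oversampling_factor=2.0,
                                   n_rounds=5, rng=None):
    """Sketch of parallel K-Means++ initialization [Bahmani2012]."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    L = oversampling_factor * k  # candidates targeted per round

    # Steps 1-2: start C with one point chosen with equal probability.
    centers = [X[rng.choice(n)]]

    for _ in range(n_rounds):
        # Step 3a: minimal squared distance to the current candidate set C.
        d2 = np.min(
            ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1),
            axis=1,
        )
        phi = d2.sum()
        # Steps 3b-3c: each point joins C independently with probability
        # min(1, L * d^2(x_i, C) / Phi_X(C)).
        chosen = rng.random(n) < np.minimum(1.0, L * d2 / phi)
        centers.extend(X[chosen])

    centers = np.asarray(centers)
    # Step 4: rate each candidate by the number of points closest to it.
    d2_all = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    nearest = np.argmin(d2_all, axis=1)
    w = np.bincount(nearest, minlength=len(centers)).astype(float)
    # Step 5: reduce the candidates to k centers with weighted K-Means++
    # (assumes the rounds produced at least k candidates).
    return kmeans_plus_plus_init(centers, k, weights=w, rng=rng)
```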
Computation of the objective function includes computation of the Euclidean distance between vectors $\|x_j - m_i\|$. The algorithm uses the following modification of the Euclidean distance between feature vectors $a$ and $b$: $d(a, b) = d_1(a, b) + d_2(a, b)$, where $d_1$ is computed for continuous features as

$$d_1(a, b) = \sum_{l=1}^{p_1} (a_l - b_l)^2$$

and $d_2$ is computed for binary categorical features as

$$d_2(a, b) = \gamma \sum_{l=1}^{p_2} (a_l - b_l)^2.$$
In these equations, $\gamma$ weighs the impact of binary categorical features on the clustering, $p_1$ is the number of continuous features, and $p_2$ is the number of binary categorical features. Note that the algorithm does not support non-binary categorical features.
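As a sketch, the modified distance could be computed as follows, assuming the caller supplies the index sets of the continuous and binary categorical columns; all names here are hypothetical.

```python
import numpy as np

def modified_distance(a, b, continuous_idx, binary_idx, gamma=1.0):
    """d(a, b) = d1(a, b) + d2(a, b) with gamma weighting d2.

    d1 sums squared differences over the p1 continuous features;
    d2 sums squared differences over the p2 binary categorical
    features, weighted by gamma.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d1 = ((a[continuous_idx] - b[continuous_idx]) ** 2).sum()
    d2 = gamma * ((a[binary_idx] - b[binary_idx]) ** 2).sum()
    return d1 + d2
```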
The K-Means clustering algorithm computes centroids using Lloyd's method [Lloyd82]. For each feature vector $x_1, \ldots, x_n$, you can also compute the index of the cluster that contains it.
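Below is a minimal sketch of Lloyd's method under the squared Euclidean distance, returning both the refined centroids and the per-point cluster indices mentioned above; the function name and the simplified convergence handling are assumptions of the sketch.

```python
import numpy as np

def lloyd_kmeans(X, centers, max_iter=100, tol=1e-6):
    """Refine initial centers with Lloyd's method; also return assignments."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for each x_i.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(d2, axis=1)
        # Update step: move each centroid to the mean of its cluster;
        # keep a centroid in place if its cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels

# Usage: initialize with K-Means++, then refine with Lloyd's method.
# X = np.random.default_rng(0).random((200, 3))
# C0 = kmeans_plus_plus_init(X, k=4)
# C, labels = lloyd_kmeans(X, C0)
```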