You are currently browsing the monthly archive for April 2017.

During my third year of undergraduate study I took a course taught by David Gabai in which tests were negatively graded. This meant that points were subtracted for every incorrect statement made in a proof. As proofs could, in principle, contain an unbounded number of false statements, there was essentially no lower bound on the grade one could receive on a test (course grades was bounded below by “F”). Although painful on occasion, I grew to love the class and the transcendent lessons that it taught. These memories came flooding back this past week, when a colleague brought to my attention the paper A simple and fast algorithm for K-medoids clustering by Hae-Sang Park and Chi-Hyuck Jun.

The Park-Jun paper introduces a K-means like method for K-medoid clustering. K-medoid clustering is a clustering formulation based on a generalization of (a relative of) the median rather than the mean, and is useful for the same robustness reasons that make the median preferable to the mean. The medoid is defined as follows: given n points $x_1,\ldots,x_n$ in $\mathbb{R}^d$, the medoid is a point $x$ among them with the property that it minimizes the average distance to the other points, i.e. $x \in \{x_1,\ldots,x_n\}$ minimizes $\sum_{i=1}^n ||x-x_i||$. In the case of $d=1$, when n is odd, this corresponds to the median (the direct generalization of the median is to find a point $x$ not necessarily among the $x_i$ minimizing the average distance to the other points, and this is called the geometric median).

The K-medoid problem is to partition the points into k disjoint subsets $S_1,\ldots,S_n$ so that if $m_1,\ldots,m_k$ are the respective medoids of the subsets (clusters) then the average distance of each medoid to the points of the cluster it belongs to  is minimized. In other words, the K-medoids problem is to find

$argmin_{S=\{S_1,\ldots,S_k\}} \sum_{j=1}^k \sum_{i \in S_k} ||m_j-x_i||$.

For comparison, K-means clustering is the problem of finding

$argmin_{S=\{S_1,\ldots,S_k\}} \sum_{j=1}^k \sum_{i \in S_k} ||\mu_j - x_i||^2$,

where the $\mu_j$ are the centroids of the points in each partition $S_i$. The K-means and K-medoids problems are instances of partitional clustering formulations that are NP-hard.

The most widely used approach for finding a (hopefully good) solution to the K-medoids problem has been a greedy clustering algorithm called PAM by Kaufman and Rousseeuw (partitioning around medoids). To understand the Park-Jun contribution it is helpful to briefly review PAM first.  The method works as follows:

1. Initialize by selecting k distinguished points from the set of points.
2. Identify, for each point, the distinguished point it is closest to. This identification partitions the points into sets and a “cost” can be associated to the partition, namely the sum of the distances from each point to its associated distinguished point (note that the distinguished points may not yet be the medoids of their respective partitions).
3. For each distinguished point d, and each non-distingsuished point repeat the assignment and cost calculation of step  2 with and x swapped so that x becomes distinguished and returns to undistinguished status. Choose the swap that minimizes the total cost. If no swap reduces the cost then output the partition and the distinguished points.
4. Repeat step 3 until termination.

It’s easy to see that when the PAM algorithm terminates the distinguished points must be medoids for the partitions associated to them.

The running time of PAM is $O(k(n-k)^2)$. This is because in step 3, for each of the k distinguished points, there are n-k swaps to consider for a total of $k(n-k)$ swaps, and the cost computation for each swap requires n-k assignments and additions. Thus, PAM is quadratic in the number of points.

To speed-up the PAM method, Park and Jun introduced in their paper an analogy of the K-means algorithm, with mean replaced by median:

1. Initialize by selecting k distinguished points from the set of points.
2. Identify, for each point, the distinguished point it is closest to. This identification partitions the points into sets in the same way as the PAM algorithm.
3. Compute the medoid for each partition. Repeat step 2 until the cost no longer decreases.

Park-Jun claim that their method has run time complexity “$O(nk)$ which is equivalent to K-means clustering”. Yet the pseudocode in the paper begins with the clearly quadratic requirement “calculate the distance between every pair of all objects..”

Step 3 of the algorithm is also quadratic. The medoid of a set of m points is computed in time $O(m^2)$. Given m points the medoid must be one of them, so it suffices to compute, for each point, the sum of the distances to the others ($m \times m$ additions) and a medoid is then identified by taking the minimum. Furthermore, without assumptions on the structure of the distance matrix there cannot be a faster algorithm (with the triangle inequality the medoid can be computed in $O(n^{\frac{3}{2}}$).

Quadratic functions are not linear. If they were, it would mean that, with $a,b \neq 0$$ax^2=bx \mbox{ for all } x$. If that were the case then

$ax^2=bx$

$\Rightarrow ax^2-bx=0$ for all x.

Assuming that a is positive and plugging in $x=\frac{b-\sqrt{b^2+4a}}{2a}$ one would obtain that $ax^2-bx=1$ and it would follow that

$0=1$.

When reading the paper, it was upon noticing this “result” that negative grading came to mind. With a proof that 0=1 we are at, say, a score of -10 for the paper. Turning to the discussion of Fig. 3 of the paper we read that “…the proposed method takes about constant time near zero regardless of the number of objects.”

I suppose a generous reading of this statement is that it is an allusion to Taylor expansion of the polynomial $f(x)=x^2$ around $x=0$. A less generous interpretation is that the figure is deliberately misleading, intended to show linear growth with slope close to zero for what is a quadratic function. I decided to be generous and grade this a -5, leading to a running total of -15.

It seems the authors may have truly believed that their algorithm was linear because in the abstract they begin with “This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm”. It seems possible that the authors thought they had bypassed what they viewed as the underlying cause for quadratic running time of the PAM algorithm, namely the quadratic swap. The K-means algorithm (Lloyd’s algorithm) is indeed linear, because the computation of the mean of a set of points is linear in the number of points (the values for each coordinate can be averaged separately). However the computation of a medoid is not the same as computation of the mean. -5, now a running total of20. The actual running time of the Park-Jun algorithm is shown below:

Replicates on different instances are crucial, because absent from running time complexity calculations are the stopping times which are data dependent. A replicate of Figure 3 is shown below (using the parameters in Table 5 and implementations in MATLAB). The Park-Jun method is called as “small” in MATLAB.

Interestingly, the MATLAB implementation of PAM has been sped up considerably. Also, the “constant time near zero” behavior described by Park and Jun is clearly no such thing. For this lack of replication of Figure 3 another deduction of 5 points for a total of -25.

There is not much more to the Park and Jun paper, but unfortunately there is some. Table 6 shows a comparison of K-means, PAM and the Park-Jun method with true clusters based on a simulation:

The Rand index is a measure of concordance between clusters, and the higher the Rand index the greater the concordance. It seems that the Park-Jun method is slightly better than PAM, which are both superior to K-means in the experiment. However the performance of K-means is a lot better in this experiment than suggested by Figure 2 which was clearly (rotten) cherry picked (Fig2b):

For this I deduct yet another 5 points for a total of -30.

Amazingly, the Park and Jun paper has been cited 479 times. Many of the citations propagate the false claims in the paper, e.g. in referring to Park and Jun, Zadegan et al. write “in each iteration the new set of medoids is selected with running time O(N), where N is the number of objects in the dataset”.  The question is, how many negative points is the repetition of false claims from a paper that is filled with false claims worth?