Is it still true if we drop that assumption and minimize over all low-dimensional embeddings of the points? (I think it is not, and this corresponds to the idea that classical-scaling MDS/PCA optimizes the so-called “STRAIN” energy, but not the “SSTRESS” energy.)

Thanks.

Some minor comments which can improve even further (at least my) understanding:

1. In the pPCA interpretation, you write down the generative model

for the observed t. If \epsilon is additive Gaussian noise, then why is \psi needed? Why not simply write

t ~ N(\mu, W W^T + \sigma^2 I) ?

Also, you didn’t write a generative model for the hidden variables x.

Is x simply a multi-variate Gaussian with mean zero and unit covariance matrix? x ~ N(0, I) ?
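If that reading is right, the whole model is easy to simulate and check numerically. A minimal sketch (my own, with made-up dimensions and a made-up \sigma, assuming x ~ N(0, I) as asked above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 5, 2, 50000            # observed dim, latent dim, samples (illustrative)
W = rng.normal(size=(d, q))      # some loading matrix
mu = np.zeros(d)
sigma = 0.5

x = rng.normal(size=(n, q))                          # latent x ~ N(0, I_q)
t = x @ W.T + mu + sigma * rng.normal(size=(n, d))   # t = W x + mu + eps

# Marginally t ~ N(mu, W W^T + sigma^2 I); compare to the sample covariance
C_model = W @ W.T + sigma**2 * np.eye(d)
C_sample = np.cov(t, rowvar=False)
print(np.abs(C_model - C_sample).max())  # small for large n
```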

2. In Schoenberg’s theorem, a quantifier seems to be missing, so I’m not sure what the theorem says. I assume that the statement is:

‘.. if there exists s with s’1=1, such that .. is positive semidefinite’ and not ‘for all s such that s’1=1, we have … is positive semi-definite’. Is that correct?

3. You write that classic MDS is obtained when taking s to be a unit vector (where the ‘1’ in one of the coordinates corresponds to one of the data points). But doesn’t this involve choosing which unit vector to use? I think of MDS as symmetric with respect to the data points, and this choice of s treats one of them differently – or am I missing something, and the choice of s doesn’t matter?

4. You did not write about the computational complexity of computing PCA.

As far as I understand, computing the full singular-value decomposition requires O(n^3) computations, but if you are interested only in the few top eigenvectors and eigenvalues, there are faster (linear?) algorithms. Is it possible to elaborate/give pointers to that?
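For illustration, the top-k singular pairs can indeed be computed without the full decomposition, e.g. via Lanczos-type iteration, which costs roughly O(npk) per sweep for an n×p matrix instead of the cubic cost of the full SVD. A sketch with made-up sizes, using scipy’s `svds`:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 300))
X -= X.mean(axis=0)             # center the data, as for PCA

# Full SVD: roughly cubic in the smaller dimension
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Top-k only: iterative (Lanczos/ARPACK), much cheaper when k is small
k = 5
Uk, Sk, Vtk = svds(X, k=k)
Sk = Sk[::-1]                   # svds returns singular values ascending

print(np.allclose(Sk, S[:k]))  # the leading singular values agree
```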

Very nice post. I did not see you mention what I consider to be the simplest and most transparent probabilistic interpretation of PCA. That is, assume your n datapoints from R^p derive from a multivariate Gaussian in p dimensions. The maximum-likelihood estimate of this multivariate Gaussian has mean equal to the sample mean and covariance matrix equal to the sample covariance matrix (note: with factors 1/n, not 1/(n-1)).

PCA now simply consists in

1. Diagonalizing this covariance matrix, that is, finding the linear combinations of the original p variables that are statistically independent (in the ML estimate).

2. Approximating the original multivariate Gaussian by a lower-dimensional one that retains only the first k<p eigenvectors.

In other words, one gets the ML multivariate Gaussian from the data, rotates to a basis in which fluctuations along each axis are independent, and then retains only the axes with the highest variance. That's how I present it to my students.
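The two steps above can be sketched in a few lines (my own toy example, with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 500, 4, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # some correlated data

# ML estimate of the multivariate Gaussian: sample mean and 1/n covariance
mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / n          # note 1/n, not 1/(n-1)

# 1. Diagonalize: rotate to axes with independent fluctuations
evals, evecs = np.linalg.eigh(C)       # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]

# 2. Retain the k axes with the highest variance
scores = (X - mu) @ evecs[:, :k]       # the usual PCA projection

print(evals)  # variances along the principal axes, largest first
```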

Btw, although I must admit I have not studied the pPCA papers, it would seem to me that that interpretation is subsumed under the simple interpretation above (the final covariance matrix is simply the sum of the covariance matrix W W^T on the subspace plus the diagonal matrix \sigma^2 I).

Thanks for reading carefully! I’ve corrected the typos, including some others spotted by critical readers. Thanks to all of you.

1. In “Note that the covariance matrix is the matrix M’M,” I would replace “M’M” with “M’M/(n-1)”. Or perhaps just add “proportional to.”

2. I think this equation “\tilde{X} = min_{M \mbox{ of rank } k} ||X-M||_2” should be an “arg min” rather than a “min.” Also, the equation has rendered with a spurious “\hat{A}” after the “f”.
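For what it’s worth, that arg min is attained by the truncated SVD (Eckart–Young), which is easy to verify numerically on a toy matrix (sizes made up):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 6))
k = 2

U, S, Vt = np.linalg.svd(X, full_matrices=False)
Xk = U[:, :k] * S[:k] @ Vt[:k]         # best rank-k approximation of X

# Eckart-Young: the spectral-norm error equals the (k+1)-th singular value
err = np.linalg.norm(X - Xk, 2)
print(np.isclose(err, S[k]))           # the bound is attained exactly
```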

Thanks!
