Quick question: I’m not a mathematician, but I’m trying to understand how UniFrac works. Does it actually use the branch lengths to determine how different the species are in the final calculation?

Thanks! Typo fixed.

Typos:

“that every reads is associated” -> “every read”

The assessment of significance is tricky, and there’s still some important work to be done there. Note that the randomization test (which seems to be very commonly used with UniFrac, and which we describe too) does not have wonderful properties in the setting of incomplete sampling with non-independent observations, as in metagenomic sampling. This problem arises whenever the identities of reads are not independent. Imagine, for example, that we have a random observation process on the tree equipped with some collection of “base observations.” Each draw from the process takes a random subset of those base observations and then throws down some number of reads for each observation in that set, where the number of reads has mean >> 1. If the set of base observations is large compared to the number of sample observations, then two draws will always appear significantly different even though they are from the same underlying process.
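To make the thought experiment concrete, here is a minimal simulation sketch in Python. It is an illustration under several assumptions that are not part of the discussion above: a unit line segment stands in for the tree, the distance is the one-dimensional earth mover's distance, every chosen base observation gets a fixed number of reads rather than a random number, and all function names and parameter values are hypothetical.

```python
import random


def draw_sample(base, n_obs, reads_per_obs, rng):
    """Pick n_obs base observations and throw down reads_per_obs reads on each."""
    chosen = rng.sample(base, n_obs)
    return [pos for pos in chosen for _ in range(reads_per_obs)]


def distance(a, b):
    """1-D earth mover's distance between two read sets of equal size."""
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def randomization_p_value(a, b, n_shuffles, rng):
    """Shuffle read labels and count shuffles at least as extreme as observed."""
    observed = distance(a, b)
    pooled = a + b
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        if distance(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


def rejection_rate(n_obs, reads_per_obs, n_pairs=20, n_shuffles=100, seed=0):
    """Fraction of same-process sample pairs that the test calls different."""
    rng = random.Random(seed)
    # Assumption: 1000 base observations at uniform positions on [0, 1].
    base = [rng.random() for _ in range(1000)]
    rejected = 0
    for _ in range(n_pairs):
        a = draw_sample(base, n_obs, reads_per_obs, rng)
        b = draw_sample(base, n_obs, reads_per_obs, rng)
        if randomization_p_value(a, b, n_shuffles, rng) < 0.05:
            rejected += 1
    return rejected / n_pairs
```

Comparing `rejection_rate(5, 50)` (few base observations, many reads each) with `rejection_rate(250, 1)` (one read per base observation) gives a rough sense of the effect: both settings draw every pair from the same process and use the same number of reads, so any gap in rejection rates comes from the non-independence of the reads alone.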

I didn’t mean to ramble on like this, but I hope someone takes it up. I would think that the folks doing modeling of these ecosystems should have the perspective to decide what we should call the “same” and “different”.
