The Habsburg rulership of Spain ended with an inbreeding coefficient of F=0.254. The last king, Charles II (1661-1700), suffered an unenviable life. He was unable to chew. His tongue was so large he could not speak clearly, and he constantly drooled. Sadly, his mouth was the least of his problems. He suffered seizures, had intellectual disabilities, and was frequently vomiting. He was also impotent and infertile, which meant that even his death was a curse in that his lack of heirs led to a war.

None of these problems prevented him from being married (twice). His first wife, Marie Louise of Orléans (a daughter of princess Henrietta of England), died at age 26 after becoming deeply depressed, having been married to the man for a decade. Only a year later, he married another princess, 23-year-old Maria Anna of Neuburg. To put it mildly, his wives did not end up living the charmed life of Disney princesses, nor were they presumably smitten by young Charles II, who apparently aged prematurely and looked the part of his horrific homozygosity. The princesses married Charles II because they were forced to. Royals organized marriages to protect and expand their power, money and influence. Coupled with this were primogeniture rules which ensured that the sons of kings, their own flesh and blood and therefore presumably the best-suited to be in power, would indeed have the opportunity to succeed their fathers. The family tree of Charles II shows how this worked in Spain:

[Figure: family tree of Charles II]

It is believed that the inbreeding in Charles II’s family led to two genetic disorders, combined pituitary hormone deficiency and distal renal tubular acidosis, that explained many of his physical and mental problems. In other words, genetic diversity is important, and the point of this blog post is to highlight the fact that diversity is important in education as well.

The problem of inbreeding in academia has been studied previously, albeit to a limited extent. One interesting article is Navel Gazing: Academic Inbreeding and Scientific Productivity by Horta et al., published in 2010 (my own experience with an inbred academic from a department where 39% of the faculty are self-hires anecdotally confirms the claims made in the paper). But here I focus on the downsides of inbreeding of ideas rather than of faculty. For example, home-schooling, the educational equivalent of primogeniture, can be fantastic if the parents happen to be good teachers, but can fail miserably if they are not. One thing that is guaranteed in a school or university setting is that learning happens by exposure to many teachers (different faculty, students, tutors, the internet, etc.). Students frequently complain when there is high variance in teaching quality, but one thing such variance ensures is that it is very unlikely that any student is exposed only to bad teachers. Diversity in teaching also helps to foster the development of new ideas. Different teachers, by virtue of insight or error, will occasionally “mutate” ideas or concepts for better or for worse. In other words, one does not have to fully embrace the theory of memes to acknowledge that there are benefits to variance in teaching styles, methods and pedagogy. Conversely, there is danger in homogeneity.

This brings me to MOOCs. One of the great things about MOOCs is that they reach millions of people. Udacity claims it has 1.6 million “users” (students?). Coursera claims 7.1 million. These companies are greatly expanding the accessibility of education. Starving children in India can now take courses in mathematical methods for quantitative finance, and for the first time in history, a president of the United States can discreetly take a freshman course on economics together with its high school algebra prerequisites (highly recommended). But when I am asked whether I would be interested in offering a MOOC I hesitate, paralyzed at the thought that any error I make would immediately be embedded in the brains of millions of innocent victims. My concern is this: MOOCs can greatly reduce the variance in education. For example, Coursera currently offers 641 courses, which means that on average each course is or has been taught to over 11,000 students. Many college courses may have fewer than a few dozen students, and even large college courses rarely have more than a few hundred students. This means that on average, through MOOCs, individual professors reach many more (2 orders of magnitude!) students. A great lecture can end up positively impacting a large number of individuals, but at the same time, a MOOC can be a vehicle for infecting the brains of millions of people with nonsense. If that nonsense is then propagated and reaffirmed via the interactions of the people who have learned it from the same source, then the inbreeding of ideas has occurred.

I mention MOOCs because I was recently thinking about the intuition behind Bessel’s correction, which replaces n with n-1 in the formula for sample variance. Formally, Bessel’s correction replaces the biased formula

s^2_n = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2

for estimating the variance of a random variable from samples x_1,\ldots,x_n with

s^2_{n-1} = \frac{1}{n-1} \sum_{i=1}^n (x_i-\overline{x})^2.
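To make the word “biased” concrete, here is a minimal sketch that averages both estimators exactly over every equally likely sample of size n. The fair six-sided die, the choice n = 3, and the function names are my own illustrative assumptions, not part of any library.

```python
from fractions import Fraction
from itertools import product

def biased_var(xs):
    """Sample variance dividing by n (the formula s^2_n)."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    return sum((x - mean) ** 2 for x in xs) / n

def bessel_var(xs):
    """Sample variance dividing by n - 1 (the formula s^2_{n-1})."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def expected_value(estimator, faces, n):
    """Exact expectation of an estimator over all equally likely samples of size n."""
    samples = list(product(faces, repeat=n))
    return sum(estimator(s) for s in samples) / len(samples)

faces = range(1, 7)  # a fair die (illustrative assumption); its true variance is 35/12
n = 3

print(expected_value(biased_var, faces, n))  # 35/18, i.e. (n-1)/n times the true variance
print(expected_value(bessel_var, faces, n))  # 35/12, the true variance
```

Exact fractions are used instead of random sampling so that the result is not blurred by simulation noise.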

The switch from n to n-1 is a bit mysterious and surprising, and in introductory statistics classes it is frequently just presented as a “fact”. When an explanation is provided, it is usually in the form of algebraic manipulation that establishes the result. The issue came up as a result of a blog post I’m writing about principal components analysis (PCA), and I thought I would check for an intuitive explanation online. I googled “intuition sample variance” and the top link was a MOOC from the Khan Academy:

The video has over 51,000 views with over 100 “likes” and only 6 “dislikes”. Unfortunately, in this case, popularity is not a good proxy for quality. Despite the title promising “review” and “intuition” for “why we divide by n-1 for the unbiased sample variance” there is no specific reason given why n is replaced by n-1 (as opposed to another correction). Furthermore, the intuition provided has to do with the fact that x_i-\overline{x} underestimates x_i-\mu (where \mu is the mean of the random variable and \overline{x} is the sample mean) but the explanation is confusing and not quantitative (which it can easily be). In fact, the Wikipedia page for Bessel’s correction provides three different mathematical explanations for the correction together with the intuition that motivates them, but it is difficult to find with Google unless one knows that the correction is called “Bessel’s correction”.

Wikipedia is also not perfect, and this example is a good one for why teaching by humans is important. Among the three alternative derivations, I think that one stands out as “better” but one would not know by just looking at the Wikipedia page. Specifically, I refer to “Alternate 1” on the Wikipedia page, which essentially explains that variance can be rewritten as a double sum corresponding to the average squared distance between points, and that the diagonal terms of the sum are zero in expectation. An explanation of why this fact leads to the n-1 in the unbiased estimator is as follows:

The first step is to notice that the variance of a random variable is equal to half of the expected squared difference of two independent identically distributed random variables of that type. Specifically, the definition of variance is:

var(X) = \mathbb{E}(X - \mu)^2 where \mu = \mathbb{E}(X). Equivalently, var(X) = \mathbb{E}(X^2) -\mu^2. Now suppose that Y is another random variable identically distributed to X and with X,Y independent. Then \mathbb{E}(X-Y)^2 = 2 var(X). This is easy to see by using the fact that

\mathbb{E}(X-Y)^2 = \mathbb{E}(X^2) + \mathbb{E}(Y^2) - 2\mathbb{E}(X)\mathbb{E}(Y) = 2\mathbb{E}(X^2)-2\mu^2.
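The identity is easy to check exactly on a small example. The sketch below uses a three-point distribution that I made up purely for illustration; independence enters through the joint probability being the product of the two marginals.

```python
from fractions import Fraction

# A small discrete distribution chosen only for illustration: value -> probability.
dist = {0: Fraction(1, 2), 1: Fraction(1, 3), 4: Fraction(1, 6)}

# Mean and variance straight from the definitions.
mu = sum(p * x for x, p in dist.items())
var = sum(p * (x - mu) ** 2 for x, p in dist.items())

# X and Y are independent, so P(X = x, Y = y) = P(X = x) * P(Y = y).
expected_sq_diff = sum(
    px * py * (x - y) ** 2
    for x, px in dist.items()
    for y, py in dist.items()
)

print(var)               # 2
print(expected_sq_diff)  # 4, which is 2 * var
```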

This identity motivates a rewriting of the (uncorrected) sample variance s^2_n in a way that is computationally less efficient, but mathematically more insightful:

s^2_n = \frac{1}{2n^2} \sum_{i,j=1}^n (x_i-x_j)^2.
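For readers who want the intermediate step, here is one way to see that the double sum agrees with the usual form of the uncorrected sample variance. It is a sketch of the algebra using only the fact that the deviations x_i-\overline{x} sum to zero:

```latex
\begin{align*}
\sum_{i,j=1}^n (x_i-x_j)^2
  &= \sum_{i,j=1}^n \left( (x_i-\overline{x}) - (x_j-\overline{x}) \right)^2 \\
  &= 2n \sum_{i=1}^n (x_i-\overline{x})^2 - 2 \left( \sum_{i=1}^n (x_i-\overline{x}) \right)^2 \\
  &= 2n \sum_{i=1}^n (x_i-\overline{x})^2 .
\end{align*}
```

Dividing both sides by 2n^2 gives \frac{1}{n} \sum_{i=1}^n (x_i-\overline{x})^2, the uncorrected sample variance.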

Of note is that in this summation exactly n of the terms are zero, namely the terms when i=j. These terms are zero independently of the original distribution, and remain so in expectation thereby biasing the estimate of the variance, specifically leading to an underestimate. Removing them fixes the estimate and produces

s_{n-1}^2 = \frac{1}{2n(n-1)} \sum_{i,j=1, i \neq j}^n (x_i-x_j)^2.

It is easy to see that this is indeed Bessel’s correction. In other words, the correction boils down to the fact that n^2-n = n(n-1), hence the appearance of n-1.
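A quick numerical check of both pairwise formulas against Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n-1. The data values and the two function names are arbitrary choices of mine for illustration.

```python
import math
import statistics

def pairwise_var_all(xs):
    """Average of (x_i - x_j)^2 / 2 over all n^2 ordered pairs, diagonal included."""
    n = len(xs)
    total = sum((xs[i] - xs[j]) ** 2 for i in range(n) for j in range(n))
    return total / (2 * n * n)

def pairwise_var_offdiag(xs):
    """Average of (x_i - x_j)^2 / 2 over the n(n-1) ordered pairs with i != j."""
    n = len(xs)
    total = sum((xs[i] - xs[j]) ** 2 for i in range(n) for j in range(n) if i != j)
    return total / (2 * n * (n - 1))

data = [2.0, 3.5, 3.5, 7.0, 11.0]  # arbitrary numbers, including a repeated value

# Diagonal included: matches the uncorrected (divide by n) variance.
print(math.isclose(pairwise_var_all(data), statistics.pvariance(data)))     # True
# Diagonal removed: matches Bessel's correction (divide by n - 1).
print(math.isclose(pairwise_var_offdiag(data), statistics.variance(data)))  # True
```

The pairs are indexed by position rather than by value, so repeated data values with i != j are still counted; only the n terms with i = j are dropped.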

Why do I like this particular derivation of Bessel’s correction? There are two reasons: first, n-1 emerges naturally and obviously from the derivation. The denominator in s_{n-1}^2 matches exactly the number of terms being summed, so that it can be understood as a true average (this is not apparent in its standard form as s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\overline{x})^2). There is really nothing mysterious anymore; it’s just that some terms have been omitted from the sum because they were non-informative. Second, as I will show in my forthcoming blog post on PCA, the fact that the variance of a random variable is half of the expectation of the squared difference of two instances is key to understanding the connection between multi-dimensional scaling (MDS) and PCA. In other words, as my student Nicolas Bray is fond of saying, although most people think a proof is either right or wrong, in fact some proofs are more right than others. The connection between Bessel’s correction and PCA goes even deeper: as explained by Saville and Wood in their book Statistical Methods: A Geometric Approach, n-1 can be understood to be a reduction in one dimension from the point of view of probabilistic PCA (Saville and Wood do not explicitly use the term probabilistic PCA but as I will explain in my PCA post it is implicit in their book). Finally, there are many subtleties to Bessel’s correction, for example it is an unbiased estimator for variance and not standard deviation. These issues ought to be mentioned in a good lecture about the topic. In other words, the Khan lecture is neither necessary nor sufficient, but unlike a standard lecture where the damage is limited to a small audience of students, it has been viewed more than 50,000 times and those views cannot be unviewed.
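The subtlety about standard deviation mentioned above can be seen in the smallest possible example. This sketch assumes a fair coin taking values 0 and 1 and samples of size 2 (my choices, for illustration only), and enumerates all four equally likely samples.

```python
import math
from itertools import product

def bessel_var(xs):
    """Sample variance with Bessel's correction (divide by n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

faces = (0, 1)  # fair coin (illustrative): true variance 0.25, true standard deviation 0.5
n = 2
samples = list(product(faces, repeat=n))  # all 4 equally likely samples

mean_var = sum(bessel_var(s) for s in samples) / len(samples)
mean_sd = sum(math.sqrt(bessel_var(s)) for s in samples) / len(samples)

print(mean_var)  # 0.25: the corrected variance is unbiased
print(mean_sd)   # about 0.354, below the true standard deviation of 0.5
```

Taking a square root is not a linear operation, so it does not commute with taking expectations, which is why the unbiasedness of the variance estimate does not carry over.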

In writing this blog post I pondered the irony of my call for added diversity in teaching while I preach my own idea (this post) to a large number of readers via a medium designed for maximal outreach. I can only ask that others blog as well to offer alternative points of view :) and that readers inform themselves on the issues I raise by fact-checking elsewhere. As far as the statistics goes, if someone finds the post confusing, they should go and register for one of the many fantastic MOOCs on statistics! But I reiterate that in the rush to MOOCdom, providers must offer diversity in their offerings (even multiple lectures on the same topic) to ensure a healthy population of memes. This is especially true in Spain, where already inbred faculty are now inbreeding what they teach by MOOCing via Miriada X. Half of the MOOCs being offered in Spain originate from just 3 universities, while the number of potential viewers is enormous as Spanish is now the second most spoken language in the world (thanks to Charles II’s great-great-grandfather, Charles I).

May Charles II rest in peace.