The Habsburg rulership of Spain ended with an inbreeding coefficient of F=0.254. The last king, Charles II (1661-1700), suffered an unenviable life. He was unable to chew. His tongue was so large he could not speak clearly, and he constantly drooled. Sadly, his mouth was the least of his problems. He suffered seizures, had intellectual disabilities, and was frequently vomiting. He was also impotent and infertile, which meant that even his death was a curse in that his lack of heirs led to a war.

None of these problems prevented him from being married (twice). His first wife, princess Marie Louise of Orléans, daughter of Henrietta of England, died at age 26 after becoming deeply depressed, having been married to the man for a decade. Only a year later, he married another princess, 23-year-old Maria Anna of Neuburg. To put it mildly, his wives did not end up living the charmed life of Disney princesses, nor were they presumably smitten by young Charles II, who apparently aged prematurely and looked the part of his horrific homozygosity. The princesses married Charles II because they were forced to. Royals organized marriages to protect and expand their power, money and influence. Coupled with this were primogeniture rules, which ensured that the sons of kings, their own flesh and blood and therefore presumably the best-suited to be in power, would indeed have the opportunity to succeed their fathers. The family tree of Charles II shows how this worked in Spain:

It is believed that the inbreeding in Charles II’s family led to two genetic disorders, combined pituitary hormone deficiency and distal renal tubular acidosis, that explained many of his physical and mental problems. In other words, genetic diversity is important, and the point of this blog post is to highlight the fact that diversity is important in education as well.

The problem of inbreeding in academia has been studied previously, albeit to a limited extent. One interesting article is Navel Gazing: Academic Inbreeding and Scientific Productivity by Horta *et al.*, published in 2010 (my own experience with an inbred academic from a department where 39% of the faculty are self-hires anecdotally confirms the claims made in the paper). But here I focus on the downsides of inbreeding of *ideas* rather than of faculty. For example, home-schooling, the educational equivalent of primogeniture, can be fantastic if the parents happen to be good teachers, but can fail miserably if they are not. One thing that is guaranteed in a school or university setting is that learning happens by exposure to many teachers (different faculty, students, tutors, the internet, etc.). Students frequently complain when there is high variance in teaching quality, but one thing such variance ensures is that it is very unlikely that any student is exposed *only* to *bad* teachers. Diversity in teaching also helps to foster the development of new ideas. Different teachers, by virtue of insight or error, will occasionally “mutate” ideas or concepts for better or for worse. In other words, one does not have to fully embrace the theory of memes to acknowledge that there are benefits to variance in teaching styles, methods and pedagogy. Conversely, there is danger in homogeneity.

This brings me to MOOCs. One of the great things about MOOCs is that they reach millions of people. Udacity claims it has 1.6 million “users” (students?). Coursera claims 7.1 million. These companies are greatly expanding the accessibility of education. Starving children in India can now take courses in mathematical methods for quantitative finance, and for the first time in history, a president of the United States can discreetly take a freshman course on economics together with its high school algebra prerequisites (highly recommended). But when I am asked whether I would be interested in offering a MOOC I hesitate, paralyzed at the thought that any error I make would immediately be embedded in the brains of millions of innocent victims. My concern is this: MOOCs can greatly reduce the variance in education. For example, Coursera currently offers 641 courses, which means that each course is or has been taught to over 11,000 students on average. Many college courses may have fewer than a few dozen students, and even large college courses rarely have more than a few hundred students. This means that on average, through MOOCs, individual professors reach many more (2 orders of magnitude!) students. A great lecture can end up positively impacting a large number of individuals, but at the same time, a MOOC can be a vehicle for infecting the brains of millions of people with nonsense. If that nonsense is then propagated and reaffirmed via the interactions of the people who have learned it from the same source, then the inbreeding of ideas has occurred.

I mention MOOCs because I was recently thinking about intuition behind Bessel’s correction replacing *n* with *n-1* in the formula for sample variance. Formally, Bessel’s correction replaces the biased formula

$$s_n^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$$

for estimating the variance of a random variable from samples $x_1,\ldots,x_n$ with

$$s_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2.$$
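The bias is easy to see in a small simulation. The following is only a sketch under stated assumptions: the samples are drawn from a standard normal distribution (true variance 1), and the sample size, number of trials, seed and the helper name `average_estimates` are arbitrary choices made for illustration.

```python
import random
import statistics

def average_estimates(n=5, trials=20000, seed=0):
    """Average both variance estimators over many samples of size n.

    Assumption for this sketch: samples come from a standard normal
    distribution, so the true variance is 1.
    """
    rng = random.Random(seed)
    biased_total = 0.0
    corrected_total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        biased_total += statistics.pvariance(xs)    # divides by n
        corrected_total += statistics.variance(xs)  # divides by n - 1
    return biased_total / trials, corrected_total / trials

biased_avg, corrected_avg = average_estimates()
print(f"divide by n:     {biased_avg:.3f}")
print(f"divide by n - 1: {corrected_avg:.3f}")
```

With *n = 5* the first average should land near (n-1)/n = 0.8 and the second near 1, which is the underestimate the correction removes.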

The switch from *n* to *n-1* is a bit mysterious and surprising, and in introductory statistics classes it is frequently just presented as a “fact”. When an explanation is provided, it is usually in the form of algebraic manipulation that establishes the result. The issue came up as a result of a blog post I’m writing about principal components analysis (PCA), and I thought I would check for an intuitive explanation online. I googled “intuition sample variance” and the top link was a MOOC from the Khan Academy:

The video has over 51,000 views with over 100 “likes” and only 6 “dislikes”. Unfortunately, in this case, popularity is not a good proxy for quality. Despite the title promising “review” and “intuition” for “why we divide by *n-1* for the unbiased sample variance” there is no specific reason given why *n* is replaced by *n-1* (as opposed to another correction). Furthermore, the intuition provided has to do with the fact that $\sum_{i=1}^n (x_i-\bar{x})^2$ underestimates $\sum_{i=1}^n (x_i-\mu)^2$ (where $\mu$ is the mean of the random variable and $\bar{x}$ is the sample mean) but the explanation is confusing and not quantitative (which it can easily be). In fact, the Wikipedia page for Bessel’s correction provides three different mathematical explanations for the correction together with the intuition that motivates them, but it is difficult to find with Google unless one knows that the correction is called “Bessel’s correction”.

Wikipedia is also not perfect, and this example is a good one for why teaching by humans is important. Among the three alternative derivations, I think that one stands out as “better” but one would not know by just looking at the Wikipedia page. Specifically, I refer to “Alternate 1” on the Wikipedia page, which essentially explains that **variance can be rewritten as a double sum corresponding to the average squared distance between points, and the diagonal terms of the sum are zero in expectation.** An explanation of why this fact leads to the *n-1* in the unbiased estimator is as follows:

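The first step is to notice that the variance of a random variable is equal to half of the expected squared difference of two independent identically distributed random variables of that type. Specifically, the definition of variance is:

$$\operatorname{var}(X) = \mathbb{E}[(X-\mu)^2]$$

where $\mu = \mathbb{E}[X]$. Equivalently, $\operatorname{var}(X) = \mathbb{E}[X^2] - \mu^2$. Now suppose that *Y* is another random variable identically distributed to *X* and with *X*, *Y* independent. Then $\operatorname{var}(X) = \frac{1}{2}\mathbb{E}[(X-Y)^2]$. This is easy to see by using the fact that

$$\mathbb{E}[(X-Y)^2] = \mathbb{E}[X^2] + \mathbb{E}[Y^2] - 2\mathbb{E}[X]\mathbb{E}[Y] = 2\mathbb{E}[X^2] - 2\mu^2.$$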
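The claim that the variance is half of the expected squared difference of two independent copies can be checked numerically. This is a minimal sketch, assuming an exponential distribution with rate 1 (whose variance is 1) as the test case; the helper name `half_mean_squared_difference`, the number of trials and the seed are illustrative choices.

```python
import random

def half_mean_squared_difference(draw, trials=200000, seed=1):
    """Monte Carlo estimate of E[(X - Y)^2] / 2 for independent X, Y.

    `draw` is any function that takes a random.Random instance and
    returns one sample from the distribution of interest.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = draw(rng)  # one draw of X
        y = draw(rng)  # an independent draw of Y, same distribution
        total += (x - y) ** 2
    return 0.5 * total / trials

# Assumed test case: exponential distribution with rate 1, variance 1.
estimate = half_mean_squared_difference(lambda rng: rng.expovariate(1.0))
print(f"{estimate:.3f}")
```

The printed value should be close to 1. The exponential distribution is deliberately asymmetric here, since the identity does not depend on the shape of the distribution.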

This identity motivates a rewriting of the (uncorrected) sample variance in a way that is computationally less efficient, but mathematically more insightful:

$$s_n^2 = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \frac{(x_i-x_j)^2}{2}.$$
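For readers who want the missing algebra, here is a short check that the double sum really is the uncorrected sample variance. It writes $\bar{x}$ for the sample mean and $s_n^2$ for the estimate that divides by *n*, and uses nothing beyond expanding the square:

```latex
\begin{aligned}
\sum_{i=1}^n \sum_{j=1}^n (x_i - x_j)^2
  &= \sum_{i=1}^n \sum_{j=1}^n \left( x_i^2 + x_j^2 - 2 x_i x_j \right) \\
  &= 2n \sum_{i=1}^n x_i^2 - 2 \left( \sum_{i=1}^n x_i \right)^2 \\
  &= 2n \left( \sum_{i=1}^n x_i^2 - n \bar{x}^2 \right) \\
  &= 2n \sum_{i=1}^n (x_i - \bar{x})^2 = 2 n^2 s_n^2 .
\end{aligned}
```

Dividing both sides by $2n^2$ gives the rewriting of the uncorrected sample variance as an average of half squared differences over all $n^2$ ordered pairs.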

Of note is that in this summation exactly *n* of the terms are zero, namely the terms when *i=j*. These terms are zero independently of the original distribution, and remain so *in expectation* thereby biasing the estimate of the variance, specifically leading to an underestimate. Removing them fixes the estimate and produces

$$s_{n-1}^2 = \frac{1}{n(n-1)}\sum_{i \neq j} \frac{(x_i-x_j)^2}{2}.$$

It is easy to see that this is indeed Bessel’s correction. In other words, the correction boils down to the fact that $n^2 - n = n(n-1)$, hence the appearance of *n-1*.
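The two double sums can be compared directly against the textbook formulas. The sketch below uses a hypothetical helper, `pairwise_variance`, and an arbitrary small data set; `statistics.pvariance` and `statistics.variance` from the Python standard library divide by *n* and *n-1* respectively.

```python
import statistics

def pairwise_variance(xs, include_diagonal):
    """Average of (x_i - x_j)^2 / 2 over ordered pairs (i, j).

    include_diagonal=True averages over all n^2 pairs (uncorrected form).
    include_diagonal=False drops the n pairs with i == j, leaving
    n(n - 1) terms (the Bessel-corrected form).
    """
    n = len(xs)
    total = 0.0
    count = 0
    for i in range(n):
        for j in range(n):
            if i == j and not include_diagonal:
                continue  # skip the terms that are always zero
            total += (xs[i] - xs[j]) ** 2 / 2
            count += 1
    return total / count

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # arbitrary example data
print(pairwise_variance(data, include_diagonal=True), statistics.pvariance(data))
print(pairwise_variance(data, include_diagonal=False), statistics.variance(data))
```

Each printed pair should agree up to floating-point rounding. The only difference between the two calls is whether the *n* zero terms are counted in the average.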

Why do I like this particular derivation of Bessel’s correction? There are two reasons: first, *n-1* emerges naturally and obviously from the derivation. The denominator $n(n-1)$ in the double-sum form of $s_{n-1}^2$ matches exactly the number of terms being summed, so that it can be understood as a true average (this is not apparent in its standard form as $\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2$). There is really nothing mysterious anymore; it’s just that some terms have been omitted from the sum because they were non-informative. Second, as I will show in my forthcoming blog post on PCA, the fact that the variance of a random variable is half of the expectation of the squared difference of two instances is key to understanding the connection between multi-dimensional scaling (MDS) and PCA. In other words, as my student Nicolas Bray is fond of saying, although most people think a proof is either right or wrong, in fact some proofs are more right than others. The connection between Bessel’s correction and PCA goes even deeper: as explained by Saville and Wood in their book Statistical Methods: A Geometric Approach, *n-1* can be understood to be a reduction in one *dimension* from the point of view of probabilistic PCA (Saville and Wood do not explicitly use the term probabilistic PCA but as I will explain in my PCA post it is implicit in their book). Finally, there are many subtleties to Bessel’s correction; for example, it yields an unbiased estimator for *variance* but not for *standard deviation*. These issues ought to be mentioned in a good lecture about the topic. In other words, the Khan lecture is neither necessary nor sufficient, but unlike a standard lecture where the damage is limited to a small audience of students, it has been viewed more than 50,000 times and those views cannot be unviewed.
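The subtlety about standard deviation can also be seen in simulation. This is a sketch under the same kind of assumptions as any such check: samples of size 5 from a standard normal distribution (true variance and standard deviation both 1), with the helper name `average_corrected_estimates`, the number of trials and the seed chosen only for illustration.

```python
import random
import statistics

def average_corrected_estimates(n=5, trials=20000, seed=2):
    """Average the Bessel-corrected variance and its square root.

    Assumption for this sketch: samples come from a standard normal
    distribution, so the true variance and standard deviation are 1.
    """
    rng = random.Random(seed)
    var_total = 0.0
    sd_total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        var_total += statistics.variance(xs)  # divides by n - 1
        sd_total += statistics.stdev(xs)      # square root of the above
    return var_total / trials, sd_total / trials

var_avg, sd_avg = average_corrected_estimates()
print(f"average corrected variance:           {var_avg:.3f}")
print(f"average corrected standard deviation: {sd_avg:.3f}")
```

The variance average should sit near 1 while the standard deviation average falls noticeably below 1, because taking a square root does not commute with taking an expectation.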

In writing this blog post I pondered the irony of my call for added diversity in teaching while I preach my own idea (this post) to a large number of readers via a medium designed for maximal outreach. I can only ask that others blog as well to offer alternative points of view 🙂 and that readers inform themselves on the issues I raise by fact-checking elsewhere. As far as the statistics goes, if someone finds the post confusing, they should go and register for one of the many fantastic MOOCs on statistics! But I reiterate that in the rush to MOOCdom, providers must offer diversity in their offerings (even multiple lectures on the same topic) to ensure a healthy population of memes. This is especially true in Spain, where already inbred faculty are now inbreeding what they teach by MOOCing via Miriada X. Half of the MOOCs being offered in Spain originate from just 3 universities, while the number of potential viewers is enormous as Spanish is now the second most spoken language in the world (thanks to Charles II’s great-great-grandfather, Charles I).

May Charles II rest in peace.

## 13 comments


May 25, 2014 at 11:17 am

Jesse Hoff

this is a very interesting post and raises some important questions about Moocs that i find quite intriguing as a population geneticist. However, i’m sad to say that as much as I appreciate the points you make about that derivation of bessel’s correction (whose name i also did not know), and am much better informed by it than the khan academy video it still doesn’t reach the level of “intuition”, and you didn’t put it in a youtube video. I would say that this judgment is based on the fact that he used a bunch of pictures of number lines and has a nice narrative approach. You had a very well written blog post with latex.

So while Charles’s mom or dad had genetically far more suitable suitors, the power of the most expressible, obvious choice wins out.

Also, i’d just like to throw out that inbreeding is a biological not statistical phenomenon. it is a statistical tendency to be sure. But it can be avoided with the right breeding scheme that results in a very high F and high fitness.

May 25, 2014 at 3:06 pm

Lior Pachter

Points well taken. You are right about the value of video, and that is one of many aspects of MOOCs that I think have been really positive. They have developed technology that has made it much easier to put video online. One of my favorites is http://www.numberphile.com

I’ll think about how to make a video on Bessel’s correction and maybe I’ll make it sometime (or cheat and give a talk where someone else records it). Thanks for the impetus!

May 25, 2014 at 11:18 am

Jesse Hoff

sorry by narrative approach i meant, narrating style.

May 26, 2014 at 11:03 am

David desJardins

I don’t think the MOOC argument really holds up. Sure, the potential “harm” from a bad lecture that is viewed 50,000 times is worse than that from a bad lecture with an audience of 50. But the “gain” from a good lecture that is viewed 50,000 times is more than that from a good lecture with an audience of 50. The real question is whether the average quality of lectures that are distributed online to large audiences is likely to be higher or lower than those that are performed individually. And the answer to that seems pretty clear.

The argument would be more cogent as an argument against textbooks. One book can crowd out other competitors and create a homogeneity of instruction. We can all point to examples where the seminal or dominant textbook in a particular field just isn’t very good, yet its existence makes other textbooks less likely to be written, so that is an example of harm. But still, few would argue that the world would be better off without any textbooks.

May 26, 2014 at 11:27 am

Lior Pachter

I agree with your point about textbooks, except for your claim that the analogy of my argument is that the world would be better off without textbooks. I never said that I think the world would be better off without MOOCs. To the contrary, I even linked to statistics MOOCs by Coursera many of which I think are excellent and recommend to my students. What I am saying in the textbook analogy is that the world would be better off with publishers that promote new authors and lower costs. Indeed, there are such publishers and publishing models and my hope is that organizations running MOOCs will, in the same vein, advocate for choice and provide meaningful assessment, so that popularity is not the proxy for quality. The point of my post was to highlight the dangers if that does not turn out to be the case, and to emphasize that when it comes to lectures there cannot be a single “good” optimal way. Diversity is key, and learning different points of view about subjects is important. I do believe the dangers with MOOCs bringing homogeneity to teaching are real, hence my worry about inbreeding of memes. For one thing, MOOCs have the potential to reach global audiences in a way that textbooks never could.

May 26, 2014 at 12:23 pm

David desJardins

I am just dubious about any substantial concern over MOOC homogeneity, because most students today are taught by relatively bad instructors from textbooks, and there’s probably more homogeneity in what they get now than what would emerge from a few high-quality online courses. There’s very little economy of scale from offering just one online course as opposed to two or three (as opposed to the huge economies of scale in replacing thousands and thousands of courses to two or three), and there are still going to be low barriers to entry, and so it doesn’t seem to me that extreme lack of diversity is a likely outcome. “Half of the MOOCs being offered in Spain originate from just 3 universities” doesn’t seem to me to be a particularly high level of concentration or one that would have any real adverse consequences. Because even within a single university some offerings often compete and overlap with others, and the three universities you mention compete with each other much more online than they do offline, and even if they didn’t compete effectively they still only have half of the market. Why do you say that textbooks don’t reach global audiences in the same way that MOOCs do? How are most of the globe learning these subjects, if not from classes based on textbooks?

May 26, 2014 at 4:01 pm

Lior Pachter

My comment on textbooks is based on the fact that a search for calculus books on Amazon reveals more than 23,000 entries, and in my personal experience books used vary greatly by geographic region (e.g. different countries certainly use different books). I think the situation is similar in other subjects, so even though you have a point that there can be homogenization with textbooks, and it can be a problem, I think the diversity is far greater than (currently) with MOOCs. Also, I doubt many students read their textbooks 🙂 and many faculty diverge from their books. Of course MOOCs are fairly new so there is no reason offerings can’t expand but I do think the bar for creating a MOOC is much higher (at least that is the feedback I’ve heard from my colleagues who have tried, and who tell me the $$ and time investments are substantial). Also currently very few companies/universities control MOOC offerings (much smaller than the number of textbook publishers).

But I guess we’ll see whether homogeneity emerges or not. For now I stand by my concern that there is a danger to be aware of. And while I agree with you that the status quo is imperfect and there is much room for improvement, I strongly disagree that most students today are taught by “relatively bad instructors from textbooks”. The vast majority of faculty and lecturers I’ve encountered in my career, in departments ranging from mathematics to biology, and in many different universities in many countries, care a lot about their teaching and students, and I think by many measures are very good at what they do.

May 26, 2014 at 5:05 pm

David desJardins

As a professor at one of the leading research universities in the world, I think your colleagues are a highly biased sample of all of the teaching of calculus (for example) that occurs throughout the world. Most people don’t have access to highly talented and capable teachers who have the desire, resources, and capacities needed to teach well. I think if you could observe a truly random sample of all of the world’s instruction you would be pretty depressed at the low quality of it. The 99% of students who aren’t lucky enough to attend a school like UCB are not being served that well by the present system.

The number of “real” textbook publishers is also very few, and less every year (at least in the US and the English-speaking developed world). There may be a lot of calculus textbooks in theory, but I think the distribution of how often they are used follows a power law with a pretty high exponent.

May 26, 2014 at 11:00 pm

Olle Häggström

Thanks, Lior, for a very nice post!

I think, however, that your boldfaced observation that “if two samples are selected independently from a single real-valued (continuous) random variable the probability that they are identical is zero” is irrelevant (and the idea that it might be relevant here deserves to infect as few minds as possible… 🙂 ) The important thing about the double-sum rewrite of s^2_n is not that “in this summation exactly n of the terms are zero”, but rather that the diagonal terms have EXPECTATION zero, as opposed to the off-diagonal terms.

May 26, 2014 at 11:10 pm

Lior Pachter

You’re correct that I was sloppy. This happened partly because I wanted to avoid defining bias. In any case, I have corrected the text, which is one of the advantages of a blog as opposed to a lecture… infections can be healed 🙂

June 2, 2014 at 2:25 am

Erik van Nimwegen

Dear Lior,

speaking of intellectual inbreeding, your post made me realize that the ‘estimators should be unbiased’ meme is apparently still successfully infecting many in the community. Although I realize the topic at hand is the various ways of proving why the Bessel correction removes the bias from the estimator of the variance, the post does seem to take for granted that it is a good idea to perform this ‘correction’ when estimating the variance. That’s a pity. Because, actually, it is not.

As noted in the wikipedia article on the Bessel correction that you link to (see the section `caveats’), by the orthodox statistical optimality measure of minimizing the mean squared-error, the Bessel correction is never optimal. For example, for samples from a Gaussian the estimate with minimal mean squared error has a ‘correction’ n/(n+1). That is, even more bias!

A more relevant measure in my opinion is a decision theory analysis that identifies which estimate of the variance minimises expected loss under a given loss function. For example, if the samples are drawn from a Gaussian with unknown mean and variance, and we use a quadratic loss function, then the estimate that minimises expected loss is given by the Bayesian posterior mean. With a scale prior 1/v for the variance v (and uniform prior over the mean) this will give a ‘correction’ of n/(n-3) instead of n/(n-1). If one uses absolute error as a loss function then the optimal estimator is the Bayesian posterior median (which has no elegant expression, i.e. it is given in terms of the inverse of an incomplete gamma function). If the loss function is such that you only care about getting the answer exactly right, then the maximal posterior value is the optimal estimator (which also has n/(n+1) as the correction factor). In short, depending on your loss function you get different optimal estimators that differ by terms proportional to 1/n. But the Bessel correction never gives the optimal estimator.

Generally, I am not aware of any rigorous theoretical basis for demanding that an estimator should be unbiased, whereas there are many theoretical arguments against this demand (not invariant under parameter transformations, sometimes they do not mathematically exist, sometimes the only unbiased estimators give absurd estimates, and so on). Moreover, this piece of orthodox statistical folklore leads some researchers to do really silly things like estimating a percentage to be negative (I commented on this in a previous post). I think the general fact that estimator bias is a red herring is something quite important for students to be aware of.

January 31, 2015 at 6:52 am

Félix Balazard

I agree with Erik. Bias is defined as one term of the bias-variance decomposition of the mean squared error. Considering an unbiased estimator does not in general lead to reduced mean squared error as the variance can also increase. In the case of Bessel’s correction, the variance does increase and it does so by more than twice the reduction in bias.

Besides, the use of terms that have a lay understanding for technical quantities is very dangerous. Having a biased estimator is not nearly as bad as it sounds. It is common that a biased estimator will have lower mean squared error than an unbiased one. But people using statistics while not having been warned about this will prefer having an unbiased estimator even if it worsens their prediction.

January 31, 2015 at 8:25 am

Lior Pachter

I completely agree.