On Saturday I submitted the final grades for Math 10A, the new UC Berkeley freshman math class for intended biology majors that I taught this semester. In assigning students their grades, I had a chance to reflect again on the system we use and its substantial shortcomings.

The system is broken, and my grade assignment procedure illustrates why. Math 10A had 223 students this semester, and they were graded according to the following policy: homework 10%, quizzes 10%, two midterms at 20% each, and the final 40%. If midterm 1 was skipped, then midterm 2 counted 40%. Similarly, if midterm 2 was skipped, then the final counted 60%. This produced a raw score for each student, and the final distribution is shown below (zeroes not shown):
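
As an aside, the weighting policy above can be written out as a small function. This is only a sketch: the function name, the use of `None` for a skipped midterm, and the assumption that every component is already expressed as a percentage are mine, and the policy as stated does not say what happens if both midterms are skipped (here the weight simply keeps rolling forward to the final).

```python
from typing import Optional


def raw_score(homework: float, quizzes: float,
              midterm1: Optional[float], midterm2: Optional[float],
              final: float) -> float:
    """Combine component percentages (0-100) into a raw score out of 100.

    None marks a skipped midterm. Names and conventions are illustrative.
    """
    w_mt1, w_mt2, w_final = 0.20, 0.20, 0.40
    if midterm1 is None:
        # Midterm 1 skipped: its 20% moves to midterm 2 (which then counts 40%).
        w_mt1, w_mt2 = 0.0, w_mt2 + w_mt1
    if midterm2 is None:
        # Midterm 2 skipped: whatever weight it carries moves to the final.
        # Assumption: this also covers the unspecified "both skipped" case.
        w_mt2, w_final = 0.0, w_final + w_mt2
    return (0.10 * homework
            + 0.10 * quizzes
            + w_mt1 * (midterm1 or 0.0)
            + w_mt2 * (midterm2 or 0.0)
            + w_final * final)
```

For example, a student with 90 on homework, 80 on quizzes, a skipped first midterm, 70 on the second midterm and 75 on the final would be scored as 9 + 8 + 0.4 × 70 + 0.4 × 75 = 75.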

The distribution seems fairly “reasonable”. One student didn’t do any work or show up and got a 5/100. At the other end of the spectrum, some students aced the class. The average score was 74.48 and the standard deviation 15.06. An optimal raw score distribution should allow for detailed discrimination between students (e.g. if everyone gets the same score, that’s not helpful). I think my distribution could have been a bit better, but overall I am satisfied with it. The problem comes with the next step: after obtaining raw scores in a class, the professor has to set cutoffs for A+/A/A-/B+/B/B-/C+/C/C-/D+/D/D-/F. Depending on how the cutoffs are set, the grade distribution can change dramatically. In fact, it is easy to see that (up to ties between students and rounding) any discrete distribution on letter grades is achievable from any raw score distribution. One approach to letter grades would be to fix an A at, say, any raw score greater than or equal to 90%, i.e., no curving. I found that threshold on Wikipedia. But that is rarely how grades are set, partly because of large variability in the difficulty of exams. Almost every professor I know “curves” to some extent. At Berkeley one can examine grade distributions here.
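
To see why cutoffs alone can produce essentially any letter grade distribution, here is a minimal sketch. The function names and the toy scores are hypothetical, not data from Math 10A. Given a target fraction for each letter, it reads the cutoffs off the ranked raw scores. It assumes the target fractions sum to 1 and are listed from best grade to worst; tied raw scores necessarily receive the same letter, so targets are only matched approximately when there are ties.

```python
from collections import Counter


def cutoffs_for_target(scores, target):
    """Return [(letter, minimum raw score)] for every letter but the lowest.

    target: list of (letter, fraction), best grade first, fractions summing to 1.
    """
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    cutoffs = []
    cumulative = 0.0
    for letter, fraction in target[:-1]:
        cumulative += fraction
        # Number of students who should end up at or above this letter.
        k = min(n, round(cumulative * n))
        # If nobody should reach this letter, put the cutoff out of reach.
        minimum = ranked[k - 1] if k > 0 else float("inf")
        cutoffs.append((letter, minimum))
    return cutoffs


def assign(score, cutoffs, lowest):
    """Give the first (best) letter whose cutoff the score meets."""
    for letter, minimum in cutoffs:
        if score >= minimum:
            return letter
    return lowest


def letter_counts(scores, target):
    """Count how many students receive each letter under the target."""
    cutoffs = cutoffs_for_target(scores, target)
    lowest = target[-1][0]
    return Counter(assign(s, cutoffs, lowest) for s in scores)


# Hypothetical raw scores for ten students.
toy_scores = [95, 88, 82, 77, 71, 64, 58, 50, 43, 30]
uniform = [("A", 0.2), ("B", 0.2), ("C", 0.2), ("D", 0.2), ("F", 0.2)]
inflated = [("A", 0.6), ("B", 0.2), ("C", 0.1), ("D", 0.1), ("F", 0.0)]
```

The same ten raw scores yield two students per letter under `uniform` and six As with no Fs under `inflated`; nothing about the students changed, only the cutoffs.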

It turns out that Roger Purves from statistics used to aim for a uniform distribution:

Roger Purves’ Stat 2 grade distribution over the past 6 years.

The increase in C- grades is explained by an artifact of the grading system at Berkeley.  If a student fails the class they can take it again and record the passing grade for their GPA (although the F remains on the transcript). A grade of D is not only devastating for the GPA, but also permanent. It cannot be improved by retaking the class. Therefore many students try to fail when they are doing poorly in a class, and many professors simply avoid assigning Ds. In other words, Purves’ C- includes his Ds. Another issue is that an A+ vs. A does not affect GPA, but an A vs. A- does; the latter is obviously a very subjective difference that varies widely between classes and professors. Note that Roger Purves just didn’t assign A+ grades, presumably because they have no GPA significance (although they do arguably have a psychological impact).

Marina Ratner from math was fond of failing students:

Marina Ratner’s Math 1B, Spring 2009.

In the same semester, in a parallel section, her colleague Richard Borcherds gave the following grades:

Richard Borcherds’ Math 1B, Spring 2009.

Unlike Ratner, Borcherds appears to be averse to failing students. Only 7 students failed out of 441 who were enrolled in his two sections that semester. Fair?

And then there are those who believe in the exponential distribution, for example Heino Nitsche who teaches Chem 1A:

Heino Nitsche’s Chem 1A, Spring 2011.

The variability in grade assignment is astonishing. As can be seen above, curving is prevalent and arbitrary, and the idea that grades have an absolute meaning is not credible. It is statistically highly unlikely that Ratner’s students were always terrible at learning math (whereas Borcherds “luckily” got the good students). Is chemistry inherently easy, to the point where an average student taking the class deserves an A?

This messed-up system differs in its details at other schools, but the problems are similar. Sadly, many schools have used letter grading to manipulate GPAs via continual grade inflation. Just three weeks ago, on December 3rd, the dean of undergraduate education at Harvard confirmed that the median grade at Harvard is an A- and the most common grade an A. The reasons for grade inflation are manifold, but I can understand it on a personal level. It is tempting for a faculty member to assign As because those are likely to immediately translate to better course evaluations (both internal, and public on sites such as Ninja Courses and ratemyprofessor). Local grade inflation can quickly lead to global inflation as professors, and at a higher level their universities, compete with each other for the happiest students.

How did I assign letter grades for Math 10A?

After grading the final exams together, my five GSIs started the process of setting letter grade thresholds by examining the grades of “yardstick students”. These were students for whom the GSIs felt confident in declaring their absolute knowledge of the material to be at the A, B, C or F levels. We then proceeded to refine the thresholds, adding +/- cutoffs by trying to respect the yardsticks while also keeping in mind the overall grade distribution. Finally, I asked the GSIs to consider moving students upward across thresholds if they had shown consistent improvement and commitment throughout the semester (a policy I had informed the students of in class). The result was that about 40% of the students ended with an A. Students failed the class at a threshold where we believed they had not learned enough of the material to proceed to Math 10B. I have to say that as far as my job goes, assigning letter grades for courses is the least scientific endeavor I participate in.

What should be done?

Until recently grades were recorded on paper, making it difficult to perform anything but trivial computations on the raw scores or letter grades. But electronic recording of grades allows for more sophisticated analysis. This should be taken advantage of. Suppose that instead of a letter grade, each student’s raw scores were recorded, along with the distribution of class scores. A single letter would immediately be replaced by a meaningful number in context.

I do think it is unfair to grade students only relatively, especially with respect to cohorts that can range in quality. But it should be possible to compute a meaningful custom raw score distribution specific to individual students based on the classes they have taken. The raw data is a 3-way table whose dimensions consist of professors x classes x raw scores. This table is sparse, as professors typically only teach a handful of different courses throughout their career. But by properly averaging the needed distributions as gleaned from this table, it should be possible to produce for each student an overall GPA score, together with a variance of the (student specific) distribution it came from averaged over the courses the student took. The resulting distribution and score could be renormalized to produce a single meaningful number. That way, taking “difficult” classes with professors who grade harshly would not penalize the GPA. Similarly, aiming for easy As wouldn’t help the GPA. And manipulative grade inflation on the part of professors and institutions would be much more difficult.
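
As a rough illustration of the kind of computation this would involve, here is a deliberately simplified sketch, not the averaging scheme itself, which I have not worked out in detail. It standardizes each raw score against the scores in the same (professor, class) cell and then averages those standardized scores per student. The record layout and the function name are assumptions made for the example. Note that this simplest version is purely relative to each cohort, which is exactly the limitation mentioned above; a real scheme would also need to pool information across the sparse table so that cohort quality is accounted for.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardized_gpa(records):
    """Return {student: (mean standardized score, number of courses)}.

    records: list of (student, professor, course, raw_score) tuples.
    Each raw score is converted to a z-score within its (professor, course)
    cell, then z-scores are averaged per student.
    """
    # Collect the raw score distribution for each (professor, course) cell.
    by_cell = defaultdict(list)
    for _student, professor, course, score in records:
        by_cell[(professor, course)].append(score)
    stats = {cell: (mean(v), pstdev(v)) for cell, v in by_cell.items()}

    # Standardize each score against its own cell.
    z_by_student = defaultdict(list)
    for student, professor, course, score in records:
        mu, sigma = stats[(professor, course)]
        # A cell with no spread carries no information about rank.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        z_by_student[student].append(z)

    return {s: (mean(zs), len(zs)) for s, zs in z_by_student.items()}


# Hypothetical records: a harsh grader and a lenient grader.
toy_records = [
    ("x", "Harsh", "Math", 60), ("p", "Harsh", "Math", 40), ("q", "Harsh", "Math", 50),
    ("y", "Lenient", "Chem", 80), ("r", "Lenient", "Chem", 90), ("s", "Lenient", "Chem", 100),
]
```

In the toy data, student `x` has a raw score of 60 at the top of a harshly graded class, and student `y` has a raw score of 80 at the bottom of a leniently graded one; after standardization `x` ranks above `y`, which is the sense in which harsh grading would no longer penalize the GPA.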

It’s time to level the playing field for students, eliminate the possibility of manipulative grade inflation, and stop the hypocrisy. We need to not only preach statistical and computational thinking to our students, we need to practice it in every aspect of their education.