When I was a teenager I broke all the rules on Friday night. After dinner I would watch Louis Rukeyser’s Wall Street Week at 8:30pm, and I would be in bed an hour later. On New Year’s Eve, he had a special “year-end review”, during which he hosted “financial experts” who would opine on the stock market and make predictions for the coming year.

What I learned from Louis Rukeyser was:

1. Never trust men in suits (or tuxedos).

2. It’s easier to perpetrate the 1024 scam than one might think!

Here are the experts in 1999 all predicting increases for the stock market in 2000:

As it turned out, the NASDAQ peaked on March 10, 2000, and within a week and a half had dropped 10%. By the end of the year the dot-com bubble had completely burst and a few years later the market had lost almost 80% of its value.

Predictions on the last day of the 20th century represented a spectacular failure for the “pundits”, but by then I had already witnessed many failures on the show. I’d also noted that almost all the invited “experts” were men. Of course correlation does not imply causation, but I remember having a hard time dispelling the notion that the guests were wrong because they were men. I never wanted to be sexist, but Louis Rukeyser made it very difficult for me!

Gender issues aside, the main lesson I learned from Louis Rukeyser’s show is that it’s easy to perpetrate the 1024 scam. The scam goes something like this: a scammer sends out 1024 emails to individuals who are unlikely to know each other, with each email making a prediction about the performance of the stock market in the coming week. For half the people (512), she predicts the stock market will go up, and for the other half, that it will go down. The next week, she has obviously sent a correct prediction of the market to half the people (this assumes the market is never unchanged after a week). She ignores the 512 people who have received an incorrect prediction, dividing those who received the correct prediction into two halves (256 each). Again, she predicts the performance of the market in the coming week, sending 256 individuals a prediction that the market will go up, and the other 256 a prediction that it will go down. She continues this divide-and-conquer for 10 weeks, at which time there is one individual who has received correct predictions about the movement of the stock market for 2.5 months! This person may believe that the scammer has the ability to predict the market; after all, $(\frac{1}{2})^{10} \approx 0.00098$, which looks like a very significant p-value. This is when the scammer asks for a “large investment”. Of course what is missing is knowledge of the other prediction emails sent out, or in other words the multiple testing problem.
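The halving scheme is easy to simulate. What follows is a minimal sketch, not anything the scammer needs to run: the function name `run_scam`, the coin-flip model of the market, and the seed are all assumptions made for illustration. The point it shows is that exactly one recipient survives ten rounds no matter what the market does.

```python
import random


def run_scam(n_recipients=1024, weeks=10, seed=0):
    """Simulate the 1024 scam; return the recipients who saw only correct calls.

    The market is modeled as a fair coin flip each week (an illustrative
    assumption; the scam works for any sequence of up/down moves).
    """
    rng = random.Random(seed)
    recipients = list(range(n_recipients))
    for _ in range(weeks):
        half = len(recipients) // 2
        told_up, told_down = recipients[:half], recipients[half:]
        market_went_up = rng.random() < 0.5
        # Keep only the half that received the correct prediction.
        recipients = told_up if market_went_up else told_down
    return recipients


survivors = run_scam()
print(len(survivors))  # 1
print(0.5 ** 10)       # 0.0009765625, the "p-value" as seen by the survivor
```

From the survivor’s point of view the streak has probability of about 0.001. From the scammer’s point of view, the probability that somebody sees a perfect streak is 1.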

The Wall Street Week guest panels essentially provided a perfect setting in which to perpetrate this scam. “Experts” who erred were unlikely to be invited back, whereas regular winners would return for another chance at guessing. This is a situation very similar to the mutual fund management market, where managers are sacked when they have a bad year, only to have large firms with hundreds of funds on the books highlight funds that have performed well for 10 years in a row in their annual glossy brochures. But that is not the subject matter of this blog post. Rather, it’s the blog itself.

I wrote and posted my first blog entry (Genesis of *Seq) exactly a year ago. I began writing it for two reasons. First, I thought it could be a convenient and useful forum for discussion of technical developments in computational biology. I was motivated partly by the seqanswers website, which allows users to share information and experience in dealing with high-throughput sequence data. But I was also inspired by the What’s New Blog that has created numerous bridges in the mathematics community via highly technical yet accessible posts that have democratized mathematics. Second, I had noticed an extraordinary abuse of multiple testing in computational biology, and I was desperate for a forum where I could bring the issue to people’s attention. My initial frustration with outlandish claims in papers based on weak statistics had also grown over time to encompass a general concern for lack of rigor in computational biology papers. None of us are perfect but there is a wide gap between perfect and wrong. Computational biology is a field that is now an amalgamation of many subjects and I hoped that a blog would be able to reach the different silos more effectively than publications.

And thus this blog was born on August 19th 2013. I started without a preconception of how it would turn out over time, and I’m happy to say I’ve been surprised by its impact, most notably on myself. I’ve learned an enormous amount from reader feedback, in part via comments on individual posts, but also from private emails to me and in personal conversations. For this (selfish) reason alone, I will keep blogging. I have also been asked by many of you to keep posting, and I’m listening. When I have nothing left to say, I promise I will quit. But for now I have a backlog of posts, and after a break this summer, I am ready to return to the keyboard. Besides, since starting to blog I still haven’t been to Las Vegas.

In reading the news yesterday I came across multiple reports claiming that even casually smoking marijuana can change your brain. I usually don’t pay much attention to such articles; I’ve never smoked a joint in my life. In fact, I’ve never even smoked a cigarette. So even though as a scientist I’ve been interested in cannabis from the molecular biology point of view, and as a citizen from a legal point of view, the issues have not been personal. However, reading a USA Today article about the paper, I noticed that the principal investigator Hans Breiter was claiming to be a psychiatrist and mathematician. That is an unusual combination so I decided to take a closer look. I immediately found out the claim was a lie. In fact, the totality of math credentials of Hans Breiter consists of some logic/philosophy courses during a year abroad at St. Andrews while he was a pre-med student at Northwestern. Even being an undergraduate major in mathematics does not make one a mathematician, just as being an undergraduate major in biology does not make one a doctor. Thus, with his outlandish claim, Hans Breiter had succeeded in personally offending me! So, I decided to take a look at his paper underlying the multiple news reports:

This is quite possibly the worst paper I’ve read all year (as some of my previous blog posts show I am saying something with this statement). Here is a breakdown of some of the issues with the paper:

### 1. Study design

First of all, the study has a very small sample size, with only 20 “cases” (marijuana users), a fact that is important to keep in mind in what follows. The title uses the term “recreational users” to describe them, and in the press release accompanying the article Breiter says that “Some of these people only used marijuana to get high once or twice a week. People think a little recreational use shouldn’t cause a problem, if someone is doing OK with work or school. Our data directly says this is not the case.” In fact, the majority of users in the study were smoking more than 10 joints per week. There is even a person in the study smoking more than 30 joints per week (as disclosed above, I’m not an expert on this stuff but if 30 joints per week is “recreation” then it seems to me that person is having a lot of fun). More importantly, Breiter’s statement in the press release is a lie. There is no evidence in the paper whatsoever, not even a tiny shred, that the users who were getting high once or twice a week were having any problems. There are also other issues with the study design. For example, the paper claims the users are not “abusing” other drugs, but it is quite possible that they are getting high on cocaine, heroin, or ??? as well, an issue that could quite possibly affect the study. The experiment consisted of an MRI scan of each user/control, but only a single scan was done. Given the variability in MRI scans this also seems problematic.

### 2. Multiple testing

The study looked at three aspects of brain morphometry in the study participants: gray matter density, volume and shape. Each of these morphometric analyses constituted multiple tests. In the case of gray matter density, estimates were based on small clusters of voxels, resulting in 123 tests (association of each voxel cluster with marijuana use). Volumes were estimated for four regions: left and right nucleus accumbens and amygdala. Shape was also tested in the same four regions. What the authors should have done is to correct the p-values computed for each of these tests by accounting for the total number of tests performed. Instead, (Bonferroni) corrections were performed separately for each type of analysis. For example, in the volume analysis p-values were required to be less than 0.0125 = 0.05/4. In other words, the extent of testing was not properly accounted for. Even so, many of the results were not significant. For example, the volume analysis showed no significant association for any of the four tested regions. The best case was the left nucleus accumbens (Figure 1C) with a corrected p-value of 0.015 which is over the authors’ own stated required threshold of 0.0125 (see caption). They use the language “The association with drug use, after correcting for 4 comparisons, was determined to be a trend toward significance” to describe this non-effect. It is worth noting that the removal of the outlier at a volume of over $800\ \mathrm{mm}^3$ would almost certainly flatten the line altogether and remove even the slight effect. It would have been nice to test this hypothesis but the authors did not release any of their data.
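To make the difference between per-analysis and pooled correction concrete, here is a rough sketch of the arithmetic. It assumes the test counts quoted above (123 density, 4 volume, 4 shape) are the whole family, which ignores the additional drug use measures; the variable names are mine, and the pooled count of 131 is my tally rather than a number from the paper.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


alpha = 0.05
# Test counts as quoted in the text; treating them as one family is an assumption.
tests_per_analysis = {"density": 123, "volume": 4, "shape": 4}

# What the authors did: correct within each analysis separately.
per_analysis = {
    name: bonferroni_threshold(alpha, n) for name, n in tests_per_analysis.items()
}

# Correcting for all tests at once.
total_tests = sum(tests_per_analysis.values())
pooled = bonferroni_threshold(alpha, total_tests)

p_left_nac = 0.015  # value reported for the left nucleus accumbens volume

print(total_tests)
print(per_analysis["volume"], p_left_nac < per_analysis["volume"])
print(pooled, p_left_nac < pooled)
```

Under these assumptions the pooled threshold is roughly 0.0004, about thirty times stricter than the 0.0125 used for the volume analysis, and the left nucleus accumbens result fails both.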

Figure 1c.

In the Fox News article about the paper, Breiter is quoted as follows: “For the NAC [nucleus accumbens], all three measures were abnormal, and they were abnormal in a dose-dependent way, meaning the changes were greater with the amount of marijuana used,” Breiter said. “The amygdala had abnormalities for shape and density, and only volume correlated with use. But if you looked at all three types of measures, it showed the relationships between them were quite abnormal in the marijuana users, compared to the normal controls.” The result above shows this to be a lie. Volume did not significantly correlate with use.

This is all very bad, but things get uglier the more one looks at the paper. In the tables reporting the p-values, the authors do something I have never seen before in a published paper. They report the uncorrected p-values, indicating those that are significant (prior to correction) in boldface, and then put an asterisk next to those that are significant after their (incomplete) correction. I realize my own use of boldface is controversial… but what they are doing is truly insane. The fact that they put an asterisk next to the values significant after correction indicates they are aware that multiple testing is required. So why bother boldfacing p-values that they know are not significant? The overall effect is an impression that more tests are significant than is actually the case. See for yourself in their Table 4:

Table 4.

The fact that there are multiple columns is also problematic. Separate tests were performed for smoking occasions per day, joints per occasion, joints per week and smoking days per week. These measures are highly correlated, but even so multiply testing them requires multiple test correction. The authors simply didn’t perform it. They say “We did not correct for the number of drug use measures because these measures tend not be independent of each other”. In other words, they multiplied the number of tests by four, and chose to not worry about that. Unbelievable.
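A quick back-of-the-envelope calculation shows why correlation among the measures is not an excuse. The sketch below uses a hypothetical helper, `fwer_if_independent`, and the independence case is an illustrative assumption only, since the four measures are clearly correlated. What holds regardless of the correlation structure is that the chance of at least one false positive across the four columns lies somewhere between the single-test level and the union bound.

```python
def fwer_if_independent(alpha, n_tests):
    """Chance of at least one false positive among n independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def fwer_bounds(alpha, n_tests):
    """Bounds that hold for any dependence structure among the tests.

    Lower bound: all tests perfectly correlated (they behave as one test).
    Upper bound: the union bound, which is what Bonferroni protects against.
    """
    return alpha, min(1.0, alpha * n_tests)


alpha, n_measures = 0.05, 4
print(fwer_if_independent(alpha, n_measures))
print(fwer_bounds(alpha, n_measures))
```

With four uncorrected tests at 0.05, the error rate is about 0.185 under independence and can be as high as 0.2. It only drops back to 0.05 if the four measures are effectively the same measurement, in which case there was no reason to test all of them.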

Then there is Table 5, where the authors did not report the p-values at all, only whether they were significant or not… without correction:

Table 5.

### 3. Correlation vs. causation

This issue is one of the oldest in the book. There is even a Wikipedia entry about it: Correlation does not imply causation. Yet despite the fact that every result in the paper is directed at testing for association, in the last sentence of the abstract they say “These data suggest that marijuana exposure, even in young recreational users, is associated with exposure-dependent alterations of the neural matrix of core reward structures and is consistent with animal studies of changes in dendritic arborization.” At a minimum, such a result would require doing a longitudinal study. Breiter takes this language to an extreme in the press release accompanying the article. I repeat the statement he made that I quoted above where I boldface the causal claim: “Some of these people only used marijuana to get high once or twice a week. People think a little recreational use shouldn’t cause a problem, if someone is doing OK with work or school. Our data directly says this is not the case.” I believe that scientists should be sanctioned for making public statements that directly contradict the content of their papers, as appears to be the case here. There is precedent for this.

[Update April 6, 2014: The initial title of this post was “23andme genotypes are all wrong”. While that was and remains a technically correct statement, I have changed it because the readership of my blog, and this post in particular, has changed. Initially, when I made this post, the readers of the blog were (computational) biologists with extensive knowledge of genotyping and association mapping, and they could understand the point I was trying to make with the title. However in the past few months the readership of my blog has grown greatly, and the post is now reaching a wide public audience. The revised title clarifies that the content of this post relates to the point that low error rates in genotyping can be problematic in the context of genome-wide association reports because of multiple-testing.]

I have been reading the flurry of news articles and blog posts written this week about 23andme and the FDA with some interest. In my research talks, I am fond of displaying 23andme results, and have found that people always respond with interest. On the teaching side, I have subsidized 23andme testing for volunteer students in Math127 who were interested in genetics so that they could learn about personalized genomics first-hand. Finally, a number of my former and current students have worked at 23andme, and some are current employees.

Despite lots of opinions being expressed about the 23andme vs. FDA kerfuffle, I believe that two key points have been ignored in the discussions:

1. All 23andme genotypes that have ever been reported to customers are wrong. This is the case despite very accurate genotyping technology used by 23andme.
2. The interpretation of 23andme results involves examining a large number of odds ratios. The presence of errors leads to a huge multiple-testing problem.

Together, these issues lead to an interesting conundrum for the company, for customers, and for the FDA.
