Rapid testing has been a powerful tool to control COVID-19 outbreaks around the world (see Iceland, Germany, …). While many countries support testing through government sponsored healthcare infrastructure, in the United States COVID-19 testing has largely been organized and provided by for-profit businesses. While financial incentives coupled with social commitment have motivated many scientists and engineers at companies and universities to work hard around the clock to facilitate testing, there are also many individuals who have succumbed to greed. Opportunism has bubbled to the surface and scams, swindles, rackets, misdirection and fraud abound. This is happening at a time when workplaces are in desperate need of testing, and demands for testing are likely to increase as schools, colleges and universities start opening up in the next month. Here are some examples of what is going on:

  • First and foremost there is your basic fraud. In July, a company called “Fillakit”, which had been awarded a $10.5 million federal contract to make COVID-19 test kits, was shipping unusable, contaminated soda bottles. This “business”, started by some law and real estate guy called Paul Wexler, who has been repeatedly accused of fraud, went under two months after it launched amidst a slew of investigations and complaints from former workers. Oh, BTW, Michigan ordered 322,000 Fillakit tubes, which went straight to the trash (as a result they could not do a week’s worth of tests).
  • Not all fraud is large scale. Some former VP at the now defunct “Cure Cannabis Solutions” ordered 100 COVID-19 test kits that do who-knows-what at a price of 50c a kit. The Feds seized them. These kits, which were not FDA approved, were sourced from “Anhui DeepBlue Medical” in Hefei, China.
  • To be fair, the Cannabis guy was small fry. In Laredo, Texas, some guy called Robert Castañeda received assistance from a congressman to purchase $500,000 of kits from the same place! Anhui DeepBlue Medical sent Castañeda 20,000 kits ($25 a test). Apparently the tests had 20% accuracy. To his credit, the Cannabis guy paid 1/50th the price for this junk.
  • Let’s not forget who is really raking in the bucks though. Quest Diagnostics and LabCorp are the primary testing outfits in the US right now; each is processing around 150,000 tests a day. These are for-profit companies and indeed they are making a profit. The economics is simple: insurance companies reimburse LabCorp and Quest Diagnostics for the tests. The rates are basically determined by the amount that Medicare will pay, i.e. the government price point. Initially, the reimbursement was set at $51, and well… at that price LabCorp and Quest Diagnostics just weren’t that interested. I mean, people have to put food on the table, right? (Adam Schechter, CEO of LabCorp, makes $4.9 million a year; Steve Rusckowski, CEO of Quest Diagnostics, makes $9.9 million a year.) So the Medicare reimbursement rate was raised to $100. The thing is, LabCorp and Quest Diagnostics get paid regardless of how long it takes to return test results. Some people are currently waiting 15 days to get results (narrator: such test results are useless).
  • Perhaps a silver lining lies in the stock price of these companies. The title of this post is “$ How to Profit From COVID-19 Testing $”. I guess being able to take a week or two to return a test result and still get paid $100 for something that cost $30 lifts the stock price… and you can profit!
  • Let’s not forget the tech bros! A bunch of dudes in Utah from companies like Nomi, Domo and Qualtrics signed a two-month contract with the state of Utah to provide 3,000 tests a day. One of the tech executives pushing the initiative, called TestUtah, was Mark Newman, the 37-year-old founder of Nomi Health. He admitted that “none of us knew anything about lab testing at the start of the effort”. That didn’t stop Newman et al. from signing more than $50 million in agreements with several states to provide testing. tl;dr: the tests had a poor limit of detection, samples were mishandled, throughput was much lower than promised, etc. etc., and as a result they weren’t finding positive cases at rates similar to other testing facilities. The significance is summarized poignantly in a New Yorker piece about the debacle:

    “I might be sick, but I want to go see my grandma, who’s ninety-five. So I go to a TestUtah site, and I get tested. TestUtah tells me I’m negative. I go see grandma, and she gets sick from me because my result was wrong, because TestUtah ran an unvalidated test.”

    P.S. There has been a disturbing TestUtah hydroxychloroquine story going on behind the scenes. I include this fact because no post on fraud and COVID-19 would be complete without a mention of hydroxychloroquine.

  • Maybe though, the tech bros will save the day. The recently launched $5 million COVID-19 X-prize is supposed to incentivize the development of “Frequent. Fast. Cheap. Easy.” COVID-19 testing. The goal is nothing less than to “radically change the world.” I’m hopeful, but I just hope they don’t cancel it like they did the genome prize. After all, their goal of “500 tests per week with 12-hour turnaround from sample to result” is likely to be outpaced by innovation just like what happened with genome sequencing. So in terms of making money from COVID-19 testing, don’t hold your breath with this prize.
  • As is evident from the examples above, one of the obstacles to quick riches with COVID-19 testing in the USA is the FDA. The thing about COVID-19 testing is that lying to the FDA on applications, or providing unauthorized tests, can lead to unpleasantries, e.g. jail. So some play it straight and narrow. Consider, for example, SeqOnce, which has developed the Azureseq SARS-CoV-2 RT-qPCR kit. These guys have an “EUA-FDA validated test”:
    [Screenshot of the SeqOnce website]
    This is exactly what you want! You can click on “Order Now” and pay $3,000 for a kit that can be used to test several hundred samples (great price!), and the site has all the necessary information: IFUs (these are “instructions for use” that come with FDA authorized tests), validation results, etc. If you look carefully you’ll see that administration of the test requires FDA approval. The company is upfront about this. Except the test is not FDA authorized; this is easy to confirm by examining the FDA Coronavirus EUA site. One can infer from a press release that they have submitted an EUA (Emergency Use Authorization), but while they claim the test has been validated, nowhere does it say it has been authorized.

    Clever, eh? Authorized, validated, authorized, validated, authorized… and here I was, just about to spend $3,000 for a bunch of tests that cannot currently be legally administered. Whew! At least this is not fraud. Maybe it’s better called… I don’t know… a game?

    Other companies are playing similar games. Ginkgo Bioworks is advertising “testing at scale, supporting schools and businesses” with an “Easy to use FDA-authorized test”, but again this seems to be a product that has “launched”, not one that, you know, actually exists; I could find no Ginkgo Bioworks test that works at scale that is authorized on the FDA Coronavirus EUA website, and it turns out that what they mean by FDA authorized is an RT-PCR test that they have outsourced to others. Fingers crossed though: maybe the marketing helped CEO Jason Kelly raise the $70 million his company has received for the effort; I certainly hope it works (soon)!
  • By the way, I mentioned that the SeqOnce operation is a bunch of “guys”. I meant this literally; this is their “team”:
    [Photo of the SeqOnce leadership team]
    Just one sec… what is up with biotech startups and 100% men leadership teams? (See Epinomics, Insight Genetics, Ocean Genomics, Kailos Genetics, Circulogene, etc. etc.)… And what’s up with the black and white thing? Is that to try to hide that there are no people of color?
    I mention the 100% male team because if you look at all the people mentioned in this post, all of them are guys (except the person in the next sentence), and I didn’t plan that, it’s just how it worked out. Look, I’m not casting shade on the former CEO of Theranos. I’m just saying that there is a strong correlation here.

    Sorry, back to the regular programming…

  • Speaking of swindlers and scammers, this post would not be complete without a mention of the COVID-19 testing czar, Jared Kushner. His secret testing plan for the United States went “poof into thin air”! I felt that the 1 million contaminated and unusable Chinese test kits that he ordered for $52 million deserved the final mention in this post. Sadly, these failed kits appear to be the main thrust of the federal response to COVID-19 testing needs so far, consistent with Trump’s recent call to “slow the testing down” (he wasn’t kidding). Let’s see what turns up today at the hearings of the U.S. House Select Subcommittee on Coronavirus, whose agenda is “The Urgent Need for a National Plan to Contain the Coronavirus”.



Today, June 10th 2020, black academic scientists are holding a strike in solidarity with Black Lives Matter protests. I strike with them and for them. This is why:

I began to understand the enormity of racism against blacks thirty-five years ago, when I was 12 years old. A single event, in which I witnessed a black man pleading for his life, opened my eyes. I don’t remember his face but I do remember looking at his dilapidated brown pants and noticing his hands shaking around the outside of his pockets while he pleaded for mercy:

“Please baas, please baas, … ”

The year was 1985, and I was visiting my friend Tamir Orbach at his house in Pretoria (now Tshwane), South Africa, located on Muckleneuk hill. We were playing in the courtyard next to Tamir’s garage, which was adjacent to a retaining wall and a wide gate. Google Satellite now enables virtual visits to anywhere in the world, and it took me seconds to find the house. The courtyard and retaining wall look the same. The gate we were playing in front of has changed color from white to black:

[Google satellite view of the house]

The house was located at the bottom of a short cul de sac on the slope of a hill. It’s difficult to see from the aerial photo, but in the street view, looking down, the steep driveway is visible. The driveway stones are the same as they were the last time I was at the house in the 1980s:

[Google street view looking down the driveway]

We heard some commotion at the top of the driveway. I don’t remember what we were doing at that moment, but I do remember seeing a man sprinting down the hill towards us. I remember being afraid of him. I was afraid of black men. A police officer was chasing him, gun in hand, shouting at the top of his lungs. The man ran into the neighboring property, scaled a wall to leap onto a roof, only to realize he might be trapped. He jumped back onto the driveway, dodged the cop, and ran back up the hill. I remember thinking that I had never seen a man run so fast. The policeman, by now out of breath but still behind the man, chased close behind with his gun swinging around wildly.

There was a second police officer, who was now visible standing at the top of the driveway, feet apart, and pointing a gun down at the man. We were in the line of fire, albeit quite far away behind the gate. The sprint ended abruptly when the man realized he had, in fact, been trapped. Tamir and I had been standing, frozen in place, watching the events unfold in front of us. Meanwhile the screaming had drawn one of our parents out of the house, concerned about the commotion and asking us what was going on. We walked, together, up the driveway to the street.

The man was being arrested next to a yellow police pickup truck, a staple of South African police at the time and an emblem of police brutality. The police pickup trucks had what was essentially a small jail cell mounted on the flatbed, and they were literal pick up trucks; their purpose was to pick up blacks off the streets.


Dogs were barking loudly in the back of the pickup truck and the man was sobbing.

“Please baas, not the dogs. Not the dogs. Please baas. Please baas…”

The police were yelling at the man.

“Your passbook no good!! No pass!! Your passbook!! You’re going in with the dogs and coming with us!”

“Please… please… ” the man begged. I remember him crying. He was terrified of the dogs. They had started barking so loudly and aggressively that the vehicle was shaking. The man kept repeating “Please… not with the dogs… please… they will kill me. Please… help me. Please… the dogs will kill me.”

He was pleading for his life.


The passbook the police were yelling about was a sort of domestic or internal passport all black people over the age of 16 were required to carry at all times in white areas. South Africa, in 1985, was a country that was racially divided. Some cities were for whites only. Some only for blacks. “Coloureds”, who were defined as individuals of mixed ancestry, were restricted to cities of their own. In his book “Born a Crime”, Trevor Noah describes how these anti-miscegenation laws resulted in it being impossible for him to legally live with his mother when he was a child. Note that Mississippi removed anti-miscegenation laws from its state constitution only in 1987 and Alabama in 2000.

The South African passbook requirement stemmed from a law passed in 1952, with origins dating back to British policies from the 18th century. The law had the following stipulation:

No black person could stay in a white urban area for more than 72 hours unless explicit permission was granted by an employer (required to be white).

The passbook contained behavioral evaluations from employers. Permission to enter an area could be revoked by any government employee for any reason.

All the live-in maids (as they were called) in Pretoria had passbooks permitting them to live (usually in an outhouse) on the property of their “employer”. I put “employer” in quotes because at best they would earn $250 a month (in today’s dollars, adjusted for inflation), sleep in a small shack outside of a large home, and receive a small budget for food which would barely cover mielie pap. In many cases they lived in outhouses without running water, were abused, beaten and raped. Live-in maids spent months at a time apart from their children and families; they couldn’t leave their jobs for fear of being fired and/or losing their pass permission. Their families couldn’t visit them as they did not have permission, by pass laws, to enter the white areas in which the live-in maids worked.

Most males had passbooks allowing them only day trips into the city from the black townships in which they lived. Many lived in Mamelodi, a township 15 miles east of Tshwane, and would travel hours to and from work because they were not allowed on white public transport. I lived in Pretoria for 13 years and I never saw Mamelodi.

I may have heard about passbooks before the incident at Tamir’s house, but I didn’t know what they were or how they worked. Learning about pass laws was not part of our social studies or history curriculum. At my high school, Pretoria Boys High School, a Milner school which counts among its alumni individuals such as dilettante Elon Musk and murderer Oscar Pistorius, we learned about the history of South Africa’s white architects, people like Cecil Rhodes (may his name and his memory be erased). There was one black boy in the school when I was there (out of about 1,200 students). He was allowed to attend because he was the son of an ambassador, as if somehow that mitigated his blackness.

South Africa started abandoning its pass laws in 1986, just a few months after the incident I described above. Helen Suzman described it at the time as possibly one of the most eminent government reforms ever enacted. Still, although this was a small step towards dismantling apartheid, Nelson Mandela was still in jail, in Pollsmoor Prison at that time, and he remained imprisoned for 3 more years until he was released from captivity after 27 years in 1990.



We did not stand by idly while the man was being arrested. We asked the police to let him go, or at least to not throw him in with the dogs, but the cops ignored us and dragged the man towards the back of the van. The phrase “kicking and screaming” is bandied about a lot; there is even a sports comedy with that title. That day I saw a man literally kicking and screaming for his life. The back doors of the van were opened and the dogs, tugging against their leashes, appeared to be ready to devour him whole. He was tossed inside like a piece of meat.

The ferocity of the police dogs I saw that day was not a coincidence or accident, it was by design. South Africa, at one time, developed a breeding program at Roodeplaat Breeding Enterprises led by German geneticist Peter Geertshen to create a wolf-dog hybrid. Dogs were bred for their aggression and strength. The South African Boerboel is today one of the most powerful dog breeds in the world, and regularly kills in the United States, where it is imported from South Africa.


After encounters with numerous Boerboels, Dobermans, Rottweilers and Pitbull dogs as a child in South Africa I am scared of dogs to this day. I know it’s not rational, and some of my best friends and family have dogs that I adore and love, but the fear lingers. Sometimes I come across a K-9 unit and the terror surfaces. Police dogs are potent police weapons here, today, just as they were in South Africa in the 1980s. There is a long history of this here. Dogs were used to terrorize blacks in the Civil Rights era, and the recent invocation of “vicious dogs” by the president of the United States conjures up centuries of racial terror:

I learned at age 12 that LAW & ORDER isn’t all it’s hyped up to be.


I immigrated to America in August 1988, and imagined that here I would find a land free of the suffocating racism of South Africa. In my South African high school racism was open, accepted and embraced. Nigg*r balls were sold in the campus cafeteria (black licorice balls), and students would tell idiotic “jokes”  in which dead blacks were frequently the punchline. Some of the teachers were radically racist. My German teacher, Frau Webber, once told me and Tamir that she would swallow her pride and agree to teach us despite the fact that we were Jews. But much more pernicious was the systemic, underlying, racism. When I grew up the idea that someday I would go to university and study alongside a black person just seemed preposterous. My friends and I would talk about girls. The idea that any of us would ever date, let alone marry an African girl, was just completely and totally out of the realm of possibility. While my school, teachers and friends were what one would consider “liberal” in South Africa, e.g. many supported the ANC, their support of blacks was largely restricted to the right to vote.

Sadly, America was not the utopia I imagined. In 1989, a year after I immigrated here, Yusef Hawkins was murdered in a hate crime by white youths who thought he was dating a white woman. That was also the year of the “Central Park Five“, in which Trump played a central, disgraceful and racist role. I finished high school in Palo Alto, across a highway from East Palo Alto, and the difference between the cities seemed almost as stark as between the white and black neighborhoods in South Africa. I learned later that this was the result of redlining. My classmates and teachers in Palo Alto were obsessed, in 1989, with the injustices in South Africa, but never once discussed East Palo Alto with me or with each other. I was practicing for the SAT exams at the time and remember thinking Palo Alto : East Palo Alto = Pretoria : Mamelodi.

Three years after that, when I was an undergraduate student studying at Caltech in Los Angeles, the Rodney King beating happened. I saw a black man severely beaten on television in what looked like a clip borrowed from South Africa. My classmates at the time thought it would be exciting to drive to South Central Los Angeles to see the “rioters” up close. They had never visited those areas before,  nor did they return afterwards. I was reminded at the time of the poverty tourism my friends in South Africa would partake in: a tour to Soweto accompanied by guides with guns to see for oneself how blacks lived. Then right back home for a braai (BBQ). My classmates came back from their Rodney King tour excitedly telling stories of violence and dystopia. Then they partied into the night.

I thought about my only classmate, one out of 200, who was actually from South Los Angeles and about the dissonance that was his life and my classmates’ partying.

Now I am a professor, and I am frequently present in discussions on issues such as undergraduate and graduate admissions, and hiring. Faculty talk a lot, sometimes seemingly endlessly, about diversity, representation, gender balance, and so forth and so on. But I’ve been in academia for 20+ years, and it was only three years ago, after moving to Caltech, that I attended a faculty meeting with a black person for the first time. Sometimes I look around during faculty meetings and wonder whether I am in America or South Africa. How can I tell?


Today is an opportunity for academics to reflect on the murder of George Floyd, and to ask difficult questions of themselves. It’s not for me to say what all the questions are or ought to be. I will say this: at a time when everything is unprecedented (Trump’s tweets, the climate, the stock market, the pandemic, etc. etc.) the murder of George Floyd was completely precedented. His words. The mode of murder. The aftermath. It has happened many times before, including recently. And so it is in academia. The fundamental racism, the idea that black students, staff, and faculty, are not truly as capable as whites, it’s simply a day-to-day reality in academia, despite all the talk and rhetoric to the contrary. Did any academics, upon hearing of the murder of George Floyd, worry immediately that it was one of their colleagues, George Floyd, Ph.D., working at the University of Minnesota who was killed?

I will take the time today to read. I will pick up Long Walk to Freedom, and I will also read #BlackintheIvory. I may read some Alan Paton. I will pause to think about how my university can work to improve the recruitment, mentoring, and experience of black students, staff and faculty. Just some ideas.

All these years since leaving South Africa I’ve had a recurring dream. I fly around Pretoria. The sun has just set and the Union Buildings are lit up, glowing a beautiful orange in the distance. The city is empty. My friends are not there. The man I saw pleading for his life in 1985 is gone. I wonder what the police did to him when he arrived at the police station. I wonder whether he died there, like many blacks at the time did. I fly nervously, trying to remember whether I have my passbook on me. I remember I’m classified white and I don’t need a passbook. I hear dogs barking and wonder where they are, because the city is empty. I wonder what it will feel like when they eat me, and then I remember I’m white and I’m not their target. I hope that I don’t encounter them anyway, and I realize what a privilege it is to be able to fly where they can’t reach me. Then I notice that I’m slowly falling, and barely clearing the slopes of Muckleneuk hill. I realize I will land and am happy about that. I slowly halt my run as my feet gently touch the ground.




The widespread establishment of statistics departments in the United States during the mid-20th century can be traced to a presentation by Harold Hotelling at the Berkeley Symposium on Mathematical Statistics and Probability in 1945. The symposium, organized by Berkeley statistician Jerzy Neyman, was the first of six such symposia, held every five years, that became the most influential meetings in statistics of their time. Hotelling’s lecture on “The place of statistics in the university” inspired the creation of several statistics departments, and at UC Berkeley, Neyman’s establishment of the statistics department in the 1950s was a landmark moment for statistics in the 20th century.

Neyman was hired into the mathematics department at UC Berkeley by a visionary chair, Griffith Evans, who transformed the UC Berkeley math department into a world-class institution after his hiring in 1934. Evans’ vision for the Berkeley math department included statistics, and Erich Lehmann‘s history of the UC Berkeley statistics department details how Evans’ commitment to diverse areas in the department led him to hire Neyman without even meeting him. However, Evans’ progressive vision for mathematics was not shared by all of his colleagues, and the conservative, parochial attitudes of the math department contributed to Neyman’s breakaway and eventual founding of the statistics department. This dynamic was later repeated at universities across the United States, resulting in a large gulf between mathematicians and statisticians (ironically, history may be repeating itself, with some now suggesting that the emergence of “data science” is a result of conservatism among statisticians leading them to cling to theory rather than to care about data).

The divide between mathematicians and statisticians is unfortunate for a number of reasons, one of them being that statistical literacy is important even for the purest of the pure mathematicians. A recent debate on the appropriateness of diversity statements for job applicants in mathematics highlights the need: analysis of data, specifically data on who is in the math community and their opinions on the issue, turns out to be central to understanding the matter at hand. Case in point is a recent preprint by two mathematicians:

Joshua Paik and Igor Rivin, Data Analysis of the Responses to Professor Abigail Thompson’s Statement on Mandatory Diversity Statements, arXiv, 2020.

This statistics preprint attempts to identify the defining attributes of mathematicians who signed recent letters related to diversity statement requirements in mathematics job searches. I was recently asked to provide feedback on the manuscript, ergo this blog post.


In order to assess the results of any preprint or paper, it is essential, as a first step, to be able to reproduce the analysis and results. In the case of a preprint such as this one, this means having access to the code and data used to produce the figures and to perform the calculations. I applaud the authors for being fully transparent and making available all of their code and data in a Github repository in a form that made it easy to reproduce all of their results; indeed I was able to do so without any problems. 👏

The dataset

The preprint analyzes data on signatories of three letters submitted in response to an opinion piece on diversity statement requirements for job applicants published by Abigail Thompson, chair of the mathematics department at UC Davis. Thompson’s letter compared diversity statement requirements for job applicants to loyalty oaths required during McCarthyism. The response letters range from strong affirmation of Thompson’s opinions to strong refutation of them. Signatories of “Letter A”, titled “The math community values a commitment to diversity”, “strongly disagreed with the sentiments and arguments of Dr. Thompson’s editorial” and are critical of the AMS for publishing her editorial. Signatories of “Letter B”, titled “Letter to the editor”, worry about “direct attempt[s] to destroy Thompson’s career and attempt[s] to intimidate the AMS”. Signatories of “Letter C”, titled “Letter to the Notices of the AMS”, write that they “applaud Abigail Thompson for her courageous leadership [in publishing her editorial]” and “agree wholeheartedly with her sentiments.”

The dataset analyzed by Paik and Rivin combines information scraped from Google Scholar and MathSciNet with data associated to the signatories that was collated by Chad Topaz. The dataset is available in .csv format here.

The Paik and Rivin result

The main result of Paik and Rivin is summarized in the first paragraph of their Conclusion and Discussion section:

“We see the following patterns amongst the “established” mathematicians who signed the three letters: the citations numbers distribution of the signers of Letter A is similar to that of a mid-level mathematics department (such as, say, Temple University), the citations metrics of Letter B are closer to that of a top 20 department such as Rutgers University, while the citations metrics of the signers of Letter C are another tier higher, and are more akin to the distribution of metrics for a truly top department.”

A figure from their preprint summarizing the data supposedly supporting their result is reproduced below (with the dotted blue line shifted slightly to the right after the bug fix):

[Figure from the preprint showing the citation count distributions for signers of the three letters]

Paik and Rivin go a step further, using citation counts and h-indices as proxies for “merit in the judgement of the community.” That is to say, Paik and Rivin claim that mathematicians who signed letter A, i.e. those who strongly disagreed with Thompson’s equivalence between diversity statements and McCarthy’s loyalty oaths, have less “merit in the judgement of the community” than mathematicians who signed letter C, i.e. those who agreed wholeheartedly with her sentiments.

The differential is indeed very large. Paik and Rivin find that the mean number of citations for signers of Letter A is 2397.75, the mean number of citations for signers of Letter B is 4434.89, and the mean number of citations for signers of Letter C is 6226.816. To control for an association between seniority and number of citations, the computed averages are based only on citation counts of full professors. [Note: a bug in the Paik-Rivin code results in an error in their reporting for the mean for group B. They report 4136.432 whereas the number is actually 4434.89.]
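The computation behind these numbers is a simple filter-and-group-mean. A minimal sketch of the idea in Python; the records and field names below are made up for illustration and are not the actual schema of the Paik-Rivin dataset:

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration; the field names are assumptions,
# not the actual columns of the Paik-Rivin dataset.
signers = [
    {"letter": "A", "rank": "Full",      "citations": 1200},
    {"letter": "A", "rank": "Assistant", "citations": 300},
    {"letter": "B", "rank": "Full",      "citations": 4000},
    {"letter": "B", "rank": "Full",      "citations": 5000},
    {"letter": "C", "rank": "Full",      "citations": 6000},
    {"letter": "C", "rank": "Associate", "citations": 900},
]

# Restrict to full professors (as Paik and Rivin do, to control for
# seniority), then average citations within each letter group.
by_letter = defaultdict(list)
for s in signers:
    if s["rank"] == "Full":
        by_letter[s["letter"]].append(s["citations"])

means = {letter: mean(cites) for letter, cites in sorted(by_letter.items())}
print(means)
```

Note that a grouping like this is exactly the kind of code where a bug of the sort described above can hide: filter or label one group incorrectly and its reported mean silently changes.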

This data seems to support Paik and Rivin’s thesis that mathematicians who support the use of diversity statements in hiring, and who strongly disagree with Thompson’s analogy of such statements to McCarthy’s loyalty oaths, are second rate mathematicians, whereas those who agree wholeheartedly with Thompson are on par with professors at “truly top departments”.

But do the data really support this conclusion?

A fool’s errand

Before delving into the details of the data Paik and Rivin analyzed, it is worthwhile to pause and consider the validity of using citations counts and h-indices as proxies for “merit in the judgement of the community”. The authors themselves note that “citations and h-indices do not impose a total order on the quality of a mathematician” and emphasize that “it is quite obvious that, unlike in competitive swimming, imposing such an order is a fool’s errand.” Yet they proceed to discount their own advice, and wholeheartedly embark on the fool’s errand they warn against. 🤔

I examined the mathematicians in their dataset and first, as a sanity check, confirmed that I am one of them (I signed one of the letters). I then looked at the associated citation counts and noticed that out of 1435 mathematicians who signed the letters, I had the second highest number of citations according to Google Scholar (67,694), second only to Terence Tao (71,530). We are in the 99.9th percentile. 👏 Moreover, I have 27 times more citations than Igor Rivin. According to Paik and Rivin this implies that I have 27 times more merit in the judgement of our peers. I should say at least 27 times, because one might imagine that the judgement of the community is non-linear in the number of citations. Even if one discounts such quantitative comparisons (Paik and Rivin do note that Stephen Smale has fewer citations than Terence Tao, and that it would be difficult on that basis alone to conclude that Tao is the better mathematician), the preprint makes use of citation counts to assess “merit in the judgement of the community”, and thus according to Paik and Rivin my opinions have substantial merit. In fact, according to them, my opinion on diversity statements must be an extremely meritorious one. I imagine they would posit that my opinion on the debate that is raging in the math community regarding diversity statement requirements from job applicants is the correct, and definitive, one.

Now I can already foresee protestations that, for example, my article on “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation”, which has 9,438 citations, is not math per se, and that it shouldn’t count. I’ll note that my biology colleagues, after reading the supplement, think it’s a math paper, but in any case, if we are going to head down that route shouldn’t Paik and Rivin read the paper to be sure? And shouldn’t they read every paper of mine, and every paper of every signatory, to determine whether it is valid for their analysis? And shouldn’t they adjust the citation counts of every signatory? Well, they didn’t do any of that, and besides, they included me in their analysis so… I proceed…

The citation numbers above are based on Google Scholar citations. Paik and Rivin also analyze MathSciNet citations and state that they prefer them because “only published mathematics are in MathSciNet, and is hence a higher quality data source when comparing mathematicians.” I checked the relationship between Scholar and MathSciNet citations and found that, not surprisingly, they have a correlation of 0.92:

[Figure: scatter plot of Google Scholar vs. MathSciNet citation counts for the signatories]

I’d say they are therefore interchangeable in terms of the authors’ use of them as a proxy for “merit”.
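For readers who want to reproduce this kind of check, here is a minimal sketch of the correlation computation, with made-up citation counts standing in for the scraped data:

```python
import numpy as np

# Hypothetical paired citation counts for the same mathematicians
# (stand-ins for the scraped Google Scholar / MathSciNet values).
scholar = np.array([120, 450, 3100, 67, 980, 15000, 2300])
mathscinet = np.array([80, 300, 2500, 40, 700, 11000, 1900])

# Pearson correlation between the two citation sources.
r = np.corrcoef(scholar, mathscinet)[0, 1]
print(f"correlation: {r:.2f}")
```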

But citations are not proxies for merit. The entire premise of the preprint is ridiculous. Furthermore, even if it were true that citations are a meaningful attribute of the signatories to analyze, there are many other serious problems with the preprint.

The elephant not in the room

Paik and Rivin begin their preprint with a cursory examination of the data and immediately identify a potential problem… missing data. How much data is missing? 64.11% of individuals do not have associated Google Scholar citation data, and 78.82% don’t have MathSciNet citation data. Paik and Rivin brush this issue aside, remarking that “while this is not optimal, a quick sample size calculation shows that one needs 303 samples or 21% of the data to produce statistics at a 95% confidence level and a 5% confidence interval.” They are apparently unaware of the need for uniform population sampling, and don’t appear to even think for a second about the possible ascertainment biases in their data. I thought for a second.
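For reference, the “303 samples” figure comes from the standard sample-size formula for estimating a proportion with a finite population correction. Crucially, the formula assumes a simple random sample, exactly the assumption that ascertainment bias violates. A sketch of the arithmetic (my reconstruction, not the authors’ code):

```python
def sample_size(N, z=1.96, p=0.5, e=0.05):
    """Sample size for estimating a proportion with margin of error e,
    z the normal quantile for the confidence level, with a finite
    population correction for population size N.
    Assumes a simple random sample from the population."""
    n0 = z**2 * p * (1 - p) / e**2      # infinite-population size
    return n0 / (1 + (n0 - 1) / N)      # finite population correction

print(round(sample_size(1435)))  # 303 for the 1435 signatories
```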

For instance, I wondered whether there might be a discrepancy between the number of citations of women with Google Scholar pages vs. women without such pages. This is because I’ve noticed anecdotally that several senior women mathematicians I know don’t have Google Scholar pages, and since senior scientists presumably have more citations this could create a problematic ascertainment bias. I checked and there is, as expected, some correlation between age post-Ph.D. and citation count (cor = 0.36):

[Figure: citation count vs. age post-Ph.D. for the signatories]

To test whether there is an association between the presence of a Google Scholar page and citation number, I examined the average number of MathSciNet citations of women with and without Google Scholar pages. Indeed, the average number of citations of women with Google Scholar pages is much higher than that of women without one (898 vs. 621). For men the difference is much smaller (1816 vs. 1801). By the way, the difference in citation number between men and women is itself large, and can be explained by a number of differences, starting with the fact that the women represented in the database have a much lower age post-Ph.D. than the men (17.6 vs. 26.3), and therefore fewer citations (see the correlation between age and citations above).
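The comparison above is just a stratified mean. Assuming a signatory table with invented column names (gender, has_scholar_page, mathscinet_citations — these are not the dataset’s actual names), the check amounts to:

```python
import pandas as pd

# Toy stand-in for the signatory table; column names and values
# are my own invention, not those of the actual dataset.
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "has_scholar_page": [True, True, False, False, True, True, False, False],
    "mathscinet_citations": [1200, 600, 700, 540, 2000, 1600, 1900, 1700],
})

# Mean citations stratified by gender and Google Scholar presence.
means = df.groupby(["gender", "has_scholar_page"])["mathscinet_citations"].mean()
print(means)
```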

The analysis above suggests that perhaps one should use MathSciNet citation counts instead of Google Scholar. However, the extent of missing data for that attribute is highly problematic (78.82% missing values). For one thing, my own MathSciNet citation counts are missing, so there were probably bugs in the scraping. The numbers are also tiny: of the 452 women who signed the three letters, only 46 have MathSciNet data. I believe the data is unreliable. In fact, even my ascertainment bias analysis above is problematic due to the small number of individuals involved. It would be completely appropriate at this point to accept that the data is not of sufficient quality for even rudimentary analysis. Yet the authors continued.

A big word 

Confounder is a big word for a variable that influences both the dependent and independent variables in an analysis, thus causing a spurious association. The word does not appear in Paik and Rivin’s manuscript, which is unfortunate because it is in fact a confounder that explains their main “result”. This confounder is age. I’ve already shown the strong relationship between age post-Ph.D. and citation count in a figure above. Paik and Rivin examine the age distribution of the different letter signatories and find distinct differences. The figure below is reproduced from their preprint:

[Figure: age distributions of the signatories of the three letters, reproduced from Paik and Rivin]

The differences are stark: the mean time since Ph.D. completion of signers of Letter A is 14.64 years, of signers of Letter B it is 27.76 years, and of signers of Letter C it is 35.48 years. Presumably to control for this association, Paik and Rivin restricted the citation count computations to full professors. As it turns out, this restriction alone does not control for age.

The figure below shows the number of citations of letter C signatories who are full professors as a function of their age:

[Figure: citations of letter C signatories who are full professors vs. years post-Ph.D.]

The red line at 36 years post-Ph.D. divides two distinct regimes. The large jump at that time (corresponding to chronological age ~60) is not surprising: senior professors in mathematics are more famous and have more influence than their junior peers, and their work has had more time to be understood and appreciated. In mathematics, results can take many years before they are understood and integrated into mainstream mathematics. These are just hypotheses, but the exact reason for this effect is not important for the Paik-Rivin analysis. What matters is that there are almost no full professors among Letter A signers who are more than 36 years post-Ph.D. In fact, the number of such individuals (filtered for those who have published at least one article) is 2. Two individuals. That’s it.

Restricting the analysis to full professors less than 36 years post-Ph.D. tells a completely different story to the one Paik and Rivin peddle. The average number of citations of full professors who signed letter A (2922.72) is higher than the average number of citations of full professors who signed letter C (2348.85). Signers of letter B have 3148.83 citations on average. The figure for this analysis is shown below:

[Figure: citations of full professors fewer than 36 years post-Ph.D., by letter]

The main conclusion of Paik and Rivin, that signers of letter A have less merit than signers of letter B, who in turn have less merit than signers of letter C, can be seen to be complete rubbish. What the data reveal is simply that the signers of letter A are younger than the signers of the other two letters.
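The age-restricted comparison is nothing more than a filter before the group means; a sketch with invented column names and toy values:

```python
import pandas as pd

# Toy stand-in for the signatory table; the column names and values
# are my own invention, not those of the actual dataset.
df = pd.DataFrame({
    "letter": ["A", "A", "B", "C", "C", "C"],
    "rank": ["full"] * 6,
    "years_post_phd": [10, 20, 30, 25, 40, 50],
    "citations": [3000, 2800, 3100, 2300, 9000, 12000],
})

# Restrict to full professors fewer than 36 years post-Ph.D.,
# then average citations per letter.
young_full = df[(df["rank"] == "full") & (df["years_post_phd"] < 36)]
means = young_full.groupby("letter")["citations"].mean()
print(means)
```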

Note: I have performed my analysis in a Google Colab notebook accessible via the link. It allows for full reproducibility of the figures and numbers in this post, and facilitates easy exploration of the data. Of course there’s nothing to explore. Use of citations as a proxy for merit is a fool’s errand.


There are numerous other technical problems with the preprint. The authors claim to have performed “a control” (they didn’t). Several p-values are computed and reported without any multiple testing correction. Parametric approximations for the citation data are examined, but then ignored. Moreover, appropriate zero-inflated count distributions for such data are never considered (see e.g. Yong-Gil et al. 2007). The results presented are all univariate (e.g. histograms of one data type); there is not a single scatterplot in the preprint! This suggests that the authors are unaware of the field of multivariate statistics. Considering all of this, I encourage the authors to enroll in an introductory statistics class.

The Russians

In a strange final paragraph of the Conclusion and Discussion section of their preprint, Paik and Rivin speculate on why mathematicians from communist countries are not represented among the signers of letter A. They present hypotheses without any data to back up their claims.

The insistence that some mathematicians, e.g. Mikhail Gromov, who signed letters B and C and is a full member at IHES and a professor at NYU, are not part of the “power elite” of mathematics is just ridiculous. Furthermore, characterizing someone like Gromov, who arrived in the US from Russia to an arranged job at SUNY Stony Brook (thanks to Tony Phillips), as being a member of a group who “arrived at the US with nothing but the shirts on their backs” is bizarre.

Diversity matters

I find the current debate in the mathematics community surrounding Prof. Thompson’s letter very frustrating. The comparison of diversity statements to McCarthy’s loyalty oaths is ridiculous. Instead of debating such nonsense, mathematicians need to think long and hard about how to change the culture in their departments, a culture that has led to appallingly few under-represented minorities and women in the field. Under-represented minorities and women routinely face discrimination and worse. This is completely unacceptable.

The preprint by Paik and Rivin is a cynical attempt to use the Thompson kerfuffle to advertise the tired trope of the second-rate mathematician being the one to advocate for greater diversity in mathematics. It’s a sad refrain that has a long history in mathematics. But perhaps it’s not surprising. The colleagues of Jerzy Neyman in his mathematics department could not even stomach a statistician, let alone a woman, let alone a person from an under-represented minority group. However I’m optimistic reading the list of signatories of letter A. Many of my mathematical heroes are among them. The future is theirs, and they are right.

Algorithmic bias is a term used to describe situations where an algorithm systematically produces outcomes that are less favorable to individuals within a particular group, despite there being no relevant properties of individuals in that group that should lead to distinct outcomes from other groups. As “big data” approaches become increasingly popular for optimizing complex tasks, numerous examples of algorithmic bias have been identified, and sometimes the implications can be serious. As a result, algorithmic bias has become a matter of concern, and there are ongoing efforts to develop methods for detecting it and mitigating its effects. However, despite increasing recognition of the problems due to algorithmic bias, sometimes bias is embraced by the individuals it benefits. For example, in her book Weapons of Math Destruction, Cathy O’Neil discusses the gaming of algorithmic rankings of universities via exploitation of algorithmic bias in ranking algorithms. While there is almost universal agreement that algorithmic rankings of universities are problematic, many faculty at universities that do achieve a top ranking choose to ignore the problems so that they can boast of their achievement.

Of the algorithms that are embraced in academia, Google Scholar is certainly among the most popular. It’s used several times a day by every researcher I know to find articles via keyword searches, and Google Scholar pages have made it straightforward for researchers to create easily updatable publication lists. These now serve as proxies for formal CV publication lists, with the added benefit (?) that citation metrics such as the h-index are displayed as well (Jacsó, 2012). Provided as an option along with publication lists, the Google Scholar coauthor list of a user can be displayed on the page. Google offers users who have created a Google Scholar page the ability to view suggested coauthors, and authors can then select to add or delete those suggestions. Authors can also add as coauthors individuals not suggested by Google. The Google Scholar coauthor rankings and the suggestion lists are generated automatically by an algorithm that has not, to my knowledge, been disclosed.

Google Scholar coauthor lists are useful. I occasionally click on the coauthor lists to find related work, or to explore the collaboration network of a field that may be tangentially related to mine but that I’m not very familiar with. At some point I started noticing that the lists were almost entirely male. Frequently, they were entirely male. I decided to perform a simple exercise to understand the severity of what appeared to me to be a glaring problem:

Let the Google Scholar coauthor graph be a directed graph GS = (V,E) whose vertices correspond to authors in Google Scholar, with an edge (v_1,v_2) \in E from v_1 \in V to v_2 \in V if author v_2 is listed as a coauthor on the main page of author v_1. We define an author v to be manlocked (terminology thanks to Páll Melsted) if its out-degree is at least 1, and if every vertex w that it is adjacent to (i.e., for which (v,w) is an edge, meaning w is ranked among the top twenty coauthors by Google Scholar and appears on the front page of v) is a male.
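A sketch of the definition in code, with an invented toy graph and invented gender labels (in the real setting the adjacency lists would come from scraped Scholar pages):

```python
# Coauthor adjacency: v -> list of coauthors shown on v's front page
# (at most 20 in the real Google Scholar setting). Toy data only.
coauthors = {
    "alice": ["bob", "carol"],
    "bob": ["dan", "erik"],   # all listed coauthors are men
    "dan": [],                # out-degree 0: not manlocked by definition
}
gender = {"alice": "F", "bob": "M", "carol": "F", "dan": "M", "erik": "M"}

def is_manlocked(v):
    """A vertex is manlocked if it has out-degree >= 1 and every
    displayed coauthor is a man."""
    out = coauthors.get(v, [])
    return len(out) >= 1 and all(gender[w] == "M" for w in out)

print(is_manlocked("bob"))    # True
print(is_manlocked("alice"))  # False (carol is a woman)
print(is_manlocked("dan"))    # False (out-degree 0)
```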

For example, the Google Scholar page of Steven Salzberg is not manlocked: of the 20 coauthors listed on the Scholar page, only 18 are men. However, several of the vertices it is adjacent to, for example the one corresponding to the Google Scholar page of Ben Langmead, are manlocked. There are so many manlocked vertices that it is not difficult, starting at a manlocked vertex, to embark on a long manlocked walk in the GS graph, hopping from one manlocked vertex to another. For example, starting with the manlocked Dean of the College of Computer, Mathematical and Natural Sciences at the University of Maryland, we find a manlocked walk of length 14 (I leave it as an exercise for the reader to find the longest walk that this walk is contained in):

Amitabh Varshney → Jihad El Sana → Peter Lindstrom → Mark Duchaineau → Alexander Hartmaier → Anxin Ma → Roger Reed → David Dye → Peter D Lee → Oluwadamilola O. Taiwo → Paul Shearing → Donal P. Finegan → Thomas J. Mason → Tobias Neville

A country is doubly landlocked when it is surrounded only by landlocked countries. There are only two such countries in the world: Uzbekistan and Liechtenstein. Motivated by this observation, we define a vertex in the Google Scholar coauthor graph to be doubly manlocked if it is adjacent only to manlocked vertices.
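Continuing the analogy in code, doubly manlocked adds one more quantifier: every displayed coauthor must itself be manlocked. A self-contained sketch with an invented all-male toy graph:

```python
# Invented toy graph in which every author is a man and every page
# lists only men; "u" is then doubly manlocked.
coauthors = {"u": ["v", "w"], "v": ["x"], "w": ["x"], "x": ["v"]}
gender = {"u": "M", "v": "M", "w": "M", "x": "M"}

def is_manlocked(a):
    """Out-degree >= 1 and every displayed coauthor is a man."""
    out = coauthors.get(a, [])
    return len(out) >= 1 and all(gender[b] == "M" for b in out)

def is_doubly_manlocked(a):
    """Out-degree >= 1 and adjacent only to manlocked vertices."""
    out = coauthors.get(a, [])
    return len(out) >= 1 and all(is_manlocked(b) for b in out)

print(is_doubly_manlocked("u"))  # True
```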

Open problem: determine the number of doubly manlocked individuals in the Google Scholar coauthor graph.


Why are there so many manlocked vertices in the Google Scholar coauthorship graph? Some hypotheses:

  1. Publications by women are cited less than those of men (Aksnes et al. 2011).
  2. Men tend to publish more with other men and there are many more men publishing than women (see, e.g. Salerno et al. 2019, Wang et al. 2019).
  3. Men who are “equally contributing” co-first authors are more “equal” than their women co-first authors (Broderick and Casadevall 2019). Google Scholar’s coauthor recommendations may give preference to the first-listed of co-first authors.
  4. I am not privy to Google’s algorithms, but Google Scholar’s coauthor recommendations may also be biased towards coauthors on highly cited papers. Such papers will tend to be older papers. While the gender ratio today is heavily skewed towards men, it was even more so in the past. For example, Steven Salzberg, the senior scientist mentioned above who lists 18 men among the twenty coauthors on his Google Scholar page, has graduated 12 successful Ph.D. students in the past, 11 of whom are men. In other words, the extent of manlocked vertices may be the result of algorithmic bias that is inadvertently highlighting the gender homogeneity of the past.
  5. Many successful and prolific women may not be using Google Scholar (I can think of many in my own field, but was not able to find a study confirming this empirical observation). If this is true, the absence of women on Google Scholar would directly inflate the number of manlocked vertices. Moreover, in surveying many Google Scholar pages, I found that women with Google Scholar pages tend to have more women as coauthors than the men do.
  6. Even though Google Scholar allows for manually adding coauthors, it seems most users are blindly accepting the recommendations without thinking carefully about what coauthorship representation best reflects their actual professional relationships and impactful work. Thus, individuals may be supporting the algorithmic bias of Google Scholar by depending on its automation. Google may be observing that users tend to click on coauthors that are men at a high rate (since those are the ones being displayed) thus reinforcing for itself with data the choices of the coauthorship algorithm.

The last point above (#6) raises an interesting general issue with Google Scholar. While Google Scholar appears to be fully automated, and indeed, in addition to suggesting coauthors automatically the service will also automatically add publications, the Google Scholar page of an individual is completely customizable. In addition to the coauthors being customizable, the papers that appear on a page can be manually added or deleted, and in fact even the authors or titles of individual papers can be changed. In other words, Google Scholar can be easily manipulated with authors using “algorithmic bias” as a cover (“oops, I’m so sorry, the site just added my paper accidentally”). Are scientists actually doing this? You bet they are (I leave it as an exercise for the reader to find examples).

Yesterday I found out via a comment on this blog that Yuval Peres, a person who has been accused by numerous students, trainees, and colleagues of sexual harassment, will be delivering a lecture today in the UC Davis Mathematical Physics and Probability Seminar.

The facts

I am aware of at least 11 allegations by women of sexual harassment by Yuval Peres (trigger warning: descriptions of sexual harassment and sexual assault):

  1. Allegation of sexual harassment of a Ph.D. student in 2007. Source: description of the harassment by the victim.
  2. Allegation of sexual harassment by a colleague that happened when she was younger. Source: description of the harassment by the victim.
  3. Allegation of sexual harassment of a woman prior to 2007. Source: report on sexual harassment allegations against Yuval Peres by the University of Washington (received via a Freedom of Information Act Request).
  4. Allegation of sexual harassment by one of Yuval Peres’ Ph.D. students several years ago. Source: report on sexual harassment allegations against Yuval Peres by the University of Washington (received via a Freedom of Information Act Request).
  5. Allegation of sexual harassment of a colleague. Source: personal communication to me by the victim (who wishes to remain anonymous) via email after I wrote a post about Yuval Peres.
  6. Allegation of sexual harassment of a graduate student. Source: personal communication to me by the victim (the former graduate student who wishes to remain anonymous) via email after I wrote a post about Yuval Peres.
  7. Recent allegations of sexual harassment by 5 junior female scientists who reported unwanted advances by Yuval Peres to persons that leading figures in the CS community describe as “people we trust without a shred of doubt”. Source: a letter circulated by Irit Dinur, Ehud Friedgut and Oded Goldreich.

The details offered by these women of the sexual harassment they experienced are horrific and corroborate each other. His former Ph.D. student (#4 above) describes, in a harrowing letter included in the University of Washington Freedom of Information Act (FOIA) disclosed report, sexual harassment she experienced over the course of two years, and many of the details are similar to what is described by another victim here. The letter describes sexual harassment that had its origins when the student was an undergraduate (adding insult to injury, the University of Washington did not redact her name in the FOIA-disclosed report). I had extreme difficulty reading some of the descriptions, and believe the identity of the victim should be kept private despite the University of Washington FOIA report, but am including one excerpt here so that it’s clear what exactly these allegations entail (the letter is 4.5 pages long):

Trigger warning: description of sexual harassment and sexual assault

“While walking down a street he took my hand, I took it away with pressure but he grabbed it by force. I was pretty afraid of getting in a fight with my PhD advisor. He stroked my hand with his fingers. I said stop, but he ignored it. I started talking about math intending to make the situation less intimate. But he used me being distracted and put his arms around my waist touching my bud. I was in shock. We came by a bench. He asked me to sit down. I removed his hands and sat down far from him. He came closer and told me that I had a body like a barbie doll. I changed topic again to math, but he took my hand and kissed the back of my hand. I freed my hand with a sudden move, and saw him leaning towards me touching my hair and trying to kiss me. I felt danger and wanted to go home. Yuval was again holding my hand, but this time there was no resistance from me. I thought if I let him hold my hand it is less likely that he harms me. Arriving at my home he tried to give me a kiss. I was relieved when he drove away.”

The victim sent this letter to the chairs of the mathematics and computer science departments at the University of Washington and made a request:

“I am not the only female who was sexually harassed by Yuval Peres and I am convinced that I was not the last one. Therefore, I hope with this report that you take actions to prevent incidents like this from happening again.”

Instead of passing on the complaint to Title IX, and contrary to claims by some of Yuval Peres’ colleagues that appear in the University of Washington FOIA disclosure report that the case was investigated, the chairs of the University of Washington math and computer science departments (in a jointly signed letter) offered Yuval Peres a path to avoiding investigation:

“As you know from our e-mail to you [last week], your resignation as well as an agreement not to seek or accept another position at the University will eliminate the need for the University to investigate the allegations against you.”

Indeed, Yuval Peres resigned within two months of the complaint with no investigation ever taking place. This is the email the victim received afterwards from the chair of the mathematics department, in response to her request that “I hope with this report that you take actions to prevent incidents like this from happening again”:

“I believe this resolution [Yuval Peres’ resignation] has promptly and effectively addressed your concerns.”

At least 8 women have since claimed that they were sexually harassed.

Seminar and a dinner

As is customary with invited speakers, the organizer of the seminar today wrote to colleagues and students in the math department at UC Davis on Monday letting people know that “there will be a dinner afterward, so please let me know if you are interested in attending.”

Here is a description of a dinner Yuval Peres took his Ph.D. student to, and a summary of the events that led to him and his Ph.D. student walking down the street when he forcibly grabbed her hand:

Trigger warning: description of sexual harassment and sexual assault

“I tried to keep the dinner short, but suddenly he seemed to have a lot of time. He paid in cash in contrast to dinners with other students, and offered to take me home. In his car half way to my place he said he would only take me home if I show him my room (I was living in a shared apartment with other people). I thought it was a joke and said no. He laughed and grabbed my hand. Arriving at home I said goodbye. But when I got out of the car he said that I promised to show him my room. I said that I did not. However, he followed me to the backdoor of the house. Fortunately some of my roommates were at home. It bothered Yuval that we were not alone at my home, so he said we should take a walk outside. I felt uncomfortable but I still needed to talk about my PhD thesis work. While walking down the street he took my hand, I took it away with pressure but he grabbed it with force…”

I wonder how many graduate students at UC Davis will feel comfortable signing up for dinner with Yuval Peres tonight, or even be able to handle attending his seminar, after reading all of the sexual harassment allegations against him.

The challenge is particularly acute for women. I know this from comments in the reports of sexual harassment that I’ve read, from the University of Washington FOIA-disclosed report, and from personal communication with multiple women who have worked with him or had to deal with him. Isn’t holding seminars (which are an educational program) that women are afraid to attend, and from which they are therefore de facto excluded and denied benefit, in a department that depends heavily on federal funding, a Title IX violation? Title IX federal law states that

“No person in the United States shall, on the basis of sex, be excluded from participation in, be denied the benefits of, or be subjected to discrimination under any education program or activity receiving Federal financial assistance.”

An opinion

It’s outrageous that UC Davis’ math department is hosting Yuval Peres for a seminar and dinner today.

[Update November 10th, 2019: after reading this post a former Ph.D. student at UC Berkeley wrote that “Another PhD student in Berkeley probability and I both experienced this as well. About time this is called out so no more new students are harassed.“]

The arXiv preprint server has its roots in an “e-print” email list curated by astrophysicist Joanne Cohn, who in 1989 had the idea of organizing the sharing of preprints among her physics colleagues. In her recollections of the dawn of the arXiv,  she mentions that “at one point two people put out papers on the list on the same topic within a few days of each other” and that her “impression was that because of the worldwide reach of [her] distribution list, people realized it could be a way to establish precedence for research.” In biology, where many areas are crowded and competitive, the ability to time stamp research before a possibly lengthy journal review and publication process is almost certainly one of the driving forces behind the rapid growth of the bioRxiv (ASAPbio 2016 and Vale & Hyman, 2016).

However the ability to establish priority with preprints is not, in my opinion, what makes them important for science. Rather, the value of preprints is in their ability to accelerate research via the rapid dissemination of methods and discoveries. This point was eloquently made by Stephen Quake, co-president of the Chan Zuckerberg Biohub, at a Caltech Kavli Nanoscience Institute Distinguished Seminar Series talk earlier this year. He demonstrated the impact of preprints and of sharing data prior to journal publication by way of example, noting that posting of the CZ Biohub “Tabula Muris” preprint along with the data directly accelerated two unrelated projects: Cusanovich et al. 2018 and La Manno et al. 2018. In fact, in the case of La Manno et al. 2018, Quake revealed that one of the corresponding authors of the paper, Sten Linnarsson, had told him that “[he] couldn’t get the paper past the referees without using all of the [Tabula Muris] data”:

Moreover, Quake made clear that the open science principles practiced with the Tabula Muris preprint were not just a one-off experiment, but fundamental Chan Zuckerberg Initiative (CZI) values that are required for all CZI internal research and for publications arising from work the CZI supports: “[the CZI has] taken a pretty aggressive policy about publication… people have to agree to use biorXiv or a preprint server to share results… and the hope is that this is going to accelerate science because you’ll learn about things sooner and be able to work on them”:

Indeed, on its website the CZI lists four values that guide its mission and one of them is “Open Science”:

Open Science
The velocity of science and pace of discovery increase as scientists build on each others’ discoveries. Sharing results, open-source software, experimental methods, and biological resources as early as possible will accelerate progress in every area.

This is a strong and direct rebuttal to Dan Longo and Jeffrey Drazen’s “research parasite” fear mongering in The New England Journal of Medicine.



I was therefore disappointed with the CZI after failing, for the past two months, to obtain the code and data for the preprint “A molecular cell atlas of the human lung from single cell RNA sequencing” by Travaglini, Nabhan et al. (the preprint was posted on the bioRxiv on August 27th 2019). The interesting preprint describes an atlas of 58 cell populations in the human lung, including 41 of 45 previously characterized cell types or subtypes along with 14 newly discovered ones. Of particular interest to me, in light of some ongoing projects in my lab, is a comparative analysis examining cell type concordance between human and mouse. Travaglini, Nabhan et al. note that 17 molecular types have been gained or lost since the divergence of human and mouse. The results are based on large-scale single-cell RNA-seq (using two technologies) of ~70,000 human lung and peripheral blood cells.

The comparative analysis is detailed in Extended Data Figure S5 (reproduced below), which shows scatter plots of (log) gene counts for homologous human and mouse cell types. For each pair of cell types, a sub-figure also shows the correlation between gene expression in the two species, with divergent genes highlighted:


I wanted to understand the details behind this figure: how exactly were cell types defined and homologous cell types identified? What was the precise thresholding for “divergent” genes? How were the ln(CPM+1) expression units computed? Some aspects of these questions have answers in the Methods section of the preprint, but I wanted to know exactly; I needed to see the code. For example, the manuscript describes the cluster selection procedure as follows: “Clusters of similar cells were detected using the Louvain method for community detection including only biologically meaningful principle [sic] components (see below)” and looking “below” for the definition of “biologically meaningful” I only found a descriptive explanation illustrated with an example, but with no precise specification provided. I also wanted to explore the data. We have been examining some approaches for cross-species single-cell analysis and this preprint describes an exceptionally useful dataset for this purpose. Thus, access to the software and data used for the preprint would accelerate the research in my lab.
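For example, the preprint does not pin down how the ln(CPM+1) units were computed. The standard computation, which I can only assume is what was done, is:

```python
import numpy as np

# Raw gene counts for one cell (toy values, not from the preprint).
counts = np.array([100.0, 300.0, 600.0])

# Counts per million: scale so the cell's counts sum to 1e6,
# then take the natural log of (CPM + 1).
cpm = counts / counts.sum() * 1e6
log_expr = np.log1p(cpm)
print(log_expr)
```

Even this simple transform has variants (e.g. whether the denominator includes filtered genes), which is exactly why the code matters.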

But while the preprint has a sentence with a link to the software (“Code for demultiplexing counts/UMI tables, clustering, annotation, and other downstream analyses are available on GitHub (https://github.com/krasnowlab/HLCA)”) clicking on the link merely sends one to the Github Octocat.

[Image: the Github 404 page reached by following the link]

The Travaglini, Nabhan et al. Github repository that is supposed to contain the analysis code is nowhere to be found. The data is also not available in any form. The preprint states that “Raw sequencing data, alignments, counts/UMI tables, and cellular metadata are available on GEO (accession GEOXX).” The only data a search for GEOXX turns up is a list of prices on a shoe website.

I wrote to the authors of Travaglini, Nabhan et al. right after their preprint appeared noting the absence of code and data and asking for both. I was told by one of the first co-authors that they were in the midst of uploading the materials, but that the decision of whether to share them would have to be made by the corresponding authors. Almost two months later, after repeated requests, I have yet to receive anything. My initial excitement for the Travaglini, Nabhan et al. single-cell RNA-seq has turned into disappointment at their zero-data RNA-seq.

🦗 🦗 🦗 🦗 🦗 

This state of affairs, namely the posting of bioRxiv preprints without data or code, is far too commonplace. I was first struck by the extent of the problem last year when the Gupta, Collier et al. 2018 preprint was posted without a Methods section (let alone data or code). Also problematic was that the preprint was posted just three months before publication, while the journal submission was under review. I say problematic because not sharing code, software, or methods, and not posting the preprint at the time of submission to a journal, does not accelerate progress in science (see the CZI Open Science values statement above).

The Gupta, Collier et al. preprint was not a CZI related preprint but the Travaglini, Nabhan et al. preprint is. Specifically, Travaglini, Nabhan et al. 2019 is a collaboration between CZ Biohub and Stanford University researchers, and the preprint appears on the Chan Zuckerberg Biohub bioRxiv channel:

[Screenshot: the Travaglini, Nabhan et al. 2019 preprint listed on the Chan Zuckerberg Biohub bioRxiv channel]

The Travaglini, Nabhan et al. 2019 preprint is also not an isolated example; another recent CZ Biohub preprint from the same lab, Horns et al. 2019, states explicitly that “Sequence data, preprocessed data, and code will be made freely available [only] at the time of [journal] publication.” These are cases where instead of putting its money where its mouth is, the mouth took the money, ate it, and spat out a 404 error.

[GIF: angry Meryl Streep]

To be fair, sharing data, software and methods is difficult. Human data must sometimes be protected due to confidentiality constraints, thus requiring controlled-access repositories such as dbGaP that can be difficult to set up. Even with unrestricted data, sharing can be cumbersome. For example, the SRA upload process is notoriously difficult to manage, and the lack of metadata standards can make organizing experimental data, especially sequencing data, complicated and time consuming. The sharing of experimental protocols can be challenging when they are in flux and still being optimized while work is being finalized. And when it comes to software, ensuring reproducibility and usability can take months of work in the form of wrangling Snakemake and other workflows, not to mention the writing of documentation. Practicing Open Science, I mean really doing it, is difficult work. There is a lot more to it than just dumping an advertisement on the bioRxiv to collect a timestamp. By not sharing their data or software, preprints such as Travaglini, Nabhan et al. 2019 and Horns et al. 2019 appear to be little more than a cynical attempt to claim priority.

It would be great if the CZI, an initiative backed by billions of dollars with hundreds of employees, would truly champion Open Science. The Tabula Muris preprint is a great example of how preprints that are released with data and software can accelerate progress in science. But Tabula Muris seems to be an exception for CZ Biohub researchers rather than the rule, and actions speak louder than a website with a statement about Open Science values.

A few months ago, in July 2019, I wrote a series of five posts about the Melsted, Booeshaghi et al. 2019 preprint on this blog, including a post focusing on a new fast workflow for RNA velocity based on kallisto | bustools. This new workflow replaces the velocyto software published with the RNA velocity paper (La Manno et al. 2018), in the sense that the kallisto | bustools workflow is more than an order of magnitude faster than velocyto (and Cell Ranger, which it builds on), while producing near identical results:


In my blogpost, I showed how we were able to utilize this much faster workflow to easily produce RNA velocity for a large dataset of 113,917 cells from Clark et al. 2019, a dataset that was intractable with Cell Ranger + velocyto.

The kallisto | bustools RNA velocity workflow makes use of two novel developments: a new mode in kallisto called kallisto bus that produces BUS files from single-cell RNA-seq data, and a new collection of C++ programs forming “bustools” that can be used to process BUS files to produce count matrices. The RNA velocity workflow utilizes the bustools sort, correct, capture and count commands. With these tools an RNA velocity analysis that previously took a day now takes minutes. The workflow, associated tools and validations are described in the Melsted, Booeshaghi et al. 2019 preprint.
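The idea behind the sort and correct steps can be illustrated with a minimal Python sketch; the record layout (barcode, UMI, gene) and the tiny whitelist below are simplifications for illustration, not the actual binary BUS format:

```python
def hamming1_neighbors(bc, alphabet="ACGT"):
    # All sequences within Hamming distance 1 of bc.
    for i, c in enumerate(bc):
        for a in alphabet:
            if a != c:
                yield bc[:i] + a + bc[i + 1:]

def correct_and_sort(records, whitelist):
    # Keep records whose barcode is on the whitelist, or can be uniquely
    # rescued by a single substitution (the idea behind bustools correct),
    # then order the records (the idea behind bustools sort).
    kept = []
    for bc, umi, gene in records:
        if bc in whitelist:
            kept.append((bc, umi, gene))
        else:
            hits = [n for n in hamming1_neighbors(bc) if n in whitelist]
            if len(hits) == 1:  # unambiguous rescue
                kept.append((hits[0], umi, gene))
    return sorted(kept)

whitelist = {"AAAA", "CCCC"}
records = [("CCCC", "TT", "geneY"),  # valid barcode, kept
           ("AAAT", "GG", "geneX"),  # one mismatch from AAAA, rescued
           ("GGGG", "AA", "geneZ")]  # no whitelist neighbor, dropped
print(correct_and_sort(records, whitelist))
# [('AAAA', 'GG', 'geneX'), ('CCCC', 'TT', 'geneY')]
```

The real implementations operate on compact binary BUS records and scale to hundreds of millions of reads, but the logic is the same.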

Now, in an interesting exercise presented at the Single Cell Genomics 2019 meeting, Sten Linnarsson revealed that he reimplemented the kallisto | bustools workflow in Loompy. Loompy, which previously consisted of some Python tools for reading and writing Loom files, now has a function that runs kallisto bus. It also includes Python functions for manipulating BUS files; these are Python reimplementations of the bustools functions needed for RNA velocity and produce the same output as kallisto | bustools. It is therefore possible to now answer a question I know has been on many minds… one that has been asked before, but not, to my knowledge, in the single-cell RNA-seq setting… is Python really faster than C++?

To answer this question we performed an apples-to-apples comparison (this analysis was carried out with Sina Booeshaghi), running kallisto | bustools and Loompy on exactly the same data, with the same hardware. We pre-processed both the human forebrain data from La Manno et al. 2018 and data from Oetjen et al. 2018 consisting of 498,303,099 single-cell RNA-seq reads sequenced from a cDNA library of human bone marrow (SRA accession SRR7881412; see also PanglaoDB).

First, we examined the correspondence between Loompy and bustools on the human forebrain data. As expected, given that the Loompy code first runs the same kallisto as in the kallisto | bustools workflow, and then reimplements bustools, the results are near identical. In the first plot every dot is a cell (as defined by the velocyto output from La Manno et al. 2018) and the number of counts produced by each method is shown. In the second, the correlation between gene counts in each cell is plotted:

The figures above are produced from the “spliced” RNA velocity matrix. We also examined the “unspliced” matrix, with similar results:
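The kind of per-cell correspondence shown in these plots is simple to compute with numpy; the matrices below are made-up toy values, not the actual forebrain counts:

```python
import numpy as np

# Toy cells-by-genes count matrices from two hypothetical workflows.
counts_a = np.array([[10, 0, 5], [3, 7, 2], [0, 4, 9]])
counts_b = np.array([[11, 0, 5], [3, 6, 2], [0, 4, 8]])

# First plot: total counts per cell under each method (one dot per cell).
totals_a = counts_a.sum(axis=1)
totals_b = counts_b.sum(axis=1)

# Second plot: Pearson correlation of gene counts within each cell.
def cell_correlations(a, b):
    return np.array([np.corrcoef(a[i], b[i])[0, 1] for i in range(a.shape[0])])

print(totals_a, totals_b)                     # per-cell totals
print(cell_correlations(counts_a, counts_b))  # close to 1 for every cell
```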

In runtime benchmarks on the Oetjen et al. 2018 data we found that kallisto | bustools runs 3.75 times faster than Loompy (note that by default Loompy runs kallisto with all available threads, so we modified the Loompy source code to create a fair comparison). Furthermore, kallisto | bustools requires 1.8 times less RAM. In other words, despite rumors to the contrary, Python is indeed slower than C++!

Of course, sometimes there is utility in reimplementing software in another language, even a slower one. For example, a reimplementation of C++ code could lead to a simpler workflow in a higher level language. That’s not the case here. The memory efficiency of kallisto | bustools makes possible the simplest user interface imaginable: a kallisto | bustools based Google Colab notebook allows for single-cell RNA-seq pre-processing in the cloud on a web browser without a personal computer.

At the recent Single Cell Genomics 2019 meeting, Linnarsson noted that Cell Ranger + velocyto has been replaced by kallisto | bustools:


Indeed, as I wrote on my blog shortly after the publication of Melsted, Booeshaghi et al., 2019, RNA velocity calculations that were previously intractable on large datasets are now straightforward. Linnarsson is right. Users should always adopt best-in-class tools over methods that underperform in accuracy, efficiency, or both. #methodsmatter

This post is the fifth in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

  1. Near-optimal pre-processing of single-cell RNA-seq
  2. Single-cell RNA-seq for dummies
  3. How to solve an NP-complete problem in linear time
  4. Rotating the knee (plot) and related yoga
  5. High velocity RNA velocity

The following passage about Beethoven’s fifth symphony was written by one of my favorite musicologists:

“No great music has ever been built from an initial figure of four notes. As I have said elsewhere, you might as well say that every piece of music is built from an initial figure of one note. You may profitably say that the highest living creatures have begun from a single nucleated cell. But no ultra-microscope has yet unraveled the complexities of the single living cell; nor, if the spectroscope is to be believed, are we yet very fully informed of the complexities of a single atom of iron: and it is quite absurd to suppose that the evolution of a piece of music can proceed from a ‘simple figure of four notes’ on lines in the least resembling those of nature.” – Donald Francis Tovey writing about Beethoven’s Fifth Symphony in Essays in Musical Analysis Volume I, 1935.

This passage conveys something true about Beethoven’s fifth symphony: an understanding of it cannot arise from a limited fixation on the famous four note motif. As far as single-cell biology goes, I don’t know whether Tovey was familiar with Theodor Boveri‘s sea urchin experiments, but he certainly hit upon a scientific truth as well: single cells cannot be understood in isolation. Key to understanding them is context (Eberwine et al., 2013).

RNA velocity, with roots in the work of Zeisel et al., 2011, has recently been adapted to single-cell RNA-seq by La Manno et al. 2018, and provides much needed context for interpreting the transcriptomes of single cells in the form of a dynamics overlay. Since writing a review about the idea last year (Svensson and Pachter, 2019), I’ve become increasingly convinced that the method, despite relying on sparse data, numerous very strong model assumptions, and lots of averaging, is providing meaningful biological insight. For example, in a recent study of spermatogonial stem cells (Guo et al. 2018), the authors describe two “unexpected” transitions between distinct states of cells that are revealed by RNA velocity analysis (panel a from their Figure 6, see below):


Producing an RNA velocity analysis currently requires running the programs Cell Ranger followed by velocyto. These programs are both very slow. Cell Ranger’s running time scales at about 3 hours per hundred million reads (see Supplementary Table 1 Melsted, Booeshaghi et al., 2019). The subsequent velocyto run is also slow. The authors describe it as taking “approximately 3 hours” but anecdotally the running time can be much longer on large datasets. The programs also require lots of memory.

To facilitate rapid and large-scale RNA velocity analysis, in Melsted, Booeshaghi et al., 2019 we describe a kallisto|bustools workflow that makes possible efficient RNA velocity computations at least an order of magnitude faster than with Cell Ranger and velocyto. The work, a tour-de-force of development, testing and validation, was primarily that of Sina Booeshaghi. Páll Melsted implemented the bustools capture command and Kristján Hjörleifsson assisted with identifying and optimizing the indices for pseudoalignment. We present analysis on two datasets in the paper. The first is single-cell RNA-seq from retinal development recently published in Clark et al. 2019. This is a beautiful paper, and I don’t mean just in terms of the results. Their data and results are extremely well organized, making their paper reproducible. This is so important it merits a shout out 👏🏾

See Clark et al. 2019‘s GEO accession GSE118614 for a well-organized and useful data share.

The figure below shows RNA velocity vectors overlaid on UMAP coordinates for Clark et al.’s 10 stage time series of retinal development (see cell [8] in our python notebook):


An overlap on the same UMAP with cells colored by type is shown below:


Clark et al. performed a detailed pseudotime analysis in their paper, which successfully identified genes associated with cell changes during development. This is a reproduction of their figure 2:


We examined the six genes from their panel C from a velocity point of view using the scvelo package and the results are beautiful:


What can be seen with RNA velocity is not only the changes in expression that are extracted from pseudotime analysis (Clark et al. 2019 Figure 2 panel C), but also changes in their velocity, i.e. their acceleration (middle column above). RNA velocity adds an interesting dimension to the analysis.

To validate that our kallisto|bustools RNA velocity workflow provides results consistent with velocyto, we performed a direct comparison with the developing human forebrain dataset published by La Manno et al. in the original RNA velocity paper (La Manno et al. 2018 Figure 4).


The results are concordant, not only in terms of the displayed vectors, but also, crucially, in the estimation of the underlying phase diagrams (the figure below shows a comparison for the same dataset; kallisto on the left, Cell Ranger + velocyto on the right):


Digging deeper into the data, one difference we found between the workflows (other than speed) is the number of read counts. We implemented a simple strategy to estimate the required spliced and unspliced matrices that attempts to follow the one described in the La Manno et al. paper, where the authors describe the rules for characterizing reads as spliced vs. unspliced as follows:

1. A molecule was annotated as spliced if all of the reads in the set supporting a given molecule map only to the exonic regions of the compatible transcripts.
2. A molecule was annotated as unspliced if all of the compatible transcript models had at least one read among the supporting set of reads for this molecule mapping that i) spanned exon-intron boundary, or ii) mapped to the intron of that transcript.
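A literal reading of these two rules can be sketched in Python; the data layout below (a dict from each compatible transcript model to per-read mapping labels) is a simplification for illustration, not how either implementation actually represents alignments:

```python
def classify_molecule(transcript_reads):
    # transcript_reads maps each compatible transcript model to the mapping
    # type of each supporting read relative to that transcript:
    # "exonic", "intronic", or "spanning" (an exon-intron boundary).
    if all(t == "exonic"
           for reads in transcript_reads.values() for t in reads):
        return "spliced"    # rule 1: every read purely exonic
    if all(any(t in ("intronic", "spanning") for t in reads)
           for reads in transcript_reads.values()):
        return "unspliced"  # rule 2: every compatible transcript has
                            # intronic or boundary-spanning evidence
    return "ambiguous"      # mixed evidence; counted in neither matrix

print(classify_molecule({"tx1": ["exonic", "exonic"]}))   # spliced
print(classify_molecule({"tx1": ["spanning"],
                         "tx2": ["intronic"]}))           # unspliced
print(classify_molecule({"tx1": ["exonic"],
                         "tx2": ["intronic"]}))           # ambiguous
```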

In the kallisto|bustools workflow this logic was implemented via the bustools capture command, which was first used to identify all reads that were compatible only with exons (i.e. there was no pseudoalignment to any intron) and then all reads that were compatible only with introns (i.e. there was no pseudoalignment completely within an exon). While our “spliced matrices” had similar numbers of counts, our “unspliced matrices” had considerably more (see Melsted, Booeshaghi et al. 2019 Supplementary Figure 10A and B):


To understand the discrepancy better we investigated the La Manno et al. code, and we believe that differences arise from the velocyto package logic.py code in which the same count function

def count(self, molitem: vcy.Molitem, cell_bcidx: int, dict_layers_columns: Dict[str, np.ndarray], geneid2ix: Dict[str, int])

appears 8 times and each version appears to implement a slightly different “logic” than described in the methods section.

A tutorial showing how to efficiently perform RNA velocity is available on the kallisto|bustools website. There is no excuse not to examine cells in context.


This post is the fourth in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

  1. Near-optimal pre-processing of single-cell RNA-seq
  2. Single-cell RNA-seq for dummies
  3. How to solve an NP-complete problem in linear time
  4. Rotating the knee (plot) and related yoga
  5. High velocity RNA velocity

The “knee plot” is a standard single-cell RNA-seq quality-control plot that is also used to determine a threshold for considering cells valid for analysis in an experiment. To make the plot, cells are ordered on the x-axis according to the number of distinct UMIs observed. The y-axis displays the number of distinct UMIs for each barcode (here barcodes are proxies for cells). The following example is from Aaron Lun’s DropletUtils vignette:


A single-cell RNA-seq knee plot.

High quality barcodes are located on the left hand side of the plot, and thresholding is performed by identifying the “knee” on the curve. On the right hand side, past the inflection point, are barcodes which have relatively low numbers of reads, and are therefore considered to have had failure in capture and to be too noisy for further analysis.
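One simple heuristic for locating the knee (an illustration only; DropletUtils and other packages use more careful procedures) is to take the barcode whose point on the log-log curve lies farthest from the straight line joining the curve’s endpoints:

```python
import numpy as np

def knee_rank(umi_counts):
    # Rank barcodes by distinct-UMI count (descending) and return the
    # 1-based rank of the point farthest from the endpoint-to-endpoint
    # line in log-log space.
    counts = np.sort(np.asarray(umi_counts))[::-1]
    x = np.log10(np.arange(1, len(counts) + 1))
    y = np.log10(counts)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # Perpendicular distance of each point from the line.
    dist = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist)) + 1

# Synthetic data: 100 "real" cells with ~1000 UMIs each and 900 empty
# droplets with ~10, so the knee should sit near rank 100.
rng = np.random.default_rng(0)
counts = np.concatenate([rng.poisson(1000, 100), rng.poisson(10, 900)]) + 1
print(knee_rank(counts))
```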

In Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019, we display a series of plots for a benchmark panel of 20 datasets, and the first plot in each panel (subplot A) is a knee plot. The following example is from an Arabidopsis thaliana dataset (Ryu et al., 2019; SRR8257100):


Careful examination of our plots shows that unlike the typical knee plot made for single-cell RNA-seq, ours has the x- and y-axes transposed. In our plot the x-axis displays the number of distinct UMIs, and the y-axis corresponds to the barcodes, ordered from those with the most UMIs (bottom) to the least (top). The figure below shows both versions of a knee plot for the same data (the “standard” one in blue, our transposed plot in red):


Why bother transposing a plot? 

We begin by observing that if one ranks barcodes according to the number of distinct UMIs associated with them (from highest to lowest), then the rank of a barcode with x distinct UMIs is given by f(x) where

f(x) = |\{c:\# \mbox{UMIs} \in c \geq x\}|.

In other words, the rank of a barcode is interpretable as the size of a certain set. Now suppose that instead of only measurements of RNA molecules in cells, there is another measurement. This could be measurement of surface protein abundances (e.g. CITE-seq or REAP-seq), or measurements of sample tags from a multiplexing technology (e.g. ClickTags). The natural interpretation of #distinct UMIs as the independent variable and the rank of a barcode as the dependent variable is now clearly preferable. We can now define a bivariate function f(x,y) which informs on the number of barcodes with at least x RNA observations and at least y tag observations:

f(x,y) = |\{c:\# \mbox{UMIs} \in c \geq x \mbox{ and } \# \mbox{tags} \in c \geq y\}|.
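Both functions are straightforward to compute from per-barcode tallies. In the sketch below the tallies are toy values for illustration, and f2 stands in for the bivariate f(x,y):

```python
# Toy tallies: barcode -> (#distinct UMIs, #tags).
barcodes = {"bc1": (120, 30), "bc2": (95, 4), "bc3": (80, 25), "bc4": (5, 1)}

def f(x):
    # Rank of a barcode with x distinct UMIs: how many barcodes
    # have at least x UMIs.
    return sum(1 for umis, _ in barcodes.values() if umis >= x)

def f2(x, y):
    # Bivariate version: barcodes with at least x UMIs and at least y tags;
    # the height of the "3D knee plot" surface at (x, y).
    return sum(1 for umis, tags in barcodes.values()
               if umis >= x and tags >= y)

print(f(80))       # 3: bc1, bc2 and bc3 have at least 80 UMIs
print(f2(80, 25))  # 2: of those, only bc1 and bc3 have at least 25 tags
```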

Nadia Volovich, with whom I’ve worked on this, has examined this function for the 8 sample species mixing experiment from Gehring et al. 2018. The function is shown below:



Here the x-axis corresponds to the #UMIs in a barcode, and the y-axis to the number of tags. The z-axis, or height of the surface, is f(x,y) as defined above. Instead of thresholding on either #UMIs or #tags, this “3D knee plot” makes possible thresholding using both (note that the red curve shown above corresponds to one projection of this surface).

Separately from the issue described above, there is another subtle issue with the knee plot. The axis displaying the number of distinct UMIs really ought to display the number of molecules assayed instead. In the notation of Melsted, Booeshaghi et al., 2019 (see also the blog post on single-cell RNA-seq for dummies), what is currently being plotted is |supp(I)| instead of |I|. While |I| cannot be directly measured, it can be inferred (see the Supplementary Note of Melsted, Booeshaghi et al., 2019, where the cardinality of I is denoted by k; see also Grün et al., 2014). If d denotes the number of distinct UMIs for a barcode and n the effective number of UMIs, then k can be estimated by

\hat{k} = \frac{\log(1-\frac{d}{n})}{\log(1-\frac{1}{n})}.

The function estimating k is monotonic, so for the purpose of thresholding with the knee plot it doesn’t matter much whether the correction is applied, but it is worth noting that the correction can be applied without much difficulty.
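A sketch of the estimator follows; the value of n used here, the theoretical maximum for a 10bp UMI, is an assumption for illustration (the effective n is generally smaller):

```python
import math

def estimate_k(d, n):
    # Invert d = n*(1 - (1 - 1/n)**k), the expected number of distinct
    # UMIs when k molecules draw uniformly from n possible UMIs.
    return math.log(1 - d / n) / math.log(1 - 1 / n)

n = 4 ** 10  # 1,048,576 possible 10bp UMIs
for d in (100, 10_000, 500_000):
    print(d, estimate_k(d, n))
# The correction is negligible for small d and grows as d approaches n.
```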





This post is the third in a series of five posts related to the paper “Melsted, Booeshaghi et al., Modular and efficient pre-processing of single-cell RNA-seq, bioRxiv, 2019“. The posts are:

  1. Near-optimal pre-processing of single-cell RNA-seq
  2. Single-cell RNA-seq for dummies
  3. How to solve an NP-complete problem in linear time
  4. Rotating the knee (plot) and related yoga
  5. High velocity RNA velocity

There is a million dollar prize on offer for a solution to the P vs. NP problem, so it’s understandable that one may wonder whether this blog post is an official entry. It is not.

The title for this post was inspired by a talk presented by David Tse at the CGSI 2017 meeting where he explained “How to solve NP-hard assembly problems in linear time“. The gist of the talk was summarized by Tse as follows:

“In computational genomics there’s been a lot of problems where the formulation is combinatorial optimization. Usually they come from some maximum likelihood formulation of some inference problem and those problems end up being mostly NP-hard. And the solution is typically to develop some heuristic way of solving the NP-hard problem. What I’m saying here is that actually there is a different way of approaching such problems. You can look at them from an information point of view.”

Of course thinking about NP-hard problems from an information point of view does not provide polynomial algorithms for them. But what Tse means is that information-theoretic insights can lead to efficient algorithms that squeeze the most out of the available information.

One of the computational genomics areas where an NP-complete formulation for a key problem was recently proposed is in single-cell RNA-seq pre-processing. After RNA molecules are captured from cells, they are amplified by PCR, and it is possible, in principle, to account for the PCR duplicates of the molecules by making use of unique molecular identifiers (UMIs). Since UMIs are (in theory) unique to each captured molecule, but identical among the PCR duplicates of that captured molecule, they can be used to identify and discard the PCR duplicates. In practice distinct captured molecules may share the same UMI causing a collision, so it can be challenging to decide when to “collapse” reads to account for PCR duplicates.
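The scale of the collision problem can be worked out with a simple occupancy (birthday-problem) calculation; the pool size n below is the theoretical maximum for a 10bp UMI, an optimistic assumption since the effective pool is smaller:

```python
def expected_distinct(k, n):
    # Expected number of distinct UMIs observed when k captured molecules
    # each draw a UMI uniformly at random from a pool of n.
    return n * (1 - (1 - 1 / n) ** k)

n = 4 ** 10  # 1,048,576 possible 10bp UMIs
for k in (1_000, 10_000, 100_000):
    d = expected_distinct(k, n)
    print(k, round(k - d))  # molecules hidden by collisions
```

Even at 100,000 molecules per barcode the losses are a few percent, which is why collisions are rare but not impossible.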

In the recent paper Srivastava et al. 2019, the authors developed a combinatorial optimization formulation for collapsing. They introduce the notion of “monochromatic arborescences” on a graph, where these objects correspond to what is, in the language of the previous post, elements of the set C. They explain that the combinatorial optimization formulation of UMI collapsing in this framework is to find a minimum cardinality covering of a certain graph by monochromatic arborescences. The authors then prove the following theorem, by reduction from the dominating set decision problem:

Theorem [Srivastava, Malik, Smith, Sudbery, Patro]: Minimum cardinality covering by monochromatic arborescences is NP-complete.

Following the standard practice David Tse described in his talk, the authors then apply a heuristic to the challenging NP-complete problem. It’s all good except for one small thing. The formulation is based on an assumption, articulated in Srivastava et al. 2019 (boldface and strikethrough is mine):

…gene-level deduplication provides a conservative approach and assumes that it is highly unlikely for molecules that are distinct transcripts of the same gene to be tagged with a similar UMI (within an edit distance of 1 from another UMI from the same gene). However, entirely discarding transcript-level information will mask true UMI collisions to some degree, even when there is direct evidence that similar UMIs must have arisen from distinct transcripts. For example, if similar UMIs appear in transcript-disjoint equivalence classes (even if all of the transcripts labeling both classes belong to the same gene), then they cannot have arisen from the same pre-PCR molecule. Accounting for such cases is especially true [important] when using an error-aware deduplication approach and as sequencing depth increases.

The one small thing? Well… the authors never checked whether the claim at the end, namely that “accounting for such cases is especially important”, is actually true. In our paper “Modular and efficient pre-processing of single-cell RNA-seq” we checked. The result is in our Figure 1d:


Each column in the figure corresponds to a dataset, and the y-axis shows the distribution (over cells) of the proportion of counts one can expect to lose if applying naïve collapsing to a gene. Naïve collapsing here means that two reads with the same UMI are considered to have come from the same molecule. The numbers are so small we had to include an inset in the top right. Basically, it almost never happens that there is “direct evidence that similar UMIs must have arisen from distinct transcripts”. If one does observe such an occurrence, it is almost certainly an artifact of missing annotation. In fact, this leads to an…

💡 Idea: prioritize genes with colliding UMIs for annotation correction. The UMIs directly highlight transcripts that are incomplete. Maybe for a future paper, but returning to the matter at hand…

Crucially, the information analysis shows that there is no point in solving an NP-complete problem in this setting. The naïve algorithm not only suffices, it is sensible to apply it. And the great thing about naïve collapsing is that it’s straightforward to implement and run; the algorithm is linear. The Srivastava et al. question of what is the “minimum number of UMIs, along with their counts, required to explain the set of mapped reads” is a precise, but wrong question. In the words of John Tukey: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” 
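For concreteness, naïve (gene-level) collapsing is a single linear pass over the reads; the record layout below is a simplified assumption for illustration:

```python
from collections import defaultdict

def naive_collapse(reads):
    # reads: (cell_barcode, UMI, gene) triples, one per sequenced read.
    # Reads sharing a barcode, UMI and gene are treated as PCR duplicates
    # of one molecule, so each gene's count in each cell is the number of
    # distinct UMIs observed for it. One pass over the reads: linear time.
    molecules = defaultdict(set)
    for barcode, umi, gene in reads:
        molecules[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in molecules.items()}

reads = [("bc1", "AAT", "geneA"),
         ("bc1", "AAT", "geneA"),  # PCR duplicate, collapsed away
         ("bc1", "CGC", "geneA"),  # distinct UMI, a second molecule
         ("bc2", "AAT", "geneB")]
print(naive_collapse(reads))  # {('bc1', 'geneA'): 2, ('bc2', 'geneB'): 1}
```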

The math behind Figure 1d is elementary but interesting (see the Supplementary Note of our paper). We work with a simple binomial model which we justify based on the data. For related work see Petukhov et al. 2018. One interesting result that came out of our calculations (work done with Sina Booeshaghi) is an estimate for the effective number of UMIs on each bead in a cell. This resulted in Supplementary Figure 1:


The result is encouraging. While the number of UMIs on a bead is not quite 4^L where L is the length of the UMI (theoretical maximum shown by dashed red line for v2 chemistry and solid red line for v3 chemistry), it is nevertheless high. We don’t know whether the variation is a result of batch effect, model mis-specification, or other artifacts; that is an interesting question to explore with more data and analysis.

As for UMI collapsing, the naïve algorithm has been used for almost every experiment to date as it is the method that was implemented in the Cell Ranger software, and subsequently adopted in other software packages. This was done without any consideration of whether it is appropriate. As the Srivastava et al. paper shows, intuition is not to be relied upon, but fortunately, in this case, the naïve approach is the right one.



