Zero-data RNA-seq

The arXiv preprint server has its roots in an “e-print” email list curated by astrophysicist Joanne Cohn, who in 1989 had the idea of organizing the sharing of preprints among her physics colleagues. In her recollections of the dawn of the arXiv, she mentions that “at one point two people put out papers on the list on the same topic within a few days of each other” and that her “impression was that because of the worldwide reach of [her] distribution list, people realized it could be a way to establish precedence for research.” In biology, where many areas are crowded and competitive, the ability to time stamp research before a possibly lengthy journal review and publication process is almost certainly one of the driving forces behind the rapid growth of the bioRxiv (ASAPbio 2016 and Vale & Hyman, 2016).

However the ability to establish priority with preprints is not, in my opinion, what makes them important for science. Rather, the value of preprints is in their ability to accelerate research via the rapid dissemination of methods and discoveries. This point was eloquently made by Stephen Quake, co-president of the Chan Zuckerberg Biohub, at a Caltech Kavli Nanonscience Institute Distinguished Seminar Series talk earlier this year. He demonstrated the impact of preprints and of sharing data prior to journal publication by way of example, noting that posting of the CZ Biohub “Tabula Muris” preprint along with the data directly accelerated two different unrelated projects: Cusanovich et al. 2018 and La Manno et al. 2018. In fact, in the case of La Manno et al. 2018, Quake revealed that one of the corresponding authors of the paper, Sten Linnarsson, had told him that “[he] couldn’t get the paper past the referees without using all of the [Tabula Muris] data”:

Moreover, Quake made clear that the open science principles practiced with the Tabula Muris preprint were not just a one-off experiment, but fundamental Chan Zuckerberg Initiative (CZI) values that are required for all CZI internal research and for publications arising from work the CZI supports: “[the CZI has] taken a pretty aggressive policy about publication… people have to agree to use biorXiv or a preprint server to share results… and the hope is that this is going to accelerate science because you’ll learn about things sooner and be able to work on them”:

Indeed, on its website the CZI lists four values that guide its mission and one of them is “Open Science”:

Open Science
The velocity of science and pace of discovery increase as scientists build on each others’ discoveries. Sharing results, open-source software, experimental methods, and biological resources as early as possible will accelerate progress in every area.

This is a strong and direct rebuttal to Dan Longo and Jeffrey Drazen’s “research parasite” fear mongering in The New England Journal of Medicine.

I was therefore disappointed with the CZI after failing, for the past two months, to obtain the code and data for the preprint “A molecular cell atlas of the human lung from single cell RNA sequencing” by Travaglini, Nabhan et al. (the preprint was posted on the bioRxiv on August 27th 2019). The interesting preprint describes an atlas of 58 cell populations in the human lung, which include 41 of 45 previously characterized cell types or subtypes and the discovery of 14 new ones. Of particular interest to me, in light of some ongoing projects in my lab, is a comparative analysis examining cell type concordance between human and mouse. Travaglini, Nabhan et al. note that 17 molecular types have been gained or lost since the divergence of human and mouse. The results are based on large-scale single-cell RNA-seq (using two technologies) of ~70,000 human and lung peripheral blood cells.

The comparative analysis is detailed in Extended Data Figure S5 (reproduced below), which shows scatter plots of (log) gene counts for homologous human and mouse cell types. For each pair of cell types, a sub-figures also shows the correlation between gene expression and divergent genes are highlighted:

F12.large

I wanted to understand the details behind this figure: how exactly were cell types defined and homologous cell types identified? What was the precise thresholding for “divergent” genes? How were the ln(CPM+1) expression units computed? Some aspects of these questions have answers in the Methods section of the preprint, but I wanted to know exactly; I needed to see the code. For example, the manuscript describes the cluster selection procedure as follows: “Clusters of similar cells were detected using the Louvain method for community detection including only biologically meaningful principle [sic] components (see below)” and looking “below” for the definition of “biologically meaningful” I only found a descriptive explanation illustrated with an example, but with no precise specification provided. I also wanted to explore the data. We have been examining some approaches for cross-species single-cell analysis and this preprint describes an exceptionally useful dataset for this purpose. Thus, access to the software and data used for the preprint would accelerate the research in my lab.

But while the preprint has a sentence with a link to the software (“Code for demultiplexing counts/UMI tables, clustering, annotation, and other downstream analyses are available on GitHub (https://github.com/krasnowlab/HLCA)”) clicking on the link merely sends one to the Github Octocat.

Screen Shot 2019-10-16 at 4.01.58 PM

The Travaglini, Nabhan et al. Github repository that is supposed to contain the analysis code is nowhere to be found. The data is also not available in any form. The preprint states that “Raw sequencing data, alignments, counts/UMI tables, and cellular metadata are available on GEO (accession GEOXX),” The only data a search for GEOXX turns up is a list of prices on a shoe website.

I wrote to the authors of Travaglini, Nabhan et al. right after their preprint appeared noting the absence of code and data and asking for both. I was told by one of the first co-authors that they were in the midst of uploading the materials, but that the decision of whether to share them would have to be made by the corresponding authors. Almost two months later, after repeated requests, I have yet to receive anything. My initial excitement for the Travaglini, Nabhan et al. single-cell RNA-seq has turned into disappointment at their zero-data RNA-seq.

🦗 🦗 🦗 🦗 🦗

This state of affairs, namely the posting of bioRxiv preprints without data or code, is far too commonplace. I was first struck with the extent of the problem last year when the Gupta, Collier et al. 2018 preprint was posted without a Methods section (let alone with data or code). Also problematic was that the preprint was posted just three months before publication while the journal submission was under review. I say problematic because not sharing code, not sharing software, not sharing methods, and not posting the preprint at the time of submission to a journal does not accelerate progress in science (see the CZI Open Science values statement above).

The Gupta, Collier et al. preprint was not a CZI related preprint but the Travaglini, Nabhan et al. preprint is. Specifically, Travaglini, Nabhan et al. 2019 is a collaboration between CZ Biohub and Stanford University researchers, and the preprint appears on the Chan Zuckerberg Biohub bioRxiv channel:

Screen Shot 2019-10-18 at 10.57.52 PM

The Travaglini, Nabhan et al. 2019 preprint is also not an isolated example; another recent CZ Biohub preprint from the same lab, Horns et al. 2019, states explicitly that “Sequence data, preprocessed data, and code will be made freely available [only] at the time of [journal] publication.” These are cases where instead of putting its money where its mouth is, the mouth took the money, ate it, and spat out a 404 error.

angry meryl streep GIF

To be fair, sharing data, software and methods is difficult. Human data must sometimes be protected due to confidentiality constraints, thus requiring controlled access with firewalls such as dbGaP that can be difficult to set up. Even with unrestricted data, sharing can be cumbersome. For example, the SRA upload process is notoriously difficult to manage, and the lack of metadata standards can make organizing experimental data, especially sequencing data, complicated and time consuming. The sharing of experimental protocols can be challenging when they are in flux and still being optimized while work is being finalized. And when it comes to software, ensuring reproducibility and usability can take months of work in the form of wrangling Snakemake and other workflows, not to mention the writing of documentation. Practicing Open Science, I mean really doing it, is difficult work. There is a lot more to it than just dumping an advertisement on the bioRxiv to collect a timestamp. By not sharing their data or software, preprints such as Travaglini, Nabhan et al. 2019 and Horns et al. 2019 appear to be little more than a cynical attempt to claim priority.

It would be great if the CZI, an initiative backed by billions of dollars with hundreds of employees, would truly champion Open Science. The Tabula Muris preprint is a great example of how preprints that are released with data and software can accelerate progress in science. But Tabula Muris seems to be an exception for CZ Biohub researchers rather than the rule, and actions speak louder than a website with a statement about Open Science values.

8 comments

Comments feed for this article

October 21, 2019 at 6:03 am

Shaun Mahony

I’m reminded of another strategy for establishing scientific precedence, and it’s flaws: https://wallifaction.tumblr.com/post/116394790067/the-discovery-of-titan-huygens-cipher-and

October 21, 2019 at 10:12 am

Here is another scRNAseq dataset on which Quake is author and holds patents resulting. On biorxiv last 5 months, unavailable until it is through the review process according to a senior author: https://www.biorxiv.org/content/10.1101/350538v2.article-info.

Frustrating as this would be incredibly useful in our work which is tangential at best to the preprint.

October 21, 2019 at 10:31 am

Andrew

But this is pretty typical with efforts such as CZI/HCA. Everyone has equal access to the data, but some people (insiders) are more equal than others.

October 21, 2019 at 11:39 am

James

I want to push back a little against the idea that these papers define “the rule” and the data-sharing in the TM paper was the exception.

You linked the Biohub biorxiv channel–how many of those preprints include data? It looks to me like one or two Biohub-affiliated PIs are the exception.

October 21, 2019 at 11:56 am

Lior Pachter

You’re right that I did not systematically check every preprint (it would be great if CZI did that!), so I can’t say that more than half are not sharing. However I spot checked several single-cell genomics preprints and found missing data in most of them. Even when I found data, in some cases it was in disarray without the needed software to reproduce the results. More to the point, when one of the co-presidents of CZ Biohub has multiple recent preprints not sharing software or data yet waxes lyrical about open science and sharing data then I think it’s fair to say that CZI cannot represent Open Science as a rule.

October 21, 2019 at 2:41 pm

michaelmhoffman

If institutional or funder mandates lead to people posting mockeries of preprints (such as manuscripts with withheld methods sections or false descriptions of data availability), then I feel like we’d be better off without the mandates.

November 18, 2019 at 3:32 pm

Anon

Hey Lior,

Thanks for this post! I too worry about preprints being posted on biorXiv to gain priority, without adequate methods/code/data.

However, whenever I start thinking about what changes I would propose to the bioRxiv screening process, I struggle to come up with something that would improve quality of submissions, without ruining the speed of preprint approvals!

Ideally, someone would read over the methods to check that they are “sufficient” (however you choose to define this), code is bug-free and runs, and data is posted on some public database and is complete!

But all of this starts to sound a lot like standard peer review, and I’m not sure biorXiv even has the band-width to “check” all of these boxes. I suppose you could require something like a GEO link for data, maybe a Github link for code? With respect to a methods section, I can’t think of a procedure that’s efficient and effective. Word count? A cursory review (what criteria would you use to “green-light” a paper?)

I’d be interested to hear your thoughts. Do you have any suggestions of what type of checks could be implemented, that would improve submission quality, without leading to long delays in preprint approvals?

April 7, 2021 at 10:58 am

anon

So just to recap: arXiv was created to establish scientific priority. YOU think it should be used to accelerate scientific discovery. So shame on these researchers who are only using it to establish scientific priority. Don’t they know that YOU think it should be used for something else? 🙂

That being said, I do agree that in a perfect world scientists would share all materials on biorRxiv. However, the solution is not to call out academics like Steve Quake. Instead, we need to fix the incentive structures that are destroying most academic research.

	Blog do Raphael Winc… on The network nonsense of Albert…
	Camelia on All of Us failed
	jeffrey on Yuval Peres
	Michael Rorer on A note on “How the Gaza…
	flyingmonkey on A note on “How the Gaza…
	Wes J on A note on “How the Gaza…
	David McQuillan on A note on “How the Gaza…
	lewi on A note on “How the Gaza…
	David McQuillan on A note on “How the Gaza…
	Izzy on A note on “How the Gaza…

Zero-data RNA-seq

Recent Comments

Top Posts & Pages

Recent posts

Archives

Biology

Computational Biology

Computer Science

Ideas

Math

Medicine

Statistics

Blog Stats

8 comments

Leave a comment Cancel reply

Zero-data RNA-seq

Share this:

Related

Recent Comments

Top Posts & Pages

Recent posts

Archives

Biology

Computational Biology

Computer Science

Ideas

Math

Medicine

Statistics

Blog Stats

8 comments

Leave a comment Cancel reply