Using statistical methods to estimate and take into account experimental measurement errors: a case study using high throughput proteomics data

We published a paper titled “System wide analyses have underestimated protein abundances and the importance of transcription in mammals” in PeerJ on Feb 27, 2014 (https://peerj.com/articles/270/). In the paper we use statistical methods to reanalyze the data of several proteomics papers and assess the relative importance of each step in gene expression in determining the variance in protein amounts expressed by each gene. Historically, transcription was viewed as the dominant step. More recently, though, system wide analyses have claimed that translation plays the dominant role and that differences in mRNA expression between genes explain only 10-40% of the differences in protein levels. We find that when measurement errors in mRNA and protein abundance data are taken into account, transcription again appears to be the dominant step.

Our study was initially motivated by our observation that the system wide label-free mass spectrometry data for 61 housekeeping proteins in Schwanhäusser et al. (2011) give lower expression estimates than the corresponding individual protein measurements based on SILAC mass spectrometry or western blot data. The underestimation bias is especially obvious for proteins with expression levels lower than 10⁶ molecules per cell. We therefore corrected this nonlinear bias to determine how more accurately scaled data affect the relationship between protein and mRNA abundance. We found that a two-part spline model fits the data for the 61 housekeeping proteins well, and we applied this model to correct the system-wide protein abundance estimates in Schwanhäusser et al. (2011). After this correction, our protein abundance estimates show a significantly higher correlation with mRNA abundances than do the uncorrected protein data.
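To make the idea of a two-part correction concrete, here is a minimal sketch. It is not the fitting procedure used in the paper. It assumes a continuous two-segment linear model in log10 space with a fixed, hypothetical knot at 10⁶ molecules per cell, and it uses synthetic numbers rather than the 61 housekeeping measurements. The function names, the knot, and the coefficients are all illustrative assumptions.

```python
import numpy as np


def fit_two_part_linear(x, y, knot):
    """Least-squares fit of y = a + b*x + c*max(0, x - knot).

    The hinge term lets the slope change at the knot while keeping
    the fitted curve continuous there.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - knot)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def apply_two_part_linear(x, coef, knot):
    """Map raw log10 abundances onto the reference scale."""
    x = np.asarray(x, dtype=float)
    a, b, c = coef
    return a + b * x + c * np.maximum(0.0, x - knot)


# Synthetic stand-in data (NOT from the paper): log10 label-free estimates
# and matching log10 reference values for 61 hypothetical proteins.
knot = 6.0  # assumed breakpoint: 10^6 molecules per cell
log_label_free = np.linspace(3.0, 8.0, 61)
# Below the knot the label-free values underestimate the reference;
# above the knot the two scales agree.
log_reference = 1.8 + 0.7 * log_label_free + 0.3 * np.maximum(0.0, log_label_free - knot)

coef = fit_two_part_linear(log_label_free, log_reference, knot)
corrected = apply_two_part_linear(log_label_free, coef, knot)
```

With real, noisy reference measurements the knot would also have to be chosen from the data, and the fitted curve would then be applied to every protein in the system-wide data set.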

We then investigated whether other sources of experimental error could further explain the relatively poor correlation between protein and mRNA levels. We employed two strategies, both of which use Analysis of Variance (ANOVA) to determine the percentage of the variation in measured protein expression levels that is due to each of four steps (transcription, mRNA degradation, translation, and protein degradation) and to estimate the measurement error in each step. ANOVA is a classic statistical method developed by R. A. Fisher in the 1920s. Although it is a well-regarded and standard approach in some fields, its usefulness has not been widely appreciated in genomics and proteomics.

In our first strategy, we estimated the variances of the errors in mRNA and protein abundances using direct experimental measurements provided by control experiments in the Schwanhäusser et al. paper. Plugging these variances into ANOVA, we found that mRNA levels explain at least 56% of the differences in protein abundance for the 4,212 genes detected by Schwanhäusser et al. (2011). However, because one major source of error, the systematic error of protein measurements, could not be estimated, the true percent contribution of mRNA to protein expression should be higher.

We also employed a second, independent strategy to determine the contribution of mRNA levels to protein expression. We show that the variance in translation rates directly measured by ribosome profiling is only 12% of that inferred by Schwanhäusser et al. (2011), and that the measured and inferred translation rates correlate poorly. Based on this, our second strategy suggests that mRNA levels explain ∼81% of the variance in protein levels. While the magnitudes of our two estimates differ, both suggest that transcription plays a more important role than the earlier studies implied, and translation a much smaller one.
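The effect of taking error variances into account can be illustrated with a small sketch. It assumes that measurement errors are independent of the true log abundances and of each other, so that they inflate the measured variances but leave the mRNA-protein covariance unchanged. The calculation is the standard correction for attenuation, which may differ in detail from the analysis in the paper. The function name and all numbers are hypothetical and are not taken from the paper.

```python
def error_corrected_r2(cov_mp, var_m, var_p, err_var_m, err_var_p):
    """Fraction of true protein variance explained by true mRNA levels.

    cov_mp    : covariance of measured log mRNA and log protein levels
    var_m     : variance of measured log mRNA levels
    var_p     : variance of measured log protein levels
    err_var_m : estimated variance of the mRNA measurement error
    err_var_p : estimated variance of the protein measurement error
    """
    # Subtract the error variances to estimate the variances of the
    # underlying true abundances.
    true_var_m = var_m - err_var_m
    true_var_p = var_p - err_var_p
    if true_var_m <= 0 or true_var_p <= 0:
        raise ValueError("error variance must be smaller than measured variance")
    return cov_mp ** 2 / (true_var_m * true_var_p)


# Hypothetical inputs, chosen only to show the direction of the effect.
naive_r2 = 0.6 ** 2 / (1.0 * 1.0)  # ignores measurement error
corrected_r2 = error_corrected_r2(0.6, 1.0, 1.0, 0.2, 0.25)
```

In this toy case the uncorrected value is 0.36 and the corrected value is 0.6. Any error source left out of `err_var_p`, such as an unestimated systematic error, would push the corrected value higher still, which is why an estimate of this kind is a lower bound.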

Finally, we noted that all of the published estimates, as well as ours given above, apply only to those genes whose mRNA and protein expression was detected. Based on a detailed analysis by Hebenstreit et al. (2012), we estimate that approximately 40% of genes in a given cell within a population express no mRNA. Since there can be no translation in the absence of mRNA, we argue that differences in translation rates can play no role in determining the expression levels of the ∼40% of genes that are not expressed.

5 comments

March 1, 2014 at 4:37 pm

deboramarks

I am wondering what the implications of this finding are for the size of microRNA effects on protein dose. After all the microRNA hype over the past 10 years or so, we are still left with the open question of why many microRNAs and target sites are so well conserved, over gazillions of years, yet seem to alter protein dose only by very small amounts. (Orchestration of many additive small changes across sets of different genes and noise reduction are two favorite hypotheses to explain this, but neither has been shown.) So I wonder if the framework of the analysis from this blogged paper would be useful to re-examine the microRNA mass spec data (Selbach 2008 & Baek 2008 Nature) to learn something new.

March 1, 2014 at 6:41 pm

Mark Biggin

An interesting comment. It is likely that the degree of error in the papers studying changes in protein and mRNA expression with and without microRNAs is smaller than when straight mRNA and protein abundance data are compared. This is because, by using the change in abundance, a large proportion of the systematic biases in the abundance measurements is removed. (change = abundance condition 1 / abundance condition 2).
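The cancellation described here can be sketched as follows, assuming a purely multiplicative, protein-specific bias that is identical in both conditions. All values are made up for illustration.

```python
import numpy as np

# Hypothetical true abundances (molecules per cell) in two conditions.
true_cond1 = np.array([1e4, 5e5, 2e6, 8e6])
true_cond2 = np.array([2e4, 5e5, 1e6, 4e6])

# Assumed systematic bias: one multiplicative factor per protein,
# the same in both conditions.
bias = np.array([0.1, 0.5, 0.9, 1.2])

measured_cond1 = bias * true_cond1
measured_cond2 = bias * true_cond2

# change = abundance condition 1 / abundance condition 2
measured_change = measured_cond1 / measured_cond2
true_change = true_cond1 / true_cond2
```

The measured abundances are wrong by the bias factor, but the measured change equals the true change. Stochastic, replicate-to-replicate error does not cancel in this way.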

That said, if replicate data are available, the stochastic error could be estimated and ANOVA then used to determine the higher correlation expected between the true change in protein abundance and the true change in RNA abundance.

It would also be interesting to have direct measurements of translation rates from ribosome profiling for the same conditions.

mark biggin

March 1, 2014 at 7:53 pm

Yep. So because it’s a ratio that’s measured, it’s kind of covered.

March 3, 2014 at 12:40 am

Ioannis Vlachos

I also agree that microRNA action is a factor that can be incorporated, as well as alternative splicing or RNA localization.

Alternatively spliced non-coding RNAs (originating from protein coding loci) are counted as legitimate mRNA expression in most RNA-Seq pipelines.

Furthermore, most RNA-Seq studies utilize RNA extracted from whole cell samples. I think that a correlation analysis performed using RNA extracted solely from the cytosol could also be of interest.

March 4, 2014 at 11:50 am

Thank you, I had not considered these additional sources of RNA error. There are several other errors that we did not discuss in the paper and which, like your suggestions, would be in addition to our estimate of RNA error from the Nanostring control experiment.

i. Short mRNAs could be underrepresented due to the size selection step prior to amplification. While no mRNAs are likely to be smaller than the length cutoff, short fragments from the ends of the mRNA will make up a higher proportion of the length for short genes.

ii. cDNA conversion could cause some bias.

iii. I have seen a talk from David Weinberg in David Bartel’s lab suggesting that poly A selection may be more effective for some genes than for others. At the extreme, the major histones have no poly A tail and thus are extreme outliers in the number of proteins estimated to be produced per mRNA, because most of their RNAs are not in the selected poly A pool.

iv. Any proteins that are not quantitatively solubilized when the protein extract is made would also have biased abundance estimates. This bias would not be captured by the spiked-in control, because the loss occurs before the spike-in is added.

The above may turn out to be small or negligible errors, but until we have controls for them, we cannot be sure.
