You are currently browsing the tag archive for the ‘transcription’ tag.

Jingyi Jessica Li and Mark D. Biggin

We published a paper titled “System wide analyses have underestimated protein abundances and the importance of transcription in mammals” in PeerJ on Feb 27, 2014 (https://peerj.com/articles/270/). In our paper we use statistical methods to reanalyze the data of several proteomics papers to assess the relative importance that each step in gene expression plays in determining the variance in protein amounts expressed by each gene. Historically transcription was viewed as the dominant step. More recently, though, system wide analyses have claimed that translation plays the dominant role and that differences in mRNA expression between genes explain only 10-40% of the differences in protein levels. We find that when measurement errors in mRNA and protein abundance data is taken into account, transcription again appears to be the dominant step.

Our study was initially motivated by our observation that the system wide label-free mass spectrometry data of 61 housekeeping proteins in Schwanhäusser et al (2011) have lower expression estimates than their corresponding individual protein measurements based on SILAC mass spectrometry or western blot data. The underestimation bias is especially obvious for proteins with expression levels lower than 106 molecules per cell. We therefore corrected this non linear bias to determine how more accurately scaled data impacts the relationship between protein and mRNA abundance data. We found that a two-part spline model fits well on the 61 housing keep protein data and applied this model to correct the system-wide protein abundance estimates in Schwanhäusser et al (2011). After this correction, our corrected protein abundance estimates show a significantly higher correlation with mRNA abundances than do the uncorrected protein data.

We then investigated if other sources of experimental error could further explain the relatively poor correlation between protein and mRNA levels. We employed two strategies that both use Analysis of Variance (ANOVA) to determine the percent of the variation in measured protein expression levels that is due to each of the four steps: transcription, mRNA degradation, translation, and protein degradation, as well as estimating the measurement errors in each step. ANOVA is a classic statistical method developed by RA Fisher in the 1920s. Despite the fact that this is a well-regarded and standard approach in some fields, its usefulness has not been widely appreciated in genomics and proteomics. In our first strategy, we estimated the variances of errors in mRNA and protein abundances using direct experimental measurements provided by control experiments in the Schwanhäusser et al. paper. Plugging these variances into ANOVA, we found that mRNA levels explain at least 56% of the differences in protein abundance for the 4,212 genes detected by Schwänhausser et al (2011). However,  because one major source of error—systematic error of protein measurements—could not be estimated, the true percent contribution of mRNA to protein expression should be higher. We also employed a second, independent strategy to determine the contribution of mRNA levels to protein expression. We show that the variance in translation rates directly measured by ribosome profiling is only 12% of that inferred by Schwanhäusser et al (2011), and that the measured and inferred translation rates correlate poorly. Based on this, our second strategy suggests that mRNA levels explain ∼81% of the variance in protein levels. While the magnitudes of our two estimates vary, they both suggest that transcription plays a more important role than the earlier studies implied and translation a much smaller role.

Finally, we noted that all of the published estimates, as welll as ours given above, only apply to those genes whose mRNA and protein expression was detected. Based on a detailed analysis by Hebenstreit et al. (2012), we estimate that approximately 40% of genes in a given cell within a population express no mRNA. Since there can be no translation in the absence of mRNA, we argue that differences in translation rates can play no role in determining the expression levels for the ∼40% of genes that are non-expressed.

I  learned today about a cute Fibonacci fact from my friend Federico Ardila:

1/89 = 0.011235
1/9899 = 0.0001010203050813213455
1/998999 = 0.000001001002003005008013021034055089144233377610...

$\vdots$

The pattern is explained in a note in the College Mathematics Journal (2003). Of course Fibonacci numbers are ubiquitous in biology, but in thinking about this pattern I was reminded of a lesser known connection between Fibonacci numbers and biology in the context of the combinatorics of splicing:

A stretch of DNA sequence with $n \geq 1$ acceptor sites and $m \geq 1$ donor sites can produce at most $F_{n+m+1}$ distinct spliced transcripts, where the numbers $F_i$ are the Fibonacci numbers.

The derivation is straightforward using induction: to simplify notation we denote acceptor sites with an open parenthesis “(” and donor sites with a closed parenthesis “)”. We  use the notation $|S|$ for the length of a string S of open and closed parentheses, and denote the maximum number of transcripts that can be spliced from S by $p(S)$. We assume that the theorem is true by induction for $|S| \leq n-1$ (the base case is trivial). Let S be a string with $|S|=n$. Observe that S must have an open parenthesis somewhere that is followed immediately be a closed parentheses. Otherwise we have that $p(S)=1$ (the empty string is considered to be a valid transcript). We therefore have $S= S_1()S_2$ where $S_1$ has open and r closed parentheses respectively, and $S_2$ has $n-k-r-s-2$ open and s closed parentheses respectively. Now notice that

$p(S) \leq F_{r+k+1}F_{n-k-r}+F_{r+k+2}F_{n-k-r-1}+F_{n-1}-F_{r+k+1}F_{n-k-r-1}$.

This can be seen by breaking down the terms as follows: One can take any transcript in $S_1$ and append to it a transcript in “$(S_2$“. Similarly, one can take a transcript in $S_1)$ and append to it a transcript in $S_2$. Transcripts ommitting the interior pair () between $S_1$ and $S_2$ are counted twice, which is fine because one of the copies corresponds to all transcripts that include the interior pair () between $S_1$ and $S_2$. The last two terms account for all transcripts whose last element in $S_1$ is an open parenthesis, and whose first element in $S_2$ is a closed parenthesis. This is counted by considering all transcripts in $S_1S_2$ and subtracting transcripts that do not include a parentheses from each. Finally, using the Fibonacci recurrence and the identity

$F_{n+m} = F_{n+1}F_m + F_nF_{m-1}$

we have

$p(S) \leq F_n + F_{r+k+1}F_{n-k-r-1}+F_{n-1}-F_{r+k+1}F_{n-k-r-1} = F_{n+1}$.

The bound is attained for certain configurations, such as $S = ()() \cdots ()$ with acceptor and donor sites.

The combinatorics is elementary and it only establishes what is already intuitive and obvious: splicing combinatorics dictates that there are a lot of transcripts (exponentially many in the number of acceptor and donor sites) that can, in principle, be spliced together, even from short DNA sequences. The question, then, is why do most genes have so few isoforms? Or do they?