Today I learned a cute Fibonacci fact from my friend Federico Ardila:

1/89 = 0.011235...
1/9899 = 0.0001010203050813213455...
1/998999 = 0.000001001002003005008013021034055089144233377610...

$\vdots$
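The expansions above are easy to check with exact integer arithmetic. The sketch below is only an illustration: the helper names `decimal_digits` and `fibonacci_blocks` are made up here, and the observation that the denominators have the form $10^{2w}-10^w-1$ for block width $w$ is read off from the three examples. The match holds only for the leading blocks; once a Fibonacci number has more than $w$ digits, carries spill into the neighboring block.

```python
def decimal_digits(numerator, denominator, count):
    """First `count` digits after the decimal point, by long division."""
    digits = []
    remainder = numerator % denominator
    for _ in range(count):
        remainder *= 10
        digits.append(str(remainder // denominator))
        remainder %= denominator
    return "".join(digits)


def fibonacci_blocks(width, blocks):
    """Concatenate F_0, F_1, ..., F_{blocks-1}, each zero-padded to `width` digits."""
    out, a, b = [], 0, 1
    for _ in range(blocks):
        out.append(str(a).zfill(width))
        a, b = b, a + b
    return "".join(out)


# (width, number of leading blocks that are free of carries)
for width, blocks in [(1, 6), (2, 11), (3, 16)]:
    denominator = 10 ** (2 * width) - 10 ** width - 1  # 89, 9899, 998999
    expansion = decimal_digits(1, denominator, width * blocks)
    print(f"1/{denominator} = 0.{expansion}...")
    assert expansion == fibonacci_blocks(width, blocks)
```
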

The pattern is explained in a note in the College Mathematics Journal (2003). Of course, Fibonacci numbers are ubiquitous in biology, but in thinking about this pattern I was reminded of a lesser-known connection between Fibonacci numbers and biology, in the context of the combinatorics of splicing:

A stretch of DNA sequence with $n$ acceptor sites and $m$ donor sites can produce at most $F_{n+m+1}$ distinct spliced transcripts, where the numbers $F_i$ are the Fibonacci numbers.

The derivation is straightforward using induction. To simplify notation, we denote acceptor sites with an open parenthesis “(” and donor sites with a closed parenthesis “)”. We use the notation $|S|$ for the length of a string $S$ of open and closed parentheses, and denote the maximum number of transcripts that can be spliced from $S$ by $p(S)$. In this notation the claim is $p(S) \leq F_{|S|+1}$; for the rest of the argument, $n$ denotes the total length $|S|$ rather than the number of acceptor sites. We assume by induction that the theorem is true for $|S| \leq n-1$ (the base case is trivial). Let $S$ be a string with $|S|=n$. Observe that $S$ must have an open parenthesis somewhere that is followed immediately by a closed parenthesis; otherwise $p(S)=1$ (the empty string is considered to be a valid transcript) and there is nothing to prove. We therefore have $S= S_1()S_2$ where $S_1$ has $k$ open and $r$ closed parentheses, and $S_2$ has $n-k-r-s-2$ open and $s$ closed parentheses. Now notice that

$p(S) \leq F_{r+k+1}F_{n-k-r}+F_{r+k+2}F_{n-k-r-1}+F_{n-1}-F_{r+k+1}F_{n-k-r-1}$.

This can be seen by breaking down the terms as follows: One can take any transcript in $S_1$ and append to it a transcript in “$(S_2$”; by induction there are at most $F_{r+k+1}F_{n-k-r}$ of these. Similarly, one can take a transcript in “$S_1)$” and append to it a transcript in $S_2$, which gives at most $F_{r+k+2}F_{n-k-r-1}$ more. Transcripts omitting the interior pair () between $S_1$ and $S_2$ are counted twice, which is fine because one of the copies corresponds to all transcripts that include the interior pair () between $S_1$ and $S_2$. The last two terms account for all transcripts whose last element in $S_1$ is an open parenthesis, and whose first element in $S_2$ is a closed parenthesis. This is counted by considering all transcripts in $S_1S_2$ and subtracting transcripts that do not include a parenthesis from each. Finally, using the Fibonacci recurrence and the identity

$F_{n+m} = F_{n+1}F_m + F_nF_{m-1}$

we have

$p(S) \leq F_n + F_{r+k+1}F_{n-k-r-1}+F_{n-1}-F_{r+k+1}F_{n-k-r-1} = F_{n+1}$.
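The algebra in this last step can be sanity-checked numerically. The snippet below is a minimal sketch, assuming the indexing $F_0=0$, $F_1=F_2=1$ and writing $a=r+k+1$ for the length of $S_1$ plus one; the function names are invented for illustration. It checks the addition identity and that the right-hand side of the first inequality always collapses to $F_{n+1}$.

```python
def fib(n):
    """Fibonacci numbers with F_0 = 0, F_1 = 1, F_2 = 1, ..."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def addition_identity_holds(n, m):
    """F_{n+m} = F_{n+1} F_m + F_n F_{m-1}, for m >= 1."""
    return fib(n + m) == fib(n + 1) * fib(m) + fib(n) * fib(m - 1)


def right_hand_side(n, a):
    """The four-term bound on p(S), with a = r + k + 1 and 1 <= a <= n - 1."""
    b = n - a  # equals n - k - r - 1
    return fib(a) * fib(b + 1) + fib(a + 1) * fib(b) + fib(n - 1) - fib(a) * fib(b)


for n in range(2, 40):
    for a in range(1, n):
        assert right_hand_side(n, a) == fib(n + 1)

for n in range(0, 20):
    for m in range(1, 20):
        assert addition_identity_holds(n, m)
```
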

The bound is attained for certain configurations, such as $S = ()() \cdots ()$, in which acceptor and donor sites alternate.
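For short strings the bound can also be checked by brute force. The sketch below rests on one assumption about the model that should be read as my interpretation: a transcript is any choice of sites that alternates acceptor, donor, acceptor, donor, and so on (a chain of exons), with the empty choice allowed. The function name `count_transcripts` is hypothetical. Under that reading, a simple two-counter scan counts the transcripts of a given string.

```python
from itertools import product


def fib(n):
    """Fibonacci numbers with F_0 = 0, F_1 = 1, F_2 = 1, ..."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def count_transcripts(sites):
    """Count subsequences of `sites` of the form ()()...(), including the empty one.

    Assumption: '(' is an acceptor site, ')' is a donor site, and each exon
    runs from a chosen acceptor to a later chosen donor.
    """
    complete = 1  # choices that are empty or end in ')'
    dangling = 0  # choices that end in an unmatched '('
    for site in sites:
        if site == "(":
            dangling += complete  # open a new exon after any complete choice
        elif site == ")":
            complete += dangling  # close the currently open exon here
        else:
            raise ValueError(f"unexpected site {site!r}")
    return complete


# Every string of length n satisfies p(S) <= F_{n+1} ...
for n in range(0, 13):
    for sites in product("()", repeat=n):
        assert count_transcripts(sites) <= fib(n + 1)

# ... and the alternating string ()()...() attains the bound.
for pairs in range(0, 11):
    assert count_transcripts("()" * pairs) == fib(2 * pairs + 1)
```

For example, `count_transcripts("()()")` counts the empty transcript, the two single exons, the long exon from the first acceptor to the last donor, and the two-exon transcript, for a total of $5 = F_5$.
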

The combinatorics is elementary and it only establishes what is already intuitive and obvious: splicing combinatorics dictates that there are a lot of transcripts (exponentially many in the number of acceptor and donor sites) that can, in principle, be spliced together, even from short DNA sequences. The question, then, is why do most genes have so few isoforms? Or do they?