I learned today about a cute Fibonacci fact from my friend Federico Ardila:

1/89 = 0.011235...
1/9899 = 0.0001010203050813213455...
1/998999 = 0.000001001002003005008013021034055089144233377610...

\vdots
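
Here is a quick numerical check of the pattern, as a sketch. The denominators 89, 9899 and 998999 all have the form 10^{2w} - 10^w - 1 with w = 1, 2, 3, and reading the decimal expansion in blocks of w digits gives back the Fibonacci numbers (starting from F_0 = 0) until a carry from a later, longer Fibonacci number spoils a block. The function name `reciprocal_blocks` is my own choice for illustration.

```python
def reciprocal_blocks(width, count):
    """Return the first `count` blocks of `width` decimal digits of
    1 / (10**(2*width) - 10**width - 1), as integers."""
    denom = 10 ** (2 * width) - 10 ** width - 1
    # Exact integer arithmetic: the floor of 10**(width*count) / denom holds
    # the first width*count digits after the decimal point (truncated).
    digits = str(10 ** (width * count) // denom).zfill(width * count)
    return [int(digits[i:i + width]) for i in range(0, width * count, width)]


print(reciprocal_blocks(1, 6))   # blocks of 1/89
print(reciprocal_blocks(2, 11))  # blocks of 1/9899
print(reciprocal_blocks(3, 16))  # blocks of 1/998999
```

The block counts 6, 11 and 16 are chosen so that every block shown is still free of carries; asking for more blocks shows where the pattern starts to blur.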

The pattern is explained in a note in the College Mathematics Journal (2003). Of course, Fibonacci numbers are ubiquitous in biology, but in thinking about this pattern I was reminded of a lesser-known connection between Fibonacci numbers and biology in the context of the combinatorics of splicing:

A stretch of DNA sequence with n acceptor sites and m donor sites can produce at most F_{n+m+1} distinct spliced transcripts, where the numbers F_i are the Fibonacci numbers.

The derivation is straightforward using induction. To simplify notation we denote acceptor sites with an open parenthesis “(” and donor sites with a closed parenthesis “)”. We use the notation |S| for the length of a string S of open and closed parentheses, and denote the maximum number of transcripts that can be spliced from S by p(S), so that the claim reads p(S) \leq F_{|S|+1}. We assume by induction that the theorem is true for |S| \leq n-1 (the base case is trivial). Let S be a string with |S|=n. Observe that S must have an open parenthesis somewhere that is followed immediately by a closed parenthesis; otherwise p(S)=1 (the empty string is considered to be a valid transcript). We therefore have S = S_1()S_2, where S_1 has k open and r closed parentheses, and S_2 has n-k-r-s-2 open and s closed parentheses. Now notice that

p(S) \leq F_{r+k+1}F_{n-k-r}+F_{r+k+2}F_{n-k-r-1}+F_{n-1}-F_{r+k+1}F_{n-k-r-1}.

This can be seen by breaking down the terms as follows. One can take any transcript in S_1 and append to it a transcript in “(S_2”. Similarly, one can take a transcript in “S_1)” and append to it a transcript in S_2. Transcripts omitting the interior pair () between S_1 and S_2 are counted twice, which is fine because one of the copies corresponds to all transcripts that include the interior pair () between S_1 and S_2. The last two terms account for all transcripts whose last element in S_1 is an open parenthesis, and whose first element in S_2 is a closed parenthesis. This is counted by considering all transcripts in S_1S_2 and subtracting transcripts that do not include a parenthesis from each. Finally, using the Fibonacci recurrence and the identity

F_{n+m} = F_{n+1}F_m + F_nF_{m-1}

we have

p(S) \leq F_n + F_{r+k+1}F_{n-k-r-1}+F_{n-1}-F_{r+k+1}F_{n-k-r-1} = F_{n+1}.
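
As a sanity check on the algebra, here is a minimal sketch that verifies numerically both the identity and the collapse of the right-hand side of the bound to F_{n+1}. It assumes the indexing F_0 = 0, F_1 = F_2 = 1 and writes t for r+k, the length of S_1; the helper names are mine.

```python
def fib(i):
    """Return F_i with F_0 = 0, F_1 = F_2 = 1 (assumed indexing)."""
    a, b = 0, 1
    for _ in range(i):
        a, b = b, a + b
    return a


def identity_holds(n, m):
    """Check F_{n+m} = F_{n+1} F_m + F_n F_{m-1} for m >= 1."""
    return fib(n + m) == fib(n + 1) * fib(m) + fib(n) * fib(m - 1)


def bound_rhs(n, t):
    """Right-hand side of the inequality for p(S), with t = r + k = |S_1|."""
    return (fib(t + 1) * fib(n - t)
            + fib(t + 2) * fib(n - t - 1)
            + fib(n - 1)
            - fib(t + 1) * fib(n - t - 1))


# The identity for small indices, and the bound for every split 0 <= t <= n - 2.
assert all(identity_holds(n, m) for n in range(20) for m in range(1, 20))
assert all(bound_rhs(n, t) == fib(n + 1)
           for n in range(2, 30) for t in range(n - 1))
```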

The bound is attained for certain configurations, such as S = ()() \cdots () with equal numbers of acceptor and donor sites.
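
The bound can also be checked by brute force on short strings. The sketch below rests on a modeling assumption of mine that the argument above does not spell out: a transcript is a choice of sites that alternates acceptor, donor, acceptor, donor, ..., starting with “(” and ending with “)”, with the empty choice allowed. The names `count_transcripts` and `bound_holds` are hypothetical.

```python
from itertools import product


def fib(i):
    """Return F_i with F_0 = 0, F_1 = F_2 = 1 (assumed indexing)."""
    a, b = 0, 1
    for _ in range(i):
        a, b = b, a + b
    return a


def count_transcripts(sites):
    """Count alternating '(' ')' subsequences of `sites` that start with '('
    and end with ')', including the empty one (assumed transcript model)."""
    closed, opened = 1, 0  # selections ending outside / inside an exon
    for c in sites:
        if c == "(":
            opened += closed   # start a new exon after any complete selection
        elif c == ")":
            closed += opened   # close the currently open exon here
        else:
            raise ValueError("sites must consist of '(' and ')' only")
    return closed


def bound_holds(max_len):
    """Check p(S) <= F_{|S|+1} for every string S with |S| <= max_len."""
    return all(count_transcripts("".join(s)) <= fib(length + 1)
               for length in range(max_len + 1)
               for s in product("()", repeat=length))


print(bound_holds(12))
print([count_transcripts("()" * j) for j in range(5)])
```

Under this model the alternating strings with 0, 1, 2, 3, 4 pairs give 1, 2, 5, 13, 34 transcripts, which are F_1, F_3, F_5, F_7, F_9.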

The combinatorics is elementary and it only establishes what is already intuitive and obvious: splicing combinatorics dictates that there are a lot of transcripts (exponentially many in the number of acceptor and donor sites) that can, in principle, be spliced together, even from short DNA sequences. The question, then, is why do most genes have so few isoforms? Or do they?