You are currently browsing the tag archive for the ‘Burrows-Wheeler transform’ tag.

Last year I came across a wonderful post on the arXiv, a paper titled A new approach to enumerating statistics modulo *n* written by William Kuszmaul while he was a high school student participating in the MIT Primes Program for Research in Mathematics. Among other things, Kuszmaul solved **the problem of** **counting** **the number of s****ubsets of the n-element set that sum to k mod m.**

This counting problem is related to beautiful (elementary) number theory and combinatorics, and is connected to ideas in (error correcting) coding theory and even computational biology (the Burrows Wheeler transform). I’ll tell the tale, but first explain my personal connection to it, which is the story of **my first paper that wasn’t, and ****how I was admitted to graduate school:**

In 1993 I was a junior (math major) at Caltech and one of my best friends was a senior, Nitu Kitchloo, now an algebraic topologist and professor of mathematics at Johns Hopkins University. I had a habit of asking Nitu math questions, and he had a habit of solving them. One of the questions I asked him that year was:

**How many s****ubsets of the n-element set sum to 0 mod n?**

I don’t remember exactly why I was thinking of the problem, but I do remember that Nitu immediately started looking at the generating function (the polynomial whose coefficients count the number of subsets for each *n*) and magic happened quickly thereafter. We eventually wrote up a manuscript whose main result was the enumeration of , the number of subsets of whose elements sum to *k mod n*. The main result was

.

We were particularly tickled that the formula contained both Euler’s totient function and the Möbius function. It seemed *nice.* So we decided to submit our little result to a journal.

By now months had passed and Nitu had already left Caltech for graduate school. He left me to submit the paper and I didn’t know where, so I consulted the resident combinatorics expert (Rick Wilson) who told me that he liked the result and hadn’t seen it before, but that before sending it off, just in case, I should should consult with this one professor at MIT who was known to be good at counting. I remember Wilson saying something along the lines of “Richard knows everything”. This was in the nascent days of the web, so I hand wrote a letter to Richard Stanley, enclosed a copy of the manuscript, and mailed it off.

A few weeks later I received a hand-written letter from MIT. Richard Stanley had written back to let us know that he very much liked the result… because it had brought back memories of his time as an undergraduate at Caltech, **when he had worked on the same problem! **He sent me the reference:

R.P Stanley and M.F. Yoder, A study of Varshamov codes for asymmetric channels, JPL Technical Report 32-1526, 1973.

Also included was a note that said “Please consider applying to MIT”.

Stanley and Yoder’s result concerned single-error-correcting codes for what are known as Z-channels. These are communication links where there is an asymmetry in the fidelity of transmission: 0 is reliably transmitted as 0 (probability 1), but 1 may be transmitted as either a 1 or a 0 (with some probability *p*). In a 1973 paper A class of codes for asymmetric channels and a problem from the additive theory of numbers published in the IEEE Transactions of Information Theory, R. Varshamov had proposed a single error correcting code for such channels, which was essentially to encode a message by a bit string corresponding to a subset of whose elements sum to *d mod (n+1)*. It’s not hard to see that since zeroes are transmitted faithfully, it would not be hard to detect a single error (and correct it) by summing the elements corresponding to the bit string. Stanley and Yoder’s paper addressed questions related to enumerating the number of codewords. In particular, they were basically working out the solution to the problem Nitu and I had considered. I guess we could have published our paper anyway as we had a few additional results, for example a theorem explaining how to enumerate zero summing subsets of finite Abelian groups, but somehow we never did. There is a link to the manuscript we had written on my website:

N. Kitchloo and L. Pachter, An interesting result about subset sums, 1993.

One generalization we had explored in our paper was the enumeration of what we called , **the number of subsets of the n-element set that sum to k mod m. **We looked specifically at the problem where and proved that

.

What Kuszmaul succeeded in doing is to extend this result to any *m** < n, *which is very nice for two reasons: first, the work completes the investigation of the question of subset sums (to subsets summing to an arbitrary modulus). More importantly, the technique used is that of thinking more generally about “modular enumeration”, which is the problem of finding remainders of polynomials modulo . This led him to numerous other applications, including results on *q*-multinomial and *q-*Catalan numbers, and to the combinatorics of lattice paths. This is the hallmark of excellent mathematics: a proof technique that sheds light on the problem at hand *and many others*.

One of the ideas that modular enumeration is connected to is that of the Burrows-Wheeler transform (BWT). The BWT was published as a DEC tech report in 1994 (based on earlier work of Wheeler in 1983), and is a transform of one string to another of the same length. To understand the transform, consider the example of a binary string of *s *length *n. *The BWT consists of forming a matrix of all cyclic permutations of *s *(one row per permutation), then sorting the rows lexicographically, and finally outputting the last column of the matrix. For example, the string *001101 *is transformed to *110010*. It is obvious by virtue of the definition that any two strings that are the equivalent up to circular permutation will be transformed to the same string. For example, *110100 *will also transform to *110010*.

Circular binary strings are called necklaces in combinatorics. Their enumeration is a classic problem, solvable by Burnside’s lemma, and the answer is that the number of distinct necklaces of length* n* is given by

.

For odd *n* this formula coincides with the subset sum problem (for subsets summing to *0 mod n*). When *n *is prime it is easy to describe a bijection, but for general odd *n *a simple combinatorial bisection is elusive (see Richard Stanley’s Enumerative Combinatorics Volume 1 Chapter 1 Problem 105b).

The Burrows-Wheeler transform is useful because it can be utilized for constant time string matching while requiring an index whose size is only linear in the size of the target. It has therefore become an indispensable tool for genomics. I’m not aware of an application of the elementary observation above, but as the Stanley-Yoder, Kitchloo-P., Kuszmaul timeline demonstrates (21 years in between publications)… **math moves in decades. **I do think there is some interesting combinatorics underlying the BWT, and that its elucidation may turn out to have practical implications. We’ll see.

A final point: it is fashionable to think that biology, unlike math, moves in years. After all, NIH R01 grants are funded for a period of 3–5 years, and researchers constantly argue with journals that publication times should be weeks and not months. But in fact, lots of **basic research in biology moves in decades **as well, just like in mathematics. A good example is the story of CRISP/Cas9, which began with the discovery of “genetic sandwiches” in 1987. The follow-up identification and interpretation and of CRISRPs took decades, mirroring the slow development of mathematics. Today the utility of the CRISPR/Cas9 system depends on the efficient selection of guides and prediction of off-target binding, and as it turns out, tools developed for this purpose frequently use the Burrows-Wheeler transform. It appears that not only binary strings can form circles, but ideas as well…

William Kuszmaul won 3rd place in the 2014 Intel Science Talent Search for his work on modular enumeration. Well deserved, and thank you!

## Recent Comments