This week PLoS Computational Biology published a guide titled Ten Simple Rules for Reproducible Computational Research, by G.K. Sandve, A. Nekrutenko, J. Taylor and E. Hovig. The guide lists ten rules, including

Rule 6: For Analyses that Include Randomness, Note Underlying Random Seeds

This is somewhat akin to the biological practice of storing cDNA libraries at -20C. For computational biologists the rule might seem a bit excessive at first glance, and not quite at the same level of importance as

Rule 1: For Every Result, Keep Track of How It Was Produced

Indeed, I doubt that any of my colleagues keep track of their random seeds; I certainly haven’t. But paying attention to (and recording) seeds used in random number generation is extremely important, and I thought I’d share an anecdote from one of my recent projects to make the point.

My student Atif Rahman has been working on a project for which a basic validation of our method required simulating multiple sets of sequencing reads with errors after introducing mutations into a reference genome. He started by using wgsim from SAMtools (a point of note is that wgsim only simulates reads with uniform sequencing error, but that is another matter). Initially he was running this command:

# No -S option is given, so wgsim chooses its own seed for each run
for i in {1..40}; do
    wgsim -d 200 -N 300000 -h ${genome}.fna ${genome}_${i}_1.fastq \
          ${genome}_${i}_2.fastq
done

Somewhat to his surprise, he found that some of the read sets were eerily similar. Upon closer inspection, he realized what was going on. Multiple datasets were being generated in the same (literal) second, and since he was not setting the seeds, they were being chosen according to the internal (wall) clock, a practice that is common in many programs. As a result those read sets were completely identical. This is because for a fixed seed, a pseudo-random number generator (which is how the computer is generating random numbers) computes the “random” numbers in a deterministic way (for more on random number generation see http://www.random.org/randomness/ ).
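The effect is easy to reproduce without wgsim. The sketch below is only an illustration, not what wgsim does internally: it uses bash's built-in RANDOM variable (assigning to it seeds the generator) and a hypothetical helper called draw to show that two "runs" seeded with the same clock value produce the same numbers.

```shell
#!/usr/bin/env bash
# draw is a hypothetical helper: seed bash's generator, then print three draws.
draw() {
    RANDOM=$1                      # assigning to RANDOM seeds the generator
    echo "$RANDOM $RANDOM $RANDOM"
}

# Two runs that start within the same second read the same clock value,
# which is simulated here by reusing a single timestamp as the seed.
clock_seed=$(date +%s)
run_a=$(draw "$clock_seed")
run_b=$(draw "$clock_seed")

# A run that starts one second later is seeded with a different value.
run_c=$(draw "$((clock_seed + 1))")

echo "run A (seed $clock_seed):       $run_a"
echo "run B (seed $clock_seed):       $run_b"
echo "run C (seed $((clock_seed + 1))): $run_c"
```

Runs A and B print the same three numbers because they share a seed, which is the shell-sized version of what happened to the read sets.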

One way to circumvent the problem is to first generate a sequence of pseudo-random numbers (of course remembering to record the seed used!) and then to use the resulting numbers as seeds in wgsim using the “-S” option. The quick hack that Atif used was to insert a sleep between iterations, so that no two runs start within the same second.
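A minimal sketch of the first approach follows. The assumptions are mine and not part of the original workflow: the master seed value, the make_seeds helper and the placeholder genome name are all hypothetical, and bash's RANDOM stands in for whatever generator you prefer. Two 15-bit draws are combined into one seed so that collisions among the 40 seeds are very unlikely, and 1 is added to keep every seed strictly positive, in case the tool treats non-positive seeds as a request to use the clock. The wgsim commands are only printed here; remove the echo to actually run them.

```shell
#!/usr/bin/env bash
master_seed=12345              # hypothetical value; record it with the results
n_sets=40
genome=${genome:-reference}    # placeholder name if genome is not already set

# make_seeds is a hypothetical helper: print n seeds derived from one master seed.
make_seeds() {
    local master=$1 n=$2 i
    RANDOM=$master             # assigning to RANDOM seeds bash's generator
    for ((i = 0; i < n; i++)); do
        # Combine two 15-bit draws into one value in the range 1..2^30
        echo $(( ((RANDOM << 15) | RANDOM) + 1 ))
    done
}

seeds=($(make_seeds "$master_seed" "$n_sets"))

for idx in "${!seeds[@]}"; do
    i=$((idx + 1))
    # Dry run: print each command together with the seed it would use
    echo wgsim -S "${seeds[idx]}" -d 200 -N 300000 -h "${genome}.fna" \
         "${genome}_${i}_1.fastq" "${genome}_${i}_2.fastq"
done
```

Since the sequence produced by RANDOM may differ between bash versions, it is safer to save the printed per-run seeds as well as the master seed.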

In summary, random number generation should not be done randomly.