This week PLoS Computational Biology published a guide titled Ten Simple Rules for Reproducible Computational Research, by G.K. Sandve, A. Nekrutenko, J. Taylor and E. Hovig. The guide lists ten rules, including

Rule 6: For Analyses that Include Randomness, Note Underlying Random Seeds

This is somewhat akin to the biological practice of storing cDNA libraries at −20°C. For computational biologists the rule might seem a bit excessive at first glance, and not quite at the same level of importance as

Rule 1: For Every Result, Keep Track of How It Was Produced

Indeed, I doubt that any of my colleagues keep track of their random seeds; I certainly haven’t. But paying attention to (and recording) seeds used in random number generation is extremely important, and I thought I’d share an anecdote from one of my recent projects to make the point.

My student Atif Rahman has been working on a project in which a basic validation of our method required simulating multiple sets of sequencing reads, with errors, after introducing mutations into a reference genome. He started by using wgsim from SAMtools (a point of note is that wgsim only simulates reads with uniform sequencing error, but that is another matter). Initially he was running this command:

```shell
for i in {1..40}; do
    wgsim -d 200 -N 300000 -h ${genome}.fna \
        ${genome}_${i}_1.fastq ${genome}_${i}_2.fastq
done
```

Somewhat to his surprise, he found that some of the read sets were eerily similar. Upon closer inspection, he realized what was going on. Multiple datasets were being generated within the same (literal) second, and since he was not setting the seeds, they were being chosen from the internal (wall) clock, a practice that is common in many programs. As a result those read sets were completely identical: for a fixed seed, a pseudo-random number generator (which is how the computer generates random numbers) computes the “random” numbers in a deterministic way (for more on random number generation see http://www.random.org/randomness/).
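The determinism is easy to see even in the shell: bash’s built-in $RANDOM variable is a small PRNG that can be seeded by assignment, and reseeding with the same value replays exactly the same sequence (a minimal illustration of the general phenomenon, not specific to wgsim):

```shell
# Seeding bash's built-in PRNG twice with the same value replays
# the exact same "random" sequence; a different seed diverges.
RANDOM=42; a="$RANDOM $RANDOM $RANDOM"
RANDOM=42; b="$RANDOM $RANDOM $RANDOM"
RANDOM=43; c="$RANDOM $RANDOM $RANDOM"
echo "$a"
echo "$b"   # identical to the previous line
echo "$c"   # a different seed gives a different sequence
```

This is exactly what wgsim was doing, except that its default seed came from the clock, so two runs in the same second replayed the same stream.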

One way to circumvent the problem is to first generate a sequence of pseudo-random numbers (of course remembering to record the seed used!) and then to use the resulting numbers as seeds in wgsim using the “-S” option. The quick hack that Atif used was to insert a sleep() between iterations.
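In shell, that fix might look like the following sketch: seed a single master stream, record one number, and derive the per-dataset seeds from it. The reference name and master seed are placeholders, and the wgsim commands are echoed rather than executed (a dry run), since the reference genome is assumed to exist on the reader’s system.

```shell
#!/bin/bash
genome=ref        # placeholder reference name
RANDOM=2013       # master seed: recording this one number reproduces the run
rm -f seeds.log
for i in {1..40}; do
    seed=$RANDOM                           # per-dataset seed from the master stream
    echo "dataset $i seed $seed" >> seeds.log
    # dry run: drop the leading "echo" to actually invoke wgsim
    echo wgsim -S $seed -d 200 -N 300000 -h ${genome}.fna \
        ${genome}_${i}_1.fastq ${genome}_${i}_2.fastq
done
```

With seeds.log in hand, any single dataset can be regenerated later by rerunning wgsim with the recorded `-S` value.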

In summary, random number generation should not be done randomly.


## 8 comments


October 25, 2013 at 2:49 pm

Saheli Datta (@SaheliDatta): I believe @emergentnexus has come upon a similar issue recently as well.

“In summary, random number generation should not be done randomly.” For other permutations of this line, see the ostensibly truly random Random.org, which is celebrating its 15th anniversary with an Android app. The most famous version of course is:

“Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.”

– the frequently quoted Johnny von Neumann,

*apparently* in the extremely obscure and difficult to digitally *or* physically find,

“Various techniques used in connection with random digits,” Journal of Research of the National Bureau of Standards, Appl. Math. Series (1951), 3, 36–38. So much for Google Books revealing the primary literature!

October 25, 2013 at 9:21 pm

pmelsted: Try this in R: start the program and quit, answering yes when asked to save the workspace. Start R again, type runif(1), then quit without saving the workspace. Start R a third time, type runif(1), and you get the same random number again. Evidently the workspace contains the state of the internal random number generator. Although I can see why this happens from a programming point of view, it breaks the illusion that random numbers are random.

October 25, 2013 at 10:29 pm

Reed Cartwright: I’ve run into the seed-via-time issue before. I solved it by mixing the program’s PID into the seed as well. I use this code in several projects:

https://github.com/reedacartwright/jak/blob/master/src/jak.cpp#L79-L91
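The same idea can be sketched in shell (a hedged illustration of the trick Reed describes, not his C++ code): XOR the wall-clock time with the process ID, so that two invocations launched in the same second still get different seeds because their PIDs differ.

```shell
# Mix time with the PID: two processes started in the same second
# have different PIDs, and hence (almost always) different seeds.
seed=$(( $(date +%s) ^ $$ ))
echo "seed: $seed"
```

The derived seed should still be recorded for reproducibility; mixing in the PID only prevents accidental collisions between concurrent runs.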

October 26, 2013 at 6:57 am

Robert Fisher: Great post, Lior. Random seeds are crucially important for reproducibility, but this seemingly obvious point can often fly under the radar. I encountered it in a different context, when driving turbulent fluids with a stochastic driver based on an Ornstein–Uhlenbeck process. Strangely, restarting a large three-dimensional turbulent fluid simulation from a checkpoint did not produce consistent results until we realized one had to checkpoint the random values as well to ensure consistency across the restart.

To further complicate matters, one has to be extremely cautious about the type of random number generator used. There was the infamous RANDU generator from IBM in the past, but strange problems persist even to this day. For instance, it turns out that the behavior of the “standard” intrinsic Fortran generator random_number is left unspecified, so compiler providers produce wildly varying results even from identical source code! To ensure full reproducibility, it’s also crucial to specify the precise pseudo-random number generator algorithm used.

Best wishes,

Bob

October 27, 2013 at 8:48 am

Arjun Krishnan: Good post! Your statement reminded me of this quote:

“The generation of random numbers is too important to be left to chance.”

— Robert R. Coveyou, Oak Ridge National Laboratory

October 28, 2013 at 8:48 pm

Kasper Hansen: The post only semi-solves the problem.

A good pseudo-random number generator produces a sequence of numbers that is indistinguishable from i.i.d. uniform on [0,1]. But there is absolutely no guarantee that two sequences started from different seeds are independent, at least for the generators I know of (not that I have encyclopedic knowledge of this). This is a common mistake; it is especially relevant in parallel programming, where one has to maintain a single random number stream across multiple processes.

In this case, the correct way to do this would be to generate a single big file which is subsequently split into 40 different files, for example using the unix command split (assuming wgsim tries to produce i.i.d. samples).
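For instance, with the unix split command (illustrated here on plain numbered lines rather than actual wgsim output; FASTQ would be split the same way, keeping the lines-per-chunk a multiple of 4 so reads are not cut in half):

```shell
# One stream split into 40 chunks: all draws come from a single seeded
# sequence, so the chunks are as independent as the generator allows.
rm -f chunk_*
seq 1 400 > big.txt                 # stand-in for one big simulated file
split -l 10 -d big.txt chunk_       # 40 files: chunk_00 ... chunk_39
ls chunk_?? | wc -l                 # 40
```

For paired-end wgsim output both mate files would need to be split with the same -l so that read pairs stay aligned across chunks. (Note the -d flag for numeric suffixes is GNU split.)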

Of course, for the problem you are trying to address, your solution may be good enough for practical purposes.

October 30, 2013 at 5:34 pm

Nicolas Bray: “But there is absolutely no guarantee that two sequences, starting from different seeds, are independent, … In this case, the correct way to do this would be to generate a single big file which is subsequently split into 40 different files”

A possibly stupid question: Is there any a priori reason to think that these 40 different chunks coming from one seed will be any more independent than the results coming from 40 different seeds?

November 3, 2013 at 9:09 pm

ross lazarus: I’ll bite! Possibly not, if the 40 different seeds are themselves chosen ‘randomly’ and the PRNG has some horrible dependence between numbers spaced exactly 1 million apart in every sequence…

Sorry, no proof, just a thought experiment. Assume the PRNG is imperfect (all deterministic ones are!) and there are chunky ‘hyperplanes’ in the pseudorandom space it generates. If it so happens that there is some complex but important non-independence between every millionth number, then using 40 streams that are exactly one million numbers apart from each other is probably a really bad idea. Sure, this is unlikely, but that’s why I would guess you would be a shade more likely to get independent sequences if each one starts from a truly random seed (e.g. lavarand), so the 40 sequences at least start independently.