So you’re an academic and you’ve written some bioinformatics software. You heard that:
1. Somebody will build on your code.
Nope. Ok, maybe not never but almost certainly not. There are many reasons for this. The primary reason in my view is that most bioinformatics software is of very poor quality (more on why this is the case in #2). Who wants to read junk software, let alone try to edit it or build on it? Most bioinformatics software is also targeted at specific applications. Biologists who use application specific software are typically not interested in developing or improving software because methods development is not their main interest and software development is not their best skill. In the computational biology areas I have worked in during the past 20 years (and I have reviewed/tested/examined/used hundreds or even thousands of programs) I can count the software programs that have been extended or developed by someone other than the author on a single hand. Software that has been built on/extended is typically of the framework kind (e.g. SAMtools being a notable example) but even then development of code by anyone other than the author is rare. For example, for the FSA alignment project we used HMMoC, a convenient compiler for HMMs, but has anyone ever built on the codebase? Doesn’t look like it. You may have been told by your PI that your software will take on a life of its own, like Linux, but the data suggests that is simply not going to happen. No, Gnu is Not Unix and your spliced aligner is not the Linux kernel. Most likely you alone will be the only user of your software, so at least put in some comments, because otherwise the first time you have to fix your own bug you won’t remember what you were doing in the code, and that is just sad.
2. You should have assembled a team to build your software.
Nope. Although most corporate software these days is built by large teams working collaboratively, scientific software is different. I agree with James Taylor, who in the anatomy of successful computational biology software paper stated that ” A lot of traditional software engineering is about how to build software effectively with large teams, whereas the way most scientific software is developed is (and should be) different. Scientific software is often developed by one or a handful of people.” In fact, I knew you were a graduate student because most bioinformatics software is written singlehandedly by graduate students (occasionally by postdocs). This is actually problem (although not your fault!) Students such as yourself graduate, move on to other projects and labs, and stop maintaining (see #5), let alone developing their code. Many PIs insist on “owning” software their students wrote, hoping that new graduate students in their lab will develop projects of graduated students. But new students are reluctant to work on projects of others because in academia improvement of existing work garners much less credit than new work. After all, isn’t that why you were writing new software in the first place? I absolve you of your solitude, and encourage you to think about how you will create the right incentive structure for yourself to improve your software over a period of time that transcends your Ph.D. degree.
3. If you choose the right license more people will use and build on your program.
Nope. People have tried all sorts of licenses but the evidence suggests the success of software (as measured by usage, or development of the software by the computational biology community) is not correlated with any particular license. One of the most widely used software suites in bioinformatics (if not the most widely used) is the UCSC genome browser and its associated tools. The software is not free, in that even though it is free for academic, non-profit and personal use, it is sold commercially. It would be difficult to argue that this has impacted its use, or retarded its development outside of UCSC. To the contrary, it is almost inconceivable that anyone working in genetics, genomics or bioinformatics has not used the UCSC browser (on a regular basis). In fact, I have, during my entire career, heard of only one single person who claims not to use the browser; this person is apparently boycotting it due to the license. As far as development of the software, it has almost certainly been hacked/modified/developed by many academics and companies since its initial release (e.g. even within my own group). In anatomy of successful computational biology software published in Nature Biotechnology two years ago, a list of “software for the ages” consists of programs that utilize a wide variety of licenses, including Boost, BSD, and GPL/LGPL. If there is any pattern it is that the most common are GPL/LGPL, although I suspect that if one looks at bioinformatics software as a whole those licenses are also the most common in failed software. The key to successful software, it appears, is for it to be useful and usable. Worry more about that and less about the license, because ultimately helping biologists and addressing problems in biomedicine might be more ethical than hoisting the “right” software license flag.
4. Making your software free for commercial use shows you are not against companies.
Nope. The opposite is true. If you make your software free for commercial use, you are effectively creating a subsidy for companies, one that is funded by your university / your grants. You are a corporate hero! Congratulations! You have found a loophole for transferring scarce public money to the private sector. If you’ve licensed your software with BSD you’ve added another subsidy: a company using your software doesn’t have any reason to share their work with the academic community. There are two reasons why you might want to reconsider offering such subsidies. First, by denying yourself potential profits from sale of your software to industry, you are definitively removing any incentive for future development/maintenance of the software by yourself or future graduate students. Most bioinformatics software, when sold commercially, costs a few thousand dollars. This is a rounding error for companies developing cancer or other drugs at the cost of a billion dollars per drug and a tractable proposition even for startups, yet the money will make a real difference to you three years out from your Ph.D. when you’re earning a postdoc salary. A voice from the future tells you that you’ll appreciate the money, and it will help you remember that you really ought to fix that bug reported on GitHub posted two months ago. You will be part of the global improvement of bioinformatics software. And there is another reason to sell your software to companies: help your university incentivize more and better development of bioinformatics software. At most universities a majority of the royalties from software sales go to the institution (at UC Berkeley, where I work, its 2/3). Most schools, especially public universities, are struggling these days and have been for some time. Help them out in return for their investment in you; you’ll help generate more bioinformatics hires, and increase appreciation for your field. In other words, although it is not always practical or necessary, when possible, please sell your software commercially.
5. You should maintain your software indefinitely.
Nope. Someday you will die. Before that you will get a job, or not. Plan for your software to have a limited shelf-life, and act accordingly.
6. Your “stable URL” can exist forever.
Nope. When I started out as a computational biologist in the late 1990s I worked on genome alignment. At the time I was excited about Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. This was a framework for specifying bioinformatics related dynamic programming algorithms, such as the Needleman-Wunsch or Smith-Waterman algorithms. The authors wrote that “A stable URL for Dynamite documentation, help and information is http://www.sanger.ac.uk/~birney/dynamite/” Of course the URL is long gone, and by no fault of the authors. The website hosting model of the late 1990s is long extinct. To his credit, Ewan now hosts the Dynamite code on GitHub, following a welcome trend that is likely to extend the life of bioinformatics programs in the future. Will GitHub be around forever? We’ll see. But more importantly, software becomes extinct (or ought to) for reasons other than just 404 errors. For example, returning to sequence alignment, the ClustalW program of 1994 was surpassed in accuracy and speed by many other multiple alignment programs developed in the 2000s. Yet people kept using ClustalW anyway, perhaps because it felt like a “safe bet” with its many citations (eventually in 2011 Clustalw was updated to Clustal Omega). The lag in improving ClustalW resulted in a lot of poor alignments being utilized in genomics studies for a decade (no fault of the authors of ClustalW, but harmful nonetheless). I’ve started the habit of retiring my programs, via announcement on my website and PubMed. Please do the same when the time comes.
7. You should make your software “idiot proof”.
Nope. Your users, hopefully biologists (and not other bioinformatics programmers benchmarking your program to show that they beat it) are not idiots. Listen to them. Back in 2004 Nicolas Bray and I published a webserver for the alignment program MAVID. Users were required to input FASTA files. But can you guess why we had to write a script called checkfasta? (hint: the most popular word processor was and is Microsoft Word). We could have stopped there and laughed at our users, but after much feedback we realized the real issue was that FASTA was not necessarily the appropriate input to begin with. Our users wanted to be able to directly input Genbank accessions for alignment, and eventually Nicolas Bray wrote the perl scripts to allow them to do that (the feature lives on here). The take home message for you is that you should deal with user requests wisely, and your time will be needed not only to fix bugs but to address use cases and requested features you never thought of in the first place. Be prepared to treat your users respectfully, and you and your software will benefit enormously.
8. You used the right programming language for the task.
Nope. First it was perl, now it’s python. First it was MATLAB, now it’s R. First it was C, then C++. First it was multithreading now it’s Spark. There is always something new coming, which is yet another reason that almost certainly nobody is going to build on your code. By the time someone gets around to having to do something you may have worked on, there will be better ways. Therefore, the main thing is that your software should be written in a way that makes it easy to find and fix bugs, fast, and efficient (in terms of memory usage). If you can do that in Fortran great. In fact, in some fields not very far from bioinformatics, people do exactly that. My advice: stay away from Fortran (but do adopt some of the best practice advice offered here).
9. You should have read Lior Pachter’s blog post about the myths of bioinformatics software before starting your project.
Nope. Lior Pachter was an author on the Vista paper describing a program for which the source code was available only “upon request”.