The myths of bioinformatics software

July 10, 2015 in academia | Tags: Bioinformatics, licensing, software | by Lior Pachter

So you’re an academic and you’ve written some bioinformatics software. You heard that:

1. Somebody will build on your code.

Nope. Ok, maybe not never but almost certainly not. There are many reasons for this. The primary reason in my view is that most bioinformatics software is of very poor quality (more on why this is the case in #2). Who wants to read junk software, let alone try to edit it or build on it? Most bioinformatics software is also targeted at specific applications. Biologists who use application specific software are typically not interested in developing or improving software because methods development is not their main interest and software development is not their best skill. In the computational biology areas I have worked in during the past 20 years (and I have reviewed/tested/examined/used hundreds or even thousands of programs) I can count the software programs that have been extended or developed by someone other than the author on a single hand. Software that has been built on/extended is typically of the framework kind (e.g. SAMtools being a notable example) but even then development of code by anyone other than the author is rare. For example, for the FSA alignment project we used HMMoC, a convenient compiler for HMMs, but has anyone ever built on the codebase? Doesn’t look like it. You may have been told by your PI that your software will take on a life of its own, like Linux, but the data suggests that is simply not going to happen. No, Gnu is Not Unix and your spliced aligner is not the Linux kernel. Most likely you alone will be the only user of your software, so at least put in some comments, because otherwise the first time you have to fix your own bug you won’t remember what you were doing in the code, and that is just sad.

2. You should have assembled a team to build your software.

Nope. Although most corporate software these days is built by large teams working collaboratively, scientific software is different. I agree with James Taylor, who in the anatomy of successful computational biology software paper stated that “A lot of traditional software engineering is about how to build software effectively with large teams, whereas the way most scientific software is developed is (and should be) different. Scientific software is often developed by one or a handful of people.” In fact, I knew you were a graduate student because most bioinformatics software is written singlehandedly by graduate students (occasionally by postdocs). This is actually a problem (although not your fault!). Students such as yourself graduate, move on to other projects and labs, and stop maintaining (see #5), let alone developing, their code. Many PIs insist on “owning” software their students wrote, hoping that new graduate students in their lab will develop projects of graduated students. But new students are reluctant to work on projects of others because in academia improvement of existing work garners much less credit than new work. After all, isn’t that why you were writing new software in the first place? I absolve you of your solitude, and encourage you to think about how you will create the right incentive structure for yourself to improve your software over a period of time that transcends your Ph.D. degree.

3. If you choose the right license more people will use and build on your program.

Nope. People have tried all sorts of licenses but the evidence suggests the success of software (as measured by usage, or development of the software by the computational biology community) is not correlated with any particular license. One of the most widely used software suites in bioinformatics (if not the most widely used) is the UCSC genome browser and its associated tools. The software is not free, in that even though it is free for academic, non-profit and personal use, it is sold commercially. It would be difficult to argue that this has impacted its use, or retarded its development outside of UCSC. To the contrary, it is almost inconceivable that anyone working in genetics, genomics or bioinformatics has not used the UCSC browser (on a regular basis). In fact, I have, during my entire career, heard of only one single person who claims not to use the browser; this person is apparently boycotting it due to the license. As far as development of the software, it has almost certainly been hacked/modified/developed by many academics and companies since its initial release (e.g. even within my own group). In anatomy of successful computational biology software published in Nature Biotechnology two years ago, a list of “software for the ages” consists of programs that utilize a wide variety of licenses, including Boost, BSD, and GPL/LGPL. If there is any pattern it is that the most common are GPL/LGPL, although I suspect that if one looks at bioinformatics software as a whole those licenses are also the most common in failed software. The key to successful software, it appears, is for it to be useful and usable. Worry more about that and less about the license, because ultimately helping biologists and addressing problems in biomedicine might be more ethical than hoisting the “right” software license flag.

4. Making your software free for commercial use shows you are not against companies.

Nope. The opposite is true. If you make your software free for commercial use, you are effectively creating a subsidy for companies, one that is funded by your university / your grants. You are a corporate hero! Congratulations! You have found a loophole for transferring scarce public money to the private sector. If you’ve licensed your software with BSD you’ve added another subsidy: a company using your software doesn’t have any reason to share their work with the academic community. There are two reasons why you might want to reconsider offering such subsidies. First, by denying yourself potential profits from sale of your software to industry, you are definitively removing any incentive for future development/maintenance of the software by yourself or future graduate students. Most bioinformatics software, when sold commercially, costs a few thousand dollars. This is a rounding error for companies developing cancer or other drugs at the cost of a billion dollars per drug and a tractable proposition even for startups, yet the money will make a real difference to you three years out from your Ph.D. when you’re earning a postdoc salary. A voice from the future tells you that you’ll appreciate the money, and it will help you remember that you really ought to fix that bug reported on GitHub two months ago. You will be part of the global improvement of bioinformatics software. And there is another reason to sell your software to companies: help your university incentivize more and better development of bioinformatics software. At most universities a majority of the royalties from software sales go to the institution (at UC Berkeley, where I work, it’s 2/3). Most schools, especially public universities, are struggling these days and have been for some time. Help them out in return for their investment in you; you’ll help generate more bioinformatics hires, and increase appreciation for your field.
In other words, although it is not always practical or necessary, when possible, please sell your software commercially.

5. You should maintain your software indefinitely.

Nope. Someday you will die. Before that you will get a job, or not. Plan for your software to have a limited shelf-life, and act accordingly.

6. Your “stable URL” can exist forever.

Nope. When I started out as a computational biologist in the late 1990s I worked on genome alignment. At the time I was excited about Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. This was a framework for specifying bioinformatics related dynamic programming algorithms, such as the Needleman-Wunsch or Smith-Waterman algorithms. The authors wrote that “A stable URL for Dynamite documentation, help and information is http://www.sanger.ac.uk/~birney/dynamite/”. Of course the URL is long gone, and by no fault of the authors. The website hosting model of the late 1990s is long extinct. To his credit, Ewan now hosts the Dynamite code on GitHub, following a welcome trend that is likely to extend the life of bioinformatics programs in the future. Will GitHub be around forever? We’ll see. But more importantly, software becomes extinct (or ought to) for reasons other than just 404 errors. For example, returning to sequence alignment, the ClustalW program of 1994 was surpassed in accuracy and speed by many other multiple alignment programs developed in the 2000s. Yet people kept using ClustalW anyway, perhaps because it felt like a “safe bet” with its many citations (eventually in 2011 ClustalW was updated to Clustal Omega). The lag in improving ClustalW resulted in a lot of poor alignments being utilized in genomics studies for a decade (no fault of the authors of ClustalW, but harmful nonetheless). I’ve started the habit of retiring my programs, via announcement on my website and PubMed. Please do the same when the time comes.

7. You should make your software “idiot proof”.

Nope. Your users, hopefully biologists (and not other bioinformatics programmers benchmarking your program to show that they beat it) are not idiots. Listen to them. Back in 2004 Nicolas Bray and I published a webserver for the alignment program MAVID. Users were required to input FASTA files. But can you guess why we had to write a script called checkfasta? (hint: the most popular word processor was and is Microsoft Word). We could have stopped there and laughed at our users, but after much feedback we realized the real issue was that FASTA was not necessarily the appropriate input to begin with. Our users wanted to be able to directly input Genbank accessions for alignment, and eventually Nicolas Bray wrote the perl scripts to allow them to do that (the feature lives on here). The take home message for you is that you should deal with user requests wisely, and your time will be needed not only to fix bugs but to address use cases and requested features you never thought of in the first place. Be prepared to treat your users respectfully, and you and your software will benefit enormously.

8. You used the right programming language for the task.

Nope. First it was Perl, now it’s Python. First it was MATLAB, now it’s R. First it was C, then C++. First it was multithreading, now it’s Spark. There is always something new coming, which is yet another reason that almost certainly nobody is going to build on your code. By the time someone gets around to having to do something you may have worked on, there will be better ways. Therefore, the main thing is that your software should be written in a way that makes it easy to find and fix bugs, fast, and efficient (in terms of memory usage). If you can do that in Fortran, great. In fact, in some fields not very far from bioinformatics, people do exactly that. My advice: stay away from Fortran (but do adopt some of the best practice advice offered here).

9. You should have read Lior Pachter’s blog post about the myths of bioinformatics software before starting your project.

Nope. Lior Pachter was an author on the Vista paper describing a program for which the source code was available only “upon request”.

34 comments

July 10, 2015 at 6:31 am

Justin St. Giles Payne

“In other words, although it is not always practical or necessary, when possible, please sell your software commercially.”

Yeah, no. Please don’t.

1) Not every non-academic entity is a cash-flush pharma company or VC-friendly startup. For us in the government/regulatory sphere, “free for non-commercial use” has a host of problems, not least of which is that such verbiage more or less requires (as we see it) us to at least investigate the possibility of contracting, but none of us actually has contract authority. What you’re telling us is that even though you’ve created a tool that we could develop regulatory applications for, we’re not going to be able to use it for potentially two years (and I’ve seen acquisition negotiations go even longer).

You may think that government scientists/regulators constitute “non-profit”, but that’s not entirely accurate, and acquisitions law doesn’t tell us how to proceed when the tool has a price, but only for non-academic users. And it’s really difficult for government researchers to try to negotiate anything with an outfit that doesn’t have its own government-acquisitions lawyers, because they then look to us for advice on how to proceed with actually getting paid and we don’t know, either. (Above our pay grade!)

So we just avoid those tools. Commercial-only paid licensing might very well avoid transferring public funds to the corporate sphere (somehow?) but it also avoids translating the benefits of your work into an impact on public health. That’s also something to consider.

2) More prosaically – the efforts you’ll expend to put a content gate around your source code, so that commercial entities don’t take advantage, are exactly the same efforts that will make it impossible to containerize your tool, and we’re rapidly approaching the age when un-containerizable tools (like those hosted at Sourceforge, frequently, or those whose source is available only from a web form) simply won’t be relevant because they simply won’t be deployable in a modern bioinformatics service-oriented architecture. (Trust me, it’s a thing.)

I’m not saying don’t make money off your software. Remember as the author you have unique and marketable experience in how it should be deployed and used, so market that. Deploy to the cloud and sell Software-as-a-service by whatever model you prefer. Sell a version packaged with a nice gui and security features. You can do all of that on the side of a postdoc, if you want. Best of both worlds, maybe.

But saying “commercial users have to pay for this” runs up on a lot of corner cases, groups and organizations that aren’t a private business but aren’t strictly academic, either, at least if by “academic” what we really mean is “non-profits with shoestring budgets.” For that matter, non-commercial licensing hangs up a lot of legitimately academic groups, as well: service providers whom you have now put into the position of having to determine which of their users are commercial, so they can be billed on your behalf.

Please don’t sell your software. Sell anything but your software. Sell deployment. Sell support. Sell services. Sell GUI sugar on top with all the sprinkles. Sell the documentation, even. But selling your software isn’t really different than selling your paper in how it turns the spirit of scientific practice more or less on its head.

July 10, 2015 at 11:23 am

John Didion

Licensing certainly is a complex issue, with excellent points by both Lior and Justin. As a postdoc at the NIH, I largely ignore the issue, because I assume that my research constitutes not-for-profit work, but given Justin’s comments maybe I need to reevaluate this. On the other hand, any software I create is public domain (under 17 U.S.C. § 105), so I don’t really have to think about what license to use.

I wonder if there’s room for a “Good Citizen” license. Something along the lines of: “Because restrictive licenses can create unintended consequences for the use of software even by not-for-profit entities, we license this software for all uses with proper attribution to the authors. However, if you use this software to make a profit, you are strongly urged to be a Good Citizen and donate to its ongoing development and support. While you are not obligated to do so, if we find out that you’re getting rich off of the hard work of underpaid grad students and postdocs, we’re going to do our best to publicly shame you into fulfilling your Good Citizen duty.”

July 10, 2015 at 3:29 pm

homolog.us

http://www.dbad-license.org/

July 10, 2015 at 6:37 am

Klaas Vandepoele

“I’ve started the habit of retiring my programs, via announcement on my website and PubMed.”

FYI, the PubMed Commons comment redirects to a URL that does not exist.

July 10, 2015 at 7:22 am

Lior Pachter

That is exactly my point #6! FSA however is very much an active project in the hands of Robert Bradley at the FHCRC. The server went down recently due to a machine crashing in the math department where I work, and we are working on getting the server back up (unfortunately a difficult task because some of our scripts were lost).

July 10, 2015 at 1:09 pm

Vandepoele lab (@plaza_genomics)

Thanks for the update. I am pleased to read that FSA has not officially been declared dead (yet).

Would FSA be a good choice to align highly diverged promoter sequences (of distantly related species), or would local rearrangements interfere with the ‘global’ property of the aligner?

July 10, 2015 at 1:20 pm

Lior Pachter

I should have clarified that the FSA source code is available here; it is only the web server that is down. To answer your question, I imagine that FSA would not be ideal for distant promoter alignments, precisely for the reason you gave.

July 10, 2015 at 6:57 am

homolog.us

Will I Use Kallisto? Definitely, Most Likely and Never

http://www.homolog.us/blogs/blog/2015/07/10/will-i-use-kallisto-definitely-most-likely-and-never/

July 10, 2015 at 7:24 am

Lior Pachter

Thanks for the post. I don’t know about my coauthors on kallisto but I personally agree with assessments (i),(ii) and (iii) in your post.

July 10, 2015 at 7:21 am

Fashionable Librarian

Reblogged this on Concierge Librarian.

July 10, 2015 at 10:39 am

arthroceph

Good stuff, but how about “the 8 myths of bioinformatics software”?
Because no. 9 cannot be a myth. You wrote a post today, 10 July, and it’s already mythical? O.K., I understand it’s self-deprecating humour, but then isn’t it best to exclude it from the rest of the list? Cheers!

July 10, 2015 at 10:43 am

Lior Pachter

Indeed, I alluded to Russell’s paradox. I confess.

July 10, 2015 at 2:30 pm

Aerval

While I agree with most of your arguments, I miss bioinformatics pipelines in them. It may be true that 90% of software is not used again, but the remaining 10% (or whatever) may at some point be included in Galaxy or any other shiny new pipeline environment. If your license is too restrictive to be compatible with the pipeline’s license, you have a problem (as suggested by Titus Brown for kallisto and Galaxy).

Another point I disagree with is that nobody builds on your software. It may be true that most likely nobody will take a small part of your software and include it in another program. But what does happen is that you add small bits that a piece of software is missing for your particular application (for example, I added a line with the total number of base pairs to FastQC). For this it is a huge advantage to have good, accessible code.

July 10, 2015 at 3:03 pm

Lior Pachter

The kind of addition you are discussing (e.g. for FastQC) is no problem as long as the software has available source code (I’m a strong advocate for that, in fact I think it is essential). That is unrelated to the issue of how it is licensed (unfortunately, the widely used term “open source” confounds the two). As for kallisto, it is intended to run on a desktop (or even laptop) and we have worked hard to eliminate unnecessary parameters, so we fail to see its value in complex pipeline front-ends that are designed to help with the problem of running complex software that requires large compute resources and has many knobs to tune. Perhaps Titus Brown is right that kallisto is not well suited for Galaxy.

July 10, 2015 at 3:34 pm

Uri Laserson

Regarding licensing, I think that in many cases, the software is really not all that valuable, especially compared with the value of the data (which is also generally given away for free). I’d prefer the most permissive license possible (e.g., Apache) in order to encourage the greatest number of people to use my software (including industry). This will help build a larger community of users, and ultimately improve the quality of the software as well. It’s always a shame when people in industry won’t use a good software tool because it’s, for example, GPL’d or something exotic. Also, if your software is complex enough, companies will pay you to help them with it anyway.

July 10, 2015 at 7:04 pm

Brian Bushnell

I also work for Berkeley, and you can hardly fault programmers for giving their programs freely to industry! It’s in the hands of the legal department. I specifically requested a license that was free for academic use and required payment for commercial use, but I didn’t get to write the license. So it’s free for everyone.

Incidentally, you can add another finger to your hand… not one, but TWO other people have extended my code! 🙂

July 11, 2015 at 2:21 am

Lior Pachter

There are many additional interesting comments related to this blog post on Hacker News: https://news.ycombinator.com/item?id=9863721

July 11, 2015 at 6:42 am

binay panda

i wonder why folks view biological methods/reagents developed in a lab, funded by taxpayers’ money, as different from software developed in the same lab!! should the same rule that applies to methods/reagents not be applied to a piece of software? biologists aren’t comfortable paying for software, period. they do, however, at times buy software out of compulsion, and when they do, they feel rather disgruntled. it’s much easier for biologists to spend millions on reagents but very hard to convince them to spend even hundreds on a good piece of software. the real problem is that most biologists feel that the real work is biology and that, by the way, in order to do good biology, sometimes you need some help/support, and that’s bioinformatics & software. unless this changes, the real value of a tool/software will not change. btw, this has been my own experience. it’s possible others have different views.

July 12, 2015 at 6:37 am

Titus Brown

Hi Lior, thanks for the post – here are my thoughts:

http://ivory.idyll.org/blog/2015-response-to-software-myths.html

July 12, 2015 at 3:33 pm

Bill

I agree with almost everything, but in #8, My Advice: Learn Fortran (2008). With the expected growth in the size of data sets, moving to distributed-memory parallel computing seems inevitable. Which means software developers need to learn how to do that. Understanding how to write a parallel algorithm is the hard part, but having an implementation method that is easy helps. Fortran is one on a very short list of languages that incorporate distributed-memory programming capability directly into the language syntax and semantics. And it is the only one on that list with the backing of an ISO standard. Fortran 2015 is on its way, with enhancements to the parallel programming features. ISO standardization promotes code portability, and the huge existing code base makes it very unlikely that Fortran would disappear in the future.

July 12, 2015 at 12:45 pm

David

Re myth 1: I doubt this is specific to bioinf software. I would have thought that most software packages never get built on. Just like most papers hardly get read. So what?

July 13, 2015 at 10:14 am

Radhouane Aniba (@radaniba)

Lior, great post. Although I agree with most of what you said, it read to me more as a ‘personal judgement’ of what not to do when developing bioinformatics software. Lately I discovered that early versions of samtools did not return the right exit code when a problem occurred, which makes them hard to use within a pipeline when we need to capture problems. Does this make it bad-quality software? Nope! 🙂 It would be great if you appended to your post what your readers could do to contribute to good-quality software in bioinformatics (besides, I disagree with singling out bioinformatics; software is software wherever it is applied). I am a strong believer in code review prior to paper review. This makes people angry when I say it, but the truth is that most published software has a life expectancy of 2 years (the 2 years spent by a grad student or a postdoc working on it). There is unfortunately no training for open-source research; there is only training in hide-hide-hide-to-not-get-scooped …. damn it, I am scooped!
So: should the code used in a paper be reviewed before the paper itself is reviewed?
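The exit-code point in the comment above can be illustrated with a minimal sketch. It assumes a hypothetical pipeline runner written in Python; `run_step` is an illustrative name, and the two commands are stand-ins built from the Python interpreter itself rather than real samtools invocations. The wrapper can only detect failure if the tool being run returns a non-zero exit status when something goes wrong.

```python
import subprocess
import sys


def run_step(cmd):
    """Run one pipeline step; return its stdout, or raise if it exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in commands (assumptions for illustration, not real tools):
    # one step that succeeds and one that reports an error and exits non-zero.
    good = [sys.executable, "-c", "print('aligned 10 reads')"]
    bad = [sys.executable, "-c", "import sys; sys.exit('cannot open input')"]

    print(run_step(good).strip())
    try:
        run_step(bad)
    except RuntimeError as err:
        print("pipeline stopped:", err)
```

A tool that prints an error message but still exits with status 0 would pass through `run_step` unnoticed, which is why a wrong exit code makes a tool awkward to embed in a pipeline.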

July 13, 2015 at 2:36 pm

pjotrlinux

We just published Toward effective software solutions for big biology, http://www.nature.com/nbt/journal/v33/n7/full/nbt.3240.html

July 13, 2015 at 3:57 pm

GenomeHunter

I think one of the biggest losers are startups founded by (usually poor) graduates. There is no easy way they can afford to pay for all of the licensing fees. I’m not sure how this can be mitigated.

July 15, 2015 at 7:15 am

Jesse Hoff (@hoffsbeefs)

There are models that can make software available for something other than cash. They’re just trickier to implement, especially if legal fees are tight.

One is a milestone fee for development and use. If the software becomes part of a very successful for-pay tool, money goes back to the developer/university.

The vast majority of commercial users would never pay.

July 15, 2015 at 8:01 am

Jon Goldberg

But look at it this way. By using a BSD license you are giving your software to yourself. Chances are you will change jobs over the course of your career. The BSD allows you no-hassle access to YOUR software if you move on from the place where you created it.

July 17, 2015 at 6:11 am

cbouyio

Reblogged this on CBouyio.

July 17, 2015 at 10:11 am

Andreas Prlić

There are a number of scientific software projects that do get re-used and that others build upon. However, one cannot just expect that a dump of the raw source into the public domain is sufficient for that. It requires marketing, communication, and a significant time commitment by the developer(s) to build up and maintain a community.
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002802

July 19, 2015 at 11:53 am

Heng Li (@lh3lh3)

#4 Personally, I made my tools free to commercials because… 1) I don’t have the time (and the type of characteristics) to sell them. I actually appreciate companies that repackage and sell my tools or related services. They made my tools more widely available without my own efforts. 2) There are many tools of similar functionality to mine that are also comparable in multiple metrics. If I commercialized my tools, users would just switch to the alternatives. I have seen this happening to some extent. 3) Many bioinformatics tools are small. Once the paper and the source code are made available, it is not that hard for a company to reimplement the tools with a small team (see #2). It is non-trivial to prevent companies from commercializing your ideas — ironically, the clearer your paper and source code are, the easier the companies can learn your ideas.

#2 I agree bioinformatics software is often built by small teams. That is an observation, a fact. You seem to suggest this is a problem (as you said “more on why this is the case in #2”). I have a different view. I think we can build a tool with a few people because most bioinformatics tools are and should be small and relatively independent. It is actually wrong to assemble a large team to work on a simple task. That will only add unnecessary complications and slow down development. The poor quality of bioinformatics software is unrelated to the size of the development team. It is because most PhDs are not experienced enough to write good code. Writing good code takes a tremendous amount of practice that we often don’t have the time for as PhDs.

July 19, 2015 at 12:20 pm

Lior Pachter

Hi Heng,

Just to be clear, I *agree* with you that it’s good to develop bioinformatics tools in small teams. In addition to the reasons you gave (with which I agree), I’d add that often the difficult part is designing the algorithm, and in my experience algorithm development is also much more fruitful when done in (very) small teams.

Lior

July 19, 2015 at 7:53 pm

Heng Li (@lh3lh3)

Thanks for the clarification.

July 28, 2015 at 12:48 pm

Alex

As a PhD student in the physical sciences, I have heard people say things similar to your comment: “…most PhDs are not experienced enough to write good code. Writing good code takes tremendous exercises we often don’t have the time for as PhDs.”

But writing good code is important, both for the scientific community and the individual (not to mention building “transferable skills”). With this in mind, do you have specific advice for cultivating the discipline and practice of writing “good” code? Or specific types of exercises that lead to such discipline?

July 29, 2015 at 6:10 am

Titus Brown

Alex, there’s a whole community of people that are thinking about this in conjunction with science. I would point you at the Software Carpentry subcommunity in particular; follow @gvwilson on Twitter, for example. We have written papers (http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745) and been trying to discover and adapt good agile practices (for one small example: http://ivory.idyll.org/blog/2015-growing-sustainable-software-development-process.html) that can be applied more generally. Let me know if you have any questions about where to get started ;).

July 26, 2015 at 6:35 am

Pacala

Regarding #4, I wish things were that easy and clear. These days there are a lot of hybrid research institutes which have an academic (i.e. non-profit) side and a commercial one.

Also, the bioinformatics software business is more like a lottery, with only very rare winners. I think that attaching such restrictive licenses to bioinformatics software in academia would only encourage the hype of getting rich from bioinformatics software.
