
Six years ago I received an email from a colleague in the mathematics department at UC Berkeley asking me whether he should participate in a study that involved “collecting DNA from the brightest minds in the fields of theoretical physics and mathematics.” I later learned that the codename for the study was “Project Einstein”, an initiative of entrepreneur Jonathan Rothberg with the goal of finding the genetic basis for “math genius”. After replying to my colleague I received an inquiry from another professor in the department, and then another and another… All were clearly flattered that they had been selected as one of the “brightest minds”, and curious to understand the genetic secret of their brilliance.

I counseled my colleagues not to participate in this ill-advised genome-wide association study. The phenotype was ill-defined and in any case the study would be underpowered (only 400 “geniuses” were solicited), but I believe many of them sent in their samples. As far as I know their DNA now languishes in one of Jonathan Rothberg’s freezers. No result has ever emerged from “Project Einstein”, and I’d pretty much forgotten about the ego-driven inquiries I had received years ago. Then, last week, I remembered them when reading a series of blog posts and associated commentary on evolutionary biology by some of the most distinguished mathematicians in the world.

1. Sir Timothy Gowers is blogging about evolutionary biology?

It turns out that mathematicians such as Timothy Gowers and Terence Tao are hosting discussions about evolutionary biology (see On the recently removed paper from the New York Journal of Mathematics, Has an uncomfortable truth been suppressed, Additional thoughts on the Ted Hill paper) because some mathematician wrote a paper titled “An Evolutionary Theory for the Variability Hypothesis”, and an ensuing publication kerfuffle has the mathematics community up in arms. I’ll get to that in a moment, but first I want to focus on the scientific discourse in these elite math blogs. If you scroll to the bottom of the blog posts you’ll see hundreds of comments, many written by eminent mathematicians who are engaged in pseudoscientific speculation littered with sexist tropes. The number of inane comments is astonishing. For example, in a comment on Timothy Gowers’ blog, Gabriel Nivasch, a lecturer at Ariel University, writes

“It’s also ironic that what causes so much controversy is not humans having descended from apes, which since Darwin people sort-of managed to swallow, but rather the relatively minor issue of differences between the sexes.”

This person’s understanding of the theory of evolution is where the Victorian public was at in England ca. 1871:

[Image: editorial cartoon depicting Charles Darwin as an ape (1871)]

In mathematics, just a year later in 1872, Karl Weierstrass published what at the time was considered another monstrosity, one that threw the entire mathematics community into disarray. The result was just as counterintuitive for mathematics as Darwin’s theory of evolution was for biology. Weierstrass had constructed a function that is uniformly continuous on the real line, but not differentiable on any interval:

f(x) = \sum_{n=0}^{\infty} \left( \frac{1}{2} \right)^n \cos(11^n \pi x).

Not only does this construction remain as valid today as it was back then, but lots of mathematics has been developed in its wake. What is certain is that if one doesn’t understand the first thing about Weierstrass’ construction, e.g. one doesn’t know what a derivative is, one won’t be able to contribute meaningfully to modern research in analysis. With that in mind consider the level of ignorance of someone who does not even understand the notion of common ancestor in evolutionary biology, and who presumes that biologists have been idle and have learned nothing during the last 150 years. Imagine the hubris of mathematicians spewing incoherent theories about sexual selection when they literally don’t know anything about human genetics or evolutionary biology, and haven’t read any of the relevant scientific literature about the subject they are rambling about. You don’t have to imagine. Just go and read the Tao and Gowers blogs and the hundreds of comments they have accrued over the past few days.

2. Hijacking a journal

To understand what is going on requires an introduction to Igor Rivin, a professor of mathematics at Temple University and, of relevance in this mathematics matter, an editor of  the New York Journal of Mathematics (NYJM). Last year Rivin invited the author of a paper on the variability hypothesis to submit his work to NYJM. He solicited two reviews and published it in the journal. For a mathematics paper such a process is standard practice at NYJM,  but in this case the facts point to Igor Rivin hijacking the editorial process to advance a sexist agenda. To wit:

  • The paper in question, “An Evolutionary Theory for the Variability Hypothesis” is not a mathematics or biology paper but rather a sexist opinion piece. As such it was not suitable for publication in any mathematics or biology journal, let alone in the NYJM which is a venue for publication of pure mathematics.
  • Editor Igor Rivin did not understand the topic and therefore had no business soliciting or handling review of the paper.
  • The “reviewers” of the paper were not experts in the relevant mathematics or biology.

To elaborate on these points I begin with a brief history of the variability hypothesis. Its origin is Darwin’s 1871 book “The Descent of Man, and Selection in Relation to Sex”, which was ostensibly the beginning of the study of sexual selection. However as explained in Stephanie Shields’ excellent review, while the variability hypothesis started out as a hypothesis about variance in physical and intellectual traits, at the turn of the 20th century it morphed to a specific statement about sex differences in intelligence. I will not, in this blog post, attempt to review the entire field of sexual selection nor will I discuss in detail the breadth of work on the variability hypothesis. But there are three important points to glean from the Shields review: 1. The variability hypothesis is about intellectual differences between men and women and in fact this is what “An evolutionary theory for the variability hypothesis” tries really hard to get across. Specifically, that the best mathematicians are males because of biology. 2. There has been dispute for over a century about the extent of differences, should they even exist, and 3. Naïve attempts at modeling sexual selection are seriously flawed and completely unrealistic. For example naïve models that assume the same genetic mechanism produces both high IQ and mental deficits are ignoring ample evidence to the contrary.

Insofar as modeling of sexual selection is concerned, there was already statistical work in the area by Karl Pearson in 1895 (see “Note on regression and inheritance in the case of two parents”). In the paper Pearson explicitly considers the sex-specific variance of traits and the relationship of said variance to heritability. However as with much of population genetics, it was Ronald Fisher who significantly advanced the theory of sexual selection, first in the 1930s (Fisher’s principle) and then later in important work from 1958 on what is now referred to as Darwin-Fisher theory (see, e.g. Kirkpatrick, Price and Arnold 1990). Amazingly, despite including 51 citations in the final arXiv version of “An Evolutionary Theory for the Variability Hypothesis”, there isn’t a single reference to prior work in the area. I believe the author was completely unaware of the 150 years of work by biologists, statisticians, and mathematical biologists in the field.

What is cited in “An Evolutionary Theory for the Variability Hypothesis”? There is an inordinate amount of cherry picking of quotes from papers to bolster the message the author is intent on getting across: that there are sex-differences in variance of intelligence (whatever that means), specifically males are more variable. The arXiv posting has undergone eight revisions, and somewhere among these revisions there is even a brief cameo by Lawrence Summers and a regurgitation of his infamous sexist remarks. One of the thorough papers reviewing evidence for such claims is “The science of sex differences in science and mathematics” by Halpern et al. 2007. The author cherry picks a quote from the abstract of that paper, namely that “the reasons why males are often more variable remain elusive.” and follows it with a question posed by statistician Howard Wainer that implicitly makes a claim: “Why was our genetic structure built to yield greater variation among males than females?” An actual reading of the Halpern et al. paper reveals that the excess of males in the top tail of the distribution of quantitative reasoning has dramatically decreased during the last few decades, an observation that cannot be explained by genetics. Furthermore, females have a greater variability in reading and writing than males. They point out that these findings “run counter to the usual conclusion that males are more variable in all cognitive ability domains”. The author of “An Evolutionary Theory for the Variability Hypothesis” conveniently omits this from a very short section titled “Primary Analyses Inconsistent with the Greater Male Variability Hypothesis.” This is serious amateur time.

One of the commenters on Terence Tao’s blog explained that the mathematical theory in “An Evolutionary Theory for the Variability Hypothesis” is “obviously true”, and explained its premise for the layman:

It’s assumed that women only pick the “best” – according to some quantity X percent of men as partners where X is (much) smaller than 50, let’s assume. On the contrary, men are OK to date women from the best Y percent where Y is above 50 or at least greater than X.

Let’s go with this for a second, but think about how this premise would have to change to be consistent with results for reading and writing (where variance is higher in females). Then we must go with the following premise for everything to work out:

It’s assumed that men only pick the “best” – according to some quantity X percent of women as partners where X is (much) smaller than 50, let’s assume. On the contrary, women are OK to date men from the best Y percent where Y is above 50 or at least greater than X.

Perhaps I should write this up (citing only studies on reading and writing) and send it to Igor Rivin, editor at the New York Journal of Mathematics, as my explanation for my greater variability hypothesis?

Actually, I hope that will not be possible. Igor Rivin should be immediately removed from the editorial board of the New York Journal of Mathematics. I looked up Rivin’s credentials in terms of handling a paper in mathematical biology. Rivin has an impressive publication list, mostly in geometry but also a handful of publications in other areas. He, and separately Mary Rees, are known for showing that the number of simple closed geodesics of length at most L grows polynomially in L (this result was the beginning of some of the impressive results of Maryam Mirzakhani who went much further and subsequently won the Fields Medal for her work). Nowhere among Rivin’s publications, or in many of his talks which are online, or in his extensive online writings (on Twitter, Facebook etc.) is there any evidence that he has a shred of knowledge about evolutionary biology. The fact that he accepted a paper that is completely untethered from the field in which it purports to make an advance is further evidence of his ignorance.

Ignorance is one thing but hijacking a journal for a sexist agenda is another. Last year I encountered a Facebook thread on which Rivin had commented in response to a BuzzFeed article titled A Former Student Says UC Berkeley’s Star Philosophy Professor Groped Her and Watched Porn at Work. It discussed a lawsuit alleging that John Searle had sexually harassed, assaulted and retaliated against a former student and employee. While working for Searle the student was paid $1,000 a month with an additional $3,000 for being his assistant. On the Facebook thread Igor Rivin wrote

[Image: screenshot of Igor Rivin’s Facebook comment]

Here is an editor of the NYJM suggesting that a student should have effectively known that if she was paid $36K/year for work as an assistant of a professor (not a high salary for such work), she ought to expect sexual harassment and sexual assault as part of her job. Her LinkedIn profile (which he linked to) showed her to have worked a summer in litigation. So he was essentially saying that this victim prostituted herself with the intent of benefiting financially via suing John Searle. Below is, thankfully, a quick and stern rebuke from a professor of mathematics at Indiana University:

[Image: screenshot of the reply to Rivin’s comment]

I mention this because it shows that Igor Rivin has a documented history of misogyny. Thus his acceptance, to a journal in pure mathematics, of a paper providing a “theory” for “higher general intelligence” in males, in an area he knows nothing about, is nothing other than hijacking the editorial process of the journal to further a sexist agenda.

How did he actually do it? He solicited a paper that had been rejected elsewhere, and sent it out for review to two reviewers who turned it around in 3 weeks. I mentioned above that the “reviewers” of the paper were not experts in the relevant mathematics or biology. This is clear from an examination of the version of the paper that the NYJM accepted. The 51 references were reduced to 11 (one of them is to the author’s preprint). None of the remaining 10 references cite any relevant prior work in evolutionary biology on sexual selection. The fundamental flaws of the paper remain unaddressed. The entire content of the reviews was presumably something along the lines of “please tone down some of the blatant sexism in the paper by removing 40 gratuitous references”. In defending the three week turnaround Rivin wrote (on Gowers’ blog) “Three weeks: I assume you have read the paper, if so, you will have found that it is quite short and does not require a huge amount of background.” Since when does a mathematician judge the complexity of reviewing a paper by its length? I took a look at Rivin’s publications; many of them are very short. Consider for example “On geometry of convex ideal polyhedra in hyperbolic 3-space”. The paper is 5 pages with 3 references. It was received 15 October 1990 and in revised form 27 January 1992. Also excuse me, but if one thinks that a mathematical biology paper “does not require a huge amount of background” then one simply doesn’t know any mathematical biology.

3. Time for mathematicians to wet their paws

The irony of mathematicians who believe they are in the high end tail of some ill-specified distribution of intelligence demonstrating en masse that they are idiots is not lost on those of us who actually work in mathematics and biology. Gian-Carlo Rota’s ghost can be heard screaming from Vigevano “The lack of real contact between mathematics and biology is either a tragedy, a scandal, or a challenge, it is hard to decide which!!” I’ve spent the past 15 years of my career focusing on Rota’s call to address the challenge of making more contacts between mathematics and biology. The two cultures are sometimes far apart but the potential for both fields, if there is real contact, is tremendous. Not only can mathematics lead to breakthroughs in biology, biology can also lead to new theorems in mathematics. In response to incoherent rambling about genetics on Gowers’ blog, Noah Snyder, a math professor at Indiana University gave sage advice:

I really wish you wouldn’t do this. A bunch of mathematicians speculating about stuff they know nothing about is not a good way to get to the truth. If you really want to do some modeling of evolutionary biology, then find some experts to collaborate or at least spend a year learning some background.

What he is saying is די קאַץ האָט ליב פֿיש אָבער זי װיל ניט די פֿיס אײַננעצן (the cat likes fish but she doesn’t want to wet her paws). If you’re a mathematician who is interested in questions of evolutionary biology, great! But first you must get your paws wet. If you refuse to do so then you can do real harm. It might be tempting to imagine that mathematics is divorced from reality and has no impact or influence on the world, but nothing could be farther from the truth. Mathematics matters. In the case discussed in this blog post, the underlying subtext is pervasive sexism and misogyny in the mathematics profession, and if this sham paper on the variability hypothesis had gotten the stamp of approval of a journal as respected as NYJM, real harm could have been done to women in mathematics and to women who in the future may have chosen to study mathematics. It’s no different than the case of Andrew Wakefield’s paper in The Lancet implying a link between vaccinations and autism. By the time of the retraction (twelve years after publication of the article, in 2010), the paper had significantly damaged public health, and even today its effects, namely death as a result of reduced vaccination, continue to be felt. It’s not good enough to say:

“Once the rockets are up,
who cares where they come down?
That’s not my department,”
says Wernher von Braun.

From Wernher von Braun by Tom Lehrer.

Some anti-Semitism is justified

Whenever you interview fat people, you feel bad, because you know you’re not going to hire them

Japan should be bombed for dragging its feet on supporting the Human Genome Project

All our social policies are based on the fact that [Africans] intelligence is the same as ours – whereas all the testing says not really

I think having all these women around makes it more fun for the men but they’re probably less effective

I’m not a racist in a conventional way

There is a biochemical link between exposure to sunlight and sexual urges.. that’s why you have Latin lovers

[The] historic curse of the Irish.. is not alcohol, it’s not stupidity.. it’s ignorance

People say it would be terrible if we made all girls pretty. I think [doing so by genetic selection] would be great

By choice [Rosalind Franklin] did not emphasize her feminine qualities.. There was never lipstick to contrast with her straight black hair, while at the age of thirty-one her dresses showed all the imagination of English blue-stocking adolescents. So it was quite easy to imagine her the product of an unsatisfied mother who unduly stressed the desirability of professional careers that could save bright girls from marriages to dull men.. Clearly Rosy had to go or be put in her place. The former was obviously preferable because given her belligerent moods, it would be very difficult for Maurice [Wilkins] to maintain a dominant position that would allow him to think unhindered about DNA.. The thought could not be avoided that the best home for a feminist was another person’s lab

The one aspect of the Jewish brain that is not first class is that Jews are said to be bad in thinking in three dimensions.. it is true

Women are supposedly bad at three dimensions

[Rosalind Franklin] couldn’t think in three dimensions very well

[Rosalind Franklin] had Aspergers

People ask about [Rosalind Franklin] and I always say ‘autism’

[Francis Crick] may have been a bit autistic

I think now we’re in a terrible situation where we should pay the rich people to have children.. if we don’t encourage procreation of wealthier citizens, IQ levels will most definitely fall.

Men are a bit strange and their strangest quality is their ability to understand mathematics

[Rosalind] Franklin couldn’t do maths

Indians in [my] experience [are] servile.. because of selection under the caste system

Women at Oxford and Cambridge are better than Harvard and Yale because they know their job is to look pretty and get a rich husband

People who have to deal with black employees find [that they are equal] not true

[As a female scientist] you won’t be taken seriously if you have children

Fat people are more sexual

East Asian students [tend] to be conformist, because of selection for conformity in ancient Chinese society

[Linus Pauling] was probably always half-insane

Anyone who would hire an ecologist is out of his mind

[Rosalind Franklin] was a loser

The wider your face, the more likely you are [to be violent].. Senator Jim Webb has the broadest face I’ve ever seen on any man

We already accept that most couples don’t want a Down child. You would have to be crazy to say you wanted one, because that child has no future.

Disabled individuals are genetic losers

[With IVF] all hell will break loose, politically and morally, all over the world

If we knew our son would develop schizophrenia, we wouldn’t have had him

My former colleagues are pinkos and shits

We should perform genome-wide association studies of women who have given up their children for adoption in order to find the ‘loveless gene’

[X University]- it used to be such a wonderful place. And then they started admitting women!

Catholics are more likely to forgive than Jews

If you could find the gene which determines sexuality and a woman decides she doesn’t want a homosexual child, well, let her

 


 

I’m thrilled to announce that I will be moving to Caltech next year where I will be professor of computational biology!

Some people have asked me why I’m moving. First and foremost, we (my family) feel it is the right move for us, for a variety of reasons that I won’t get into here. For me personally, Caltech represents a unique, special, and extraordinary opportunity because it is an institution that fosters an environment facilitating research and teaching that, inasmuch as possible, is unencumbered by the minutiae of academia. In particular, Caltech is unintimidated by disciplinary boundaries, and enables a culture that I’ve yearned for my whole career. It doesn’t throw hundreds of millions of dollars at a football team (although the basketball team is doing pretty well). Its priorities are aligned with mine.

I’m leaving behind Berkeley, a university I started working at 17 years ago as a visiting assistant professor. I’ll miss Berkeley. I still remember the January 1999 phone call from Prof. Tsit Yuen Lam, announcing my appointment. I was honored to have been invited to conduct research and to teach at one of the world’s great institutions. Berkeley was, and still is, distinguished by its mission of providing world-class affordable public education. I can’t think of any university in the world that has done as well in pursuing this noble goal. Consider, for example, that UC Berkeley has almost as many Pell Grant recipients as all eight Ivy League schools combined. But with time, as I was allowed to drop the prefixes in my title, I found myself increasingly aware of the structure, organization and financing of the university. Two numbers that I learned have stuck in my mind: today, state funding comprises only 13% of the budget (likely even lower next year), less than half of what it was when I arrived. Meanwhile, tuition has increased by more than a factor of three over the same period. The squeeze has harmed the institution not just because of reductions in resources (though there have been many), but also because of the strain placed on the morale and mission of the university. Over time I started to question whether its world-class education was sustainable, and lamented that its affordability was becoming a myth. Over the past two years I’ve become increasingly aware that the reality of the university is at odds with my values. I’m sad for the University of California and for the citizens who are being harmed by the blows it is taking, and very much wish that the state will protect and nurture its education treasure. But I will be rooting for it from the sidelines.

I can’t wait to start at Caltech, and look forward to the next phase of my career!

My doppelgänger, Charlie Eppes, who developed algebraic statistics for computational biology at “CalSci” (Caltech).

The Journal Impact Factor (JIF) was first proposed by Eugene Garfield of Institute for Scientific Information (ISI) fame in 1955. It is a journal-specific yearly citation measure, defined to be the average number of citations per paper of the papers published in the preceding two years. Obsession with the impact factor in the face of widespread recognition of its shortcomings as a tool for judging the value of science is an unfortunate example of “the tragedy of the commons”.
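Written out, the definition above is usually stated as follows. The symbols are shorthand introduced here for illustration, not notation from Garfield or ISI: for a given journal and year y, let C_y(t) be the number of citations received in year y by items the journal published in year t, and let N_t be the number of citable items it published in year t. Then

\mathrm{JIF}_y = \frac{C_y(y-1) + C_y(y-2)}{N_{y-1} + N_{y-2}}.

As a purely hypothetical example, a journal whose 200 papers from the two preceding years were cited 500 times this year would have a JIF of 500/200 = 2.5.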

Leaving aside for a moment the flaws of the JIF, one may wonder whether journals do in fact have any impact? By “impact” one might imagine something along the lines of the simple definition in the Merriam-Webster Dictionary: “to have a strong and often bad effect on (something or someone)” and as an object for the impact one could study the researchers who publish, the scientific community as a whole, or the papers themselves. On the question of impact on papers, common sense suggests that publishing in a high profile journal helps a paper succeed and there is pseudoscience to support that case. However there is little in the way of direct measurement. Twitter to the rescue.

At the end of last year my twitter account was approaching 5,000 followers. Inspired by others, I found myself reflecting on this “milestone” and in anticipation of the event, I started to ponder the scientific utility of amassing such a large number of followers. There is, of course, a lot of work being done on natural language processing of twitter feeds, but it struck me that with 5,000 followers I was in a position to use twitter for proactive experimentation rather than just passive mining. Impact factors, followers, and twitter… it was just the right mix for a little experiment…

In my early tweeting days I encountered a minor technical issue with links to papers: it was unclear to me whether I should use link shorteners (and if so which service?) or include direct links to articles in my tweets. I initially thought that using link shorteners would save me characters but I quickly discovered that this was not the case. Eventually, following advice from fellow twitterati, I began tweeting articles only with direct links to the journal websites. Last year, when twitter launched free analytics for all registered users, I started occasionally examining the stats for article tweets, and I began to notice quantitatively what I had always suspected intuitively: tweets of Cell, Nature and Science (CNS) articles were being circulated much more widely than those of other journals. Having used bit.ly, the natural question to ask was: how do tweets of journal articles with the journal names compare to tweets with anonymized links?

Starting in August of 2015, I began occasionally tweeting articles about 5 minutes apart, using the exact same text (the article title or brief description) but doing it once with the article linked via the journal website so that the journal name was displayed in the link and once with a bit.ly link that revealed nothing about the journal source. Twitter analytics allowed me to see, for each tweet, a number of (highly correlated) tweet statistics, and I settled on measuring the number of clicks on the link embedded in the tweet. By switching the order of named/anonymized tweets I figured I could control for a temporal effect in tweet appearance, e.g. it seemed likely that users would click on the most recent links on their feed, resulting in more views/clicks etc. for the later of two tweets identical except for link type. Ideally this control would have been performed by A/B testing but that was not a possibility (see Supplementary Material and Notes). I did my tweeting manually, generally waiting a few weeks between batches of tweets so that nobody would catch on to what I was doing (and thereby ruin the experiment). I was eventually caught, forcing me to end the experiment, but not before I squeezed in enough tests to achieve a significant p-value for something.

I hypothesized that twitter users will click on articles when, and only when, the titles or topics reflect research of interest to them. Thus, I expected not to find a difference in analytics between tweets made with journal names as opposed to bit.ly links. Strikingly, tweets of articles from Cell, Nature and Science journals (CNS) all resulted in higher clicks on the journal title rather than the anonymized link (p-value 0.0078). The average effect was a ratio of 2.166 between clicks on links with the journal name in comparison to clicks on bit.ly links. I would say that this number is the real journal impact factor of what are now called the “glamour journals” (I’ve reported it to three decimal digits to be consistent with the practice of most journals in advertising their JIFs). To avoid confusion with the standard JIF, I call my measured impact factor the RIF (relative impact factor).

[Image: link clicks for journal-named links vs. bit.ly links, Cell, Nature and Science journal articles]
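The post does not say which test produced the p-value or how many tweet pairs went into it, so the following is only a sketch of one way such numbers could be computed. It assumes the RIF is the average of per-article click ratios (journal link over bit.ly link) and that significance is judged with a one-sided exact sign test; the function names and the click counts are hypothetical. Under that assumption, seven pairs that all favour the journal link would give 1/128, roughly 0.0078.

```python
from math import comb


def sign_test_p_value(wins: int, n: int) -> float:
    """One-sided exact sign test.

    Probability that at least `wins` of `n` tweet pairs favour the
    journal-named link if each pair were a fair coin flip.
    """
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def relative_impact_factor(pairs):
    """Average of per-article ratios: journal-link clicks / bit.ly clicks."""
    ratios = [journal / bitly for journal, bitly in pairs]
    return sum(ratios) / len(ratios)


# Hypothetical (journal_clicks, bitly_clicks) pairs, NOT the experiment's data.
example_pairs = [(10, 5), (9, 3), (8, 4)]
print(relative_impact_factor(example_pairs))

# Assumed scenario: 7 of 7 pairs favour the journal link.
print(sign_test_p_value(wins=7, n=7))
```

A sign test only uses the direction of each paired difference, which suits a handful of noisy click counts; a paired test on the ratios themselves would be another reasonable choice.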

One possible objection to the results reported above is that perhaps the RIF reflects an aversion to clicking on bit.ly links, rather than a preference for clicking on (glamour) journal links. I decided to test that by performing the same test (journal link vs. bit.ly link) with PLoS One articles:

[Image: link clicks for journal-named links vs. bit.ly links, PLoS One articles]

Strikingly, in three out of the four cases tested users displayed an aversion to clicking on PLoS One links. Does this mean that publishing in PLoS One is career suicide? Certainly not (I note that I have published PLoS One papers that I am very proud of, e.g. Disordered Microbial Communities in Asthmatic Airways), but the PLoS One RIF of 0.877 that I measured (average ratio of journal:bit.ly clicks, as explained above) is certainly not very encouraging for those who hope for science to be journal name blind. It also suggests that the RIF of glamour journals does not reflect an aversion to clicking on bit.ly links, but rather an affinity for.. what else to call it but.. glamour.

Academics frequently complain that administrators are at fault for driving researchers to emphasize JIFs, but at the recent Gaming Metrics meeting I attended, UC Davis University Librarian MacKenzie Smith pointed out something which my little experiment confirms: “It’s you!”

Supplementary Material and Notes

The journal Nature Communications is not obviously a “glamour journal”, however I included it in that category because the journal link name began nature.com/… Removing the Nature Communications tweet from the glamour analysis increases the glamour journal RIF to 2.264.

The ideal platform for my experiment is an A/B testing setup, and as my former coauthor Dmitry Ryaboy, head of the experimentation team at twitter, explains in a blog post, twitter does perform such testing on users for internal purposes. However I could not perform A/B testing directly from my account, hence the implementation of the design described above.

I tried to tweet the journal/bit.ly tweets exactly 5 minutes apart, but once or twice I got distracted reading nonsense on twitter and was delayed by a bit. Perhaps if I’d been more diligent (and been better at dragging out the experiment) I’d have gotten more and better data. I am comforted by the fact that my sample size was >1.

Twitter analytics provided multiple measures, e.g. number of retweets, impressions, total engagements etc., but I settled on link clicks because that data type gave the best results for the argument I wanted to make. The table with the full dataset is available for download from here (or in pdf). The full list of tweets is here.

So you’re an academic and you’ve written some bioinformatics software. You heard that:

1. Somebody will build on your code.

Nope. Ok, maybe not never but almost certainly not. There are many reasons for this. The primary reason in my view is that most bioinformatics software is of very poor quality (more on why this is the case in #2). Who wants to read junk software, let alone try to edit it or build on it? Most bioinformatics software is also targeted at specific applications. Biologists who use application specific software are typically not interested in developing or improving software because methods development is not their main interest and software development is not their best skill. In the computational biology areas I have worked in during the past 20 years (and I have reviewed/tested/examined/used hundreds or even thousands of programs) I can count the software programs that have been extended or developed by someone other than the author on a single hand. Software that has been built on/extended is typically of the framework kind (e.g. SAMtools being a notable example) but even then development of code by anyone other than the author is rare. For example, for the FSA alignment project we used HMMoC, a convenient compiler for HMMs, but has anyone ever built on the codebase? Doesn’t look like it. You may have been told by your PI that your software will take on a life of its own, like Linux, but the data suggests that is simply not going to happen. No, Gnu is Not Unix and your spliced aligner is not the Linux kernel. Most likely you alone will be the only user of your software, so at least put in some comments, because otherwise the first time you have to fix your own bug you won’t remember what you were doing in the code, and that is just sad.

2. You should have assembled a team to build your software.

Nope. Although most corporate software these days is built by large teams working collaboratively, scientific software is different. I agree with James Taylor, who in the anatomy of successful computational biology software paper stated that “A lot of traditional software engineering is about how to build software effectively with large teams, whereas the way most scientific software is developed is (and should be) different. Scientific software is often developed by one or a handful of people.” In fact, I knew you were a graduate student because most bioinformatics software is written singlehandedly by graduate students (occasionally by postdocs). This is actually a problem (although not your fault!). Students such as yourself graduate, move on to other projects and labs, and stop maintaining (see #5), let alone developing, their code. Many PIs insist on “owning” software their students wrote, hoping that new graduate students in their lab will develop the projects of students who have graduated. But new students are reluctant to work on projects of others because in academia improvement of existing work garners much less credit than new work. After all, isn’t that why you were writing new software in the first place? I absolve you of your solitude, and encourage you to think about how you will create the right incentive structure for yourself to improve your software over a period of time that transcends your Ph.D. degree.

3. If you choose the right license more people will use and build on your program.

Nope. People have tried all sorts of licenses but the evidence suggests the success of software (as measured by usage, or development of the software by the computational biology community) is not correlated with any particular license. One of the most widely used software suites in bioinformatics (if not the most widely used) is the UCSC genome browser and its associated tools. The software is not free, in that even though it is free for academic, non-profit and personal use, it is sold commercially. It would be difficult to argue that this has impacted its use, or retarded its development outside of UCSC. To the contrary, it is almost inconceivable that anyone working in genetics, genomics or bioinformatics has not used the UCSC browser (on a regular basis). In fact, I have, during my entire career, heard of only a single person who claims not to use the browser; this person is apparently boycotting it due to the license. As for development of the software, it has almost certainly been hacked/modified/developed by many academics and companies since its initial release (e.g. even within my own group). In anatomy of successful computational biology software published in Nature Biotechnology two years ago, a list of “software for the ages” consists of programs that utilize a wide variety of licenses, including Boost, BSD, and GPL/LGPL. If there is any pattern it is that the most common are GPL/LGPL, although I suspect that if one looks at bioinformatics software as a whole those licenses are also the most common in failed software. The key to successful software, it appears, is for it to be useful and usable. Worry more about that and less about the license, because ultimately helping biologists and addressing problems in biomedicine might be more ethical than hoisting the “right” software license flag.

4. Making your software free for commercial use shows you are not against companies.

Nope. The opposite is true. If you make your software free for commercial use, you are effectively creating a subsidy for companies, one that is funded by your university / your grants. You are a corporate hero! Congratulations! You have found a loophole for transferring scarce public money to the private sector. If you’ve licensed your software with BSD you’ve added another subsidy: a company using your software doesn’t have any reason to share their work with the academic community. There are two reasons why you might want to reconsider offering such subsidies. First, by denying yourself potential profits from sale of your software to industry, you are definitively removing any incentive for future development/maintenance of the software by yourself or future graduate students. Most bioinformatics software, when sold commercially, costs a few thousand dollars. This is a rounding error for companies developing cancer or other drugs at the cost of a billion dollars per drug and a tractable proposition even for startups, yet the money will make a real difference to you three years out from your Ph.D. when you’re earning a postdoc salary. A voice from the future tells you that you’ll appreciate the money, and it will help you remember that you really ought to fix that bug reported on GitHub two months ago. You will be part of the global improvement of bioinformatics software. And there is another reason to sell your software to companies: help your university incentivize more and better development of bioinformatics software. At most universities a majority of the royalties from software sales go to the institution (at UC Berkeley, where I work, it’s 2/3). Most schools, especially public universities, are struggling these days and have been for some time. Help them out in return for their investment in you; you’ll help generate more bioinformatics hires, and increase appreciation for your field.
In other words, although it is not always practical or necessary, when possible, please sell your software commercially.

5. You should maintain your software indefinitely.

Nope. Someday you will die. Before that you will get a job, or not. Plan for your software to have a limited shelf-life, and act accordingly.

6. Your “stable URL” can exist forever.

Nope. When I started out as a computational biologist in the late 1990s I worked on genome alignment. At the time I was excited about Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. This was a framework for specifying bioinformatics related dynamic programming algorithms, such as the Needleman-Wunsch or Smith-Waterman algorithms. The authors wrote that “A stable URL for Dynamite documentation, help and information is http://www.sanger.ac.uk/~birney/dynamite/”. Of course the URL is long gone, and by no fault of the authors. The website hosting model of the late 1990s is long extinct. To his credit, Ewan now hosts the Dynamite code on GitHub, following a welcome trend that is likely to extend the life of bioinformatics programs in the future. Will GitHub be around forever? We’ll see. But more importantly, software becomes extinct (or ought to) for reasons other than just 404 errors. For example, returning to sequence alignment, the ClustalW program of 1994 was surpassed in accuracy and speed by many other multiple alignment programs developed in the 2000s. Yet people kept using ClustalW anyway, perhaps because it felt like a “safe bet” with its many citations (eventually in 2011 ClustalW was updated to Clustal Omega). The lag in improving ClustalW resulted in a lot of poor alignments being utilized in genomics studies for a decade (no fault of the authors of ClustalW, but harmful nonetheless). I’ve started the habit of retiring my programs, via announcement on my website and PubMed. Please do the same when the time comes.

7. You should make your software “idiot proof”.

Nope. Your users, hopefully biologists (and not other bioinformatics programmers benchmarking your program to show that they beat it) are not idiots. Listen to them. Back in 2004 Nicolas Bray and I published a webserver for the alignment program MAVID. Users were required to input FASTA files. But can you guess why we had to write a script called checkfasta? (hint: the most popular word processor was and is Microsoft Word). We could have stopped there and laughed at our users, but after much feedback we realized the real issue was that FASTA was not necessarily the appropriate input to begin with. Our users wanted to be able to directly input Genbank accessions for alignment, and eventually Nicolas Bray wrote the perl scripts to allow them to do that (the feature lives on here). The take home message for you is that you should deal with user requests wisely, and your time will be needed not only to fix bugs but to address use cases and requested features you never thought of in the first place. Be prepared to treat your users respectfully, and you and your software will benefit enormously.
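To make the checkfasta anecdote concrete, here is a minimal sketch in Python of the kind of sanity check such a script might perform. It is not the original checkfasta (which is not shown here); the function name `check_fasta`, the error messages, and the specific checks are all assumptions made for illustration.

```python
def check_fasta(data: bytes) -> list[str]:
    """Return a list of problems; an empty list means the input looks like FASTA.

    Hypothetical illustration only, not the original MAVID checkfasta script.
    """
    # Word .docx files are ZIP archives (magic bytes "PK"); legacy .doc files
    # start with the OLE2 compound-document signature.
    if data.startswith(b"PK") or data.startswith(b"\xd0\xcf\x11\xe0"):
        return ["input looks like a word-processor document, not plain text"]

    # Smart quotes and other word-processor debris show up as non-ASCII bytes.
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return ["input contains non-ASCII bytes"]

    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["input is empty"]

    problems = []
    if not lines[0].startswith(">"):
        problems.append("first non-blank line is not a '>' header")

    pending = None      # name of the record currently being read
    seen_seq = False    # whether that record has any sequence lines yet
    for ln in lines:
        if ln.startswith(">"):
            if pending is not None and not seen_seq:
                problems.append(f"record {pending!r} has no sequence")
            pending, seen_seq = ln[1:], False
        else:
            seen_seq = True
            # Accept letters plus gap and stop symbols; reject anything else.
            if not all(c.isalpha() or c in "-*" for c in ln):
                problems.append(f"unexpected characters in sequence line {ln!r}")
    if pending is not None and not seen_seq:
        problems.append(f"record {pending!r} has no sequence")
    return problems
```

Calling `check_fasta(open(path, "rb").read())` on an upload returns human-readable messages that can be shown to the user instead of a cryptic failure. As the story above suggests, though, the better fix may be to accept the input users actually want to provide.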

8. You used the right programming language for the task.

Nope. First it was Perl, now it’s Python. First it was MATLAB, now it’s R. First it was C, then C++. First it was multithreading, now it’s Spark. There is always something new coming, which is yet another reason that almost certainly nobody is going to build on your code. By the time someone gets around to having to do something you may have worked on, there will be better ways. Therefore, the main thing is that your software should be written in a way that makes it easy to find and fix bugs, fast, and efficient (in terms of memory usage). If you can do that in Fortran, great. In fact, in some fields not very far from bioinformatics, people do exactly that. My advice: stay away from Fortran (but do adopt some of the best practice advice offered here).

9. You should have read Lior Pachter’s blog post about the myths of bioinformatics software before starting your project.

Nope. Lior Pachter was an author on the Vista paper describing a program for which the source code was available only “upon request”.
