Is there a reason why the source code of software mentioned in research papers is not released? I understand that research papers are more about the general idea of accomplishing something than implementation details, but I don’t get why they don’t release the code.
For example, this paper ends with:
Results
The human line drawing system is implemented through the Qt framework in C++ using OpenGL, and runs on a 2.00 GHz Intel dual core processor workstation without any additional hardware assistance. We can interactively draw lines while the system synthesizes the new path and texture.
Do they keep the source code closed intentionally because they intend to monetize it, or because of copyright?
Several reasons come to mind.
- Code is too big for article. For a short period of time, interesting projects were short enough to be published with the paper that described them. This can still happen, but many projects of sufficiently large size to be interesting have grown too big to be published with the papers that describe them.
- Public hosts not free or durable. Until recently, cheap, durable, easy to access public hosts were not available.
- Publishing a paper is easier than publishing a project. Some people have time to publish a paper or a project, but not both.
- Incentives tied to role. Many years ago I asked a colleague about product development and patents and got the word that most people there pretty much did one or the other. As with paper writers (think academia) and open source developers, rewards are geared toward one work product or the other.
- Self motivation. The desire to describe ideas or to implement code is not always present in equal parts in the same person. Many of my professors openly admitted that they either never coded very much, or were many years away from having coded fluently. Similarly, many developers barely want to write comments in their code or when they commit to source control.
- Durability of project hosting and work product is also an issue. Who wants to link somewhere that might be gone a few years from now and, as a result, diminish the value of the paper?
- Tradition. Publishers are oriented toward reviewing and publishing papers, but might not be ready to take on the same evaluation for projects.
Also, the traditional view on what is a sensible level of reproducibility varies among fields. A chemist publishing a paper about a new synthesis method is expected to write down enough detail for another chemist to perform the synthesis. She’d not be expected to ship the educts and product to the journal. Readers who want to use/reproduce the paper are expected to buy their own educts and do the synthesis themselves in their lab (though they may ask to come and visit the lab to see how it is done in practice). Neither would a biologist be expected to attach his new transgenic mice to the paper. This view on reproducibility corresponds to e.g. giving a (pseudo-code) description of the algorithm as opposed to shipping the actual implementation.
- Naked code can be shocking. It takes a lot less polishing to proofread a paper-length document than to code inspect, code review, and quality assure a project. I have a lot of code I would be more comfortable telling you about than showing you. Hopefully things are moving forward to a point where we will all write beautiful code, but if your code was rushed, or barely works or doesn’t completely work, you might be more comfortable not sharing the executables or the source.
- Closed source. Not everyone has embraced open source. Many papers are written about work for DoD, commercial projects, or privately funded projects where there are benefits from exposure of the project to the public, but there are still trade secrets or first to market advantages that could be eroded by open sourcing the code or other work products.
- Publish further work based on this code. If the code is not published it may give the author an advantage in publishing followup work. Other competing researchers may need to reimplement the work which may take precious time.
Read Randall LeVeque’s presentation on “Top 10 Reasons to Not Share Your Code (and why you should anyway)” http://faculty.washington.edu/rjl/talks/LeVeque_CSE2011.pdf
He argues compellingly that code is analogous to proofs in Mathematics, and invites us to consider a world where proofs aren’t published, because they are too long, or too ugly, or don’t work in the edge cases, or might be worth money, or someone might steal it…
Basically, if you are doing science, then you should publish your code. Otherwise, you are doing alchemy and you can fly right back to the dark ages and die of plague as far as I’m concerned.
Generally, the programs used to produce the paper’s results are only tools, and only the results matter. So they are not included in the paper, which presents the context, the methodology, the results and a discussion about them.
But results must be reproducible. And then, when the data sources on which the paper is based are publicly available, the programs transforming them into results are generally required too. They are often placed “somewhere” on the Web if it doesn’t raise any patent/copyright issue. Or, at least, the authors must send you the programs if you ask them.
It is not closed source. The software simply hasn’t been published at all.
Short answer:
There are several reasons not to publish the software, but it’s uncommon to publish the software in a closed-source manner.
Long answer:
Closed source means that the software has been published and the source-code has not. But the common case is that neither the software nor the source-code has been published.
In my experience (I work in atmospheric science), authors are very happy if you contact them and ask if you can get their software (including source-code, of course) for doing research. If I’m going to write a paper with a project based on theirs, they will at least get a citation out of it (good!), but probably get a co-authored paper out of it (because of course, they didn’t document their software so that someone can use it without their help). A relatively cheap co-authored paper, so that’s even better.
The real question is:
Why don’t they publish the software?
There are several reasons for this:
- Published software needs documentation. Usually, people don’t like to write documentation.
- Published software may attract users. Users may have questions. This takes time (but see above).
- Published software may require non-trivial maintenance.
- Publishing software requires hosting.
- People may feel embarrassed about the poor quality of their source code.
The list could be made longer. It deserves to be a separate question, over at Academia.SE, not here.
(Note that in my group, we do publish our software — licensed under GPL)
That might sound cynical, but in my experience research papers are not written to be easy to understand or simple to reproduce. Instead, in the research community it is more important to have an article that sounds and looks very scientific. For that reason most authors transform their code into mathematical formulas and try to prove that their algorithm is mathematically correct. Usually the number of pages for such an article is limited, so there is no space left to publish the code. Yet, of course, this would not prevent any author from linking to the complete code with a URL…
One could assume that if code is not published, either the authors want to monetize their findings, or (what I personally think is the case more often) they are afraid that people would see that their research is not as awesome as they claim. Often results only apply to a very limited number of cases.
Also, I have seen several research papers spun off from one simple program/algorithm. If the code were published, it would be difficult to write any further papers on the same topic. So knowledge is held back in order to publish it over time in little slices.
Always keep in mind that at universities, it is not so much the results or the applicability of research that is important, but the number of papers you publish. It’s sad, but true.
Aside from the intent to monetize, I do not see a good reason for leaving the source code out of research papers. There is a small movement starting that proposes supplying the source code as a rule to publishing any research that depends on software in some way, shape, or form. You can read more about it, it’s called the Science Code Manifesto.
The above answers miss a few practical reasons which frequently arise in Computer Graphics (the area in which the paper mentioned by the author was published). Code Release varies greatly between fields in CS – for example in Machine Learning, code is usually published. In Human Computer Interaction, code is almost never published.
I have released quite a bit of code in Computer Graphics, and while I do think authors should release their code, there are many simple, non-conspiracy-theory reasons why they don’t. For example
1) Most Computer Graphics research projects involve collaboration between multiple researchers, often at different institutions, each providing some piece of the puzzle (i.e. algorithms, libraries, etc.). To release working code, all researchers have to agree. This is rarely a simple discussion and usually it is easier to avoid the issue.
2) Often the code for a single paper is embedded in a larger codebase being developed within a lab. That codebase will contain other unpublished work. Separating out the code for a single project is a lot of work, often with no immediate benefit to the people who have to do this work (see incentive below).
3) Universities often have IP rights to the code. Hence, it is necessary to contact an “innovations office” who will make your life endlessly difficult, wanting you to document the “invention” so they can patent it, etc, before you open-source it. In some cases the university can even deny the permission to release source (this varies between institutions, and is greatly complicated by (1) )
4) Lots of Computer Graphics research is done by Corporations. In that case the authors do not own the code either, and have to get permission from Lawyers to release the code. Lawyers have little to no incentive to say yes.
5) There is no incentive to publish code. Most Computer Graphics research code is never used by anyone else. Even if it is, for general-purpose code you usually just get an acknowledgement (worthless in terms of your CV). If you are lucky you will get a citation. Hiring committees and Grant agencies generally don’t care one bit if you released your code. So, time spent prepping code for release is time wasted that could have been spent on another paper. (There are people actively trying to change this in Computer Graphics).
6) There are incentives to not publish code. Code can sometimes turn into startup companies, be licensed to existing companies, etc. This funds future research. We all gotta eat.
It depends. The person writing a paper, or their supervisor, decides what should be done with the source code. Sometimes, people make the project open source.
Sometimes, projects are funded by companies, meaning the code is their property. In those cases, the paper’s author is not allowed to show the code.
It’s usually a matter of page limitations. If the algorithm is exceedingly short, it oftentimes is represented, at least as pseudocode, in the paper. On the other hand, if the printed version of the underlying code is even a handful of pages long, printing the code would leave no room for the meat of the article. A journal article that is ten pages long is a long article.
Not making the source available creates a potential for fraud. Because of this potential, many journals now require that authors submit their source code as supplemental information (which is obtainable from the journal if you have access; a hefty subscription fee may be involved). Other journals require the authors to release their source code to anyone who asks for it. Yet other journals are still in the dark ages; the source code isn’t required for submission and the authors aren’t required to release it.
The easiest thing to do is to ask the authors if they can supply the source code to you. The authors’ email addresses are typically listed in most journal papers nowadays.
My experience as a scientist (5 papers published) is that often times it is not required by the journal to release the code which was used to create the results. That is not saying that journals would not accept the scripts. Many journals allow online supplementary material. Some journals geared towards algorithms and such (e.g. Computers and Geosciences) require you to add the source of an algorithm, but this is more an exception than a rule.
In addition to the culture at the journals, for scientists code is just a means to an end. Many are not professional software developers. Because many regard the code as just a tool to express science, they do not feel the urgency to also publish the code. In addition, polishing your code to the point where it could be published takes a lot of work. A scientist is paid to do science, not write software.
More often than not, the actual programme is just a tool to get to the end, rather than the product in its own right. Giving full details of the source code would be akin to providing a full drawing of the pen used to sign the report, and/or schematics of the PC.
Having said that, especially where peer reviewing is being invited, the source code will be available – although under some form of Non Disclosure Agreement (NDA) – as there is inherently Intellectual Property embodied within the program.
If you are genuinely interested in the code, I suggest @Buttons’ comment is the best advice: Ask them 🙂
A lot depends on the purpose for which the code was written. If it was to demonstrate a point, it may well be that it is not optimised, and therefore not ideal that it get released. If the underlying concepts and methodology are valid, then it should be possible to recreate the outcome of the code from scratch. There may be issues of copyright and ownership as well.
In principle, it is not technically impossible to release the code but the reasons for which it might not be released are varied. There probably isn’t a simple answer to this question for that reason. In specific cases maybe you could ask the researchers concerned.
Incentives matter, and the incentives of researchers are generally to ensure that they can produce a steady stream of papers that incrementally build on each other. Graduate students generally need 3-5 published papers that they can turn into individual chapters of their thesis in order to graduate. Junior faculty need to generate as many publications as they can before their tenure review. For that reason, most academic papers are really paper n in a series. For example, the paper you reference builds on a paper the same group published a year before and discusses the ground the next paper is likely to cover.
Publishing the source code potentially allows another researcher in a different group to produce paper n+1 before the original author does, or at least to produce a paper that covers a significant fraction of the ground that the author was expecting to cover as part of this research stream. If that happens, the graduate student could easily find him or herself spending another 6-12 months in grad school in order to produce enough research output to graduate. The faculty member may end up with one fewer published paper when tenure review time comes around. Both of these are obviously large blows to the careers of the researchers. Add in the fact that academic applications are often part of the research efforts of multiple people within a research group (either directly or because they share certain components) and there is pressure within the research group not to release code that might end up hurting someone that you work with every day.
You often get similar sorts of discussions in fields where gathering raw data is time consuming and highly distributed. In astronomy, for example, a research group may spend years gathering data before they have enough information to publish one paper. But they’ll then use that data to produce a series of papers. Research groups are very reluctant to share more of their data sets than absolutely necessary because it becomes too easy for other groups to free-ride on the time that was invested gathering the data in order to reap the rewards of actually analyzing the data.
Eventually, a lot of this code will get released just like the astronomical data eventually gets released. That often comes when the author reaches the end of that series of papers or when most of the research groups that are working on similar topics have similar engines so releasing the code no longer gives a new researcher a competitive advantage.
It would be ideal for science if the data and code were released more quickly. But that would often harm the scientific researcher, and that is whose incentives matter in this case.
As someone who has done this (on the student side) several times in the past: oftentimes the professors writing the paper never even see the source-code themselves. They’ll have their grad students write the code, and then only ask for the final executable (or even just a confirmation of the result) when it’s complete.
Also, often the code written is not very readable anyways, because the students just hacked it together to get it done, and because (though they’re very bright) grad students with no real-world experience tend not to be the world’s best coders…
Most of the reasons I can think of have already been raised here, but I thought I would add two more that actually happened to me:
The journal has no idea what to do.
For one of the papers I was working on, I decided that I was absolutely, without question going to include the source code (the whole point of the paper was data visualization) and example data to go along with it. So along with the submission I attached Electronic Supplements 1 and 2 – an R script with my code, and a CSV file with the data needed for said R script.
The journal, as it turns out, can only take electronic supplements if they’ve been shoehorned into Word files. After trying for the better part of a day to get the R script in that form, I gave up and decided not to include the code as a supplement. I could have hosted it at my University, but as a graduate student I knew that I was going to lose my account there in ~1 year – open source isn’t of any use if it’s immediately overtaken by link rot.
I ended up hosting it on GitHub and putting a reference to that in the paper, but that was because I really wanted the code to go in. I can see, especially since most people in my field don’t use something like GitHub, just deciding that the effort wasn’t going to be worth the handful of people who would download it, and who could email me anyway if they really want to.
The journal just isn’t interested
I inserted some small details about the code itself into a paper on request from a reviewer, but it’s a clinical journal (read: no one codes), it doesn’t allow electronic supplements, and again, adding the source code would likely have been more trouble than it was worth.
Ironically, if anyone did go looking for the code, it is (or soon will be) open source, but I was already running on the edge of ‘This is growing distractingly technical’ and I decided that the brief, ‘make the reviewer happy’ mention was all I was going to do.
The paper you cited is already 28 pages, and most of the content is about the design decisions that are related to solving the problem (stated in the title).
The code is the final step to validate the design. It is not trivial, but it is not the part that adds value in the results of the paper, especially if you were to consider the space it would take up.
Not every case is the same. Some papers do give source code, or at least pseudo code. Some editors don’t allow it. Some allow it, but because of space, the authors don’t include it. One journal where I published source code formatted it as “figures” and the electronic version has it as image data, even though I submitted it as text.
Many times the implementation (i.e. the software) doesn’t matter, but increasingly the implementation DOES affect the results.
Anytime the implementation matters… the source code should definitely be made available! The more that the results depend on the implementation or computational methods the more important it becomes to post the source code.
I’d like to add a few points on the type of code I deal with as a chemometrician (chemist doing data analysis):
- People who write data analysis code (like I do) are few compared to the people who use that code. “Custom code written in house” does not mean that the authors wrote it – it could be colleagues’ code, so the authors cannot publish it.
- A separate publication of the code may be planned, and the code’s author (or the supervisor) may be concerned that the novelty is lost if the code has been (partially) made public before. Even if the journal where the code is to be published doesn’t object to the code having been publicly available before, the mere concern of the supervisor (or someone in the IP office) can be enough to stop the publication of the code.
- Data analysis code is often tailored to the data. It doesn’t make too much sense without the data. (You may argue that the data should be published anyways, but that is a different question and off topic here.) In any case, at my institute, we archive raw data and data analysis code together with the paper. Default policy is not (yet?) to make them publicly available, but they would certainly be available on request.
- (The traditional view on what is reproducibility in chemistry corresponds rather to a description (possibly pseudo-code) of the algorithm than to shipping the actual source code.)
- Many of my colleagues use interactive tools for their data analysis which do not log the steps of the data analysis. So there is no source code that could be published. The data analysis corresponds less to a programming approach than to a lab approach: you do things and write down what you do and observe in your lab book.