The problem
I write a lot of exploratory code in my research. As I go along, I put functionality that I’d like to reuse in a central location. A project might look like this:
./mylib
./exploration
    /experiment_1
    /experiment_2
    /experiment_3
Where each experiment uses some functionality from mylib.
Now I come along and start my fourth experiment. In the process, I may need to change my library in some backwards-incompatible way. Now I can’t re-run my first three experiments without updating their code to be compatible with the newest library.
Note: As of now, I keep mylib in version control, and exploration in version control, using git. This means that all of the experiments are in the same repository. This is done so that a single push or fetch and merge in experiments syncs all of my experiments between computers. I feel that there may be a better way, but that might be for another question…
Possible solutions
- I could bite the bullet and update the old experiments manually whenever I need to run them (bad, tedious, but straightforward).
- I could “vendorize” my library by copying it whenever I make a new experiment. (Bad, bugfixes have to be inserted into each copy).
- Since I keep my library in version control, I could tag points in the library’s history by whatever is required by an experiment. When I want to run experiment n, I’d checkout tag n. (Better, but what if I want to run two experiments simultaneously? It also seems like there should be a way to automatically use a specific version of the library.)
- Whenever I start an experiment, I’ll make a new branch in the library. In each experiment folder, I’ll clone the library repo and checkout the correct branch. (This seems reasonable, though it is perhaps wasteful of space, since I’m duplicating all branches when I clone. Also, I might have a lot of experiments, meaning that there will be lots and lots of branches in my repository, cluttering things unnecessarily.)
Should I reconsider any of these above solutions?
I have also heard about git’s subtrees and submodules, and while they sound like they might be the answer to my problem, I want to get the input of more knowledgeable coders before sinking time into a rabbit-hole.
This is not so much a question of the version control system you are using, but more of your general configuration management strategy. First think about your strategy, then check how you map this to your VCS.
Each version of your library you release into “production” should have a unique version number. You should keep track of which of your “experiments” uses which version of the lib, and which experiments you still have “under maintenance”. This lets you find out for which older versions of your lib you may need “maintenance releases”, and for which you can omit them. The different version numbers can be included in the file name of your lib if that helps you to use them in parallel (whether that is necessary depends on your physical library management / resolving strategy).
Let’s say you have 3 versions lib_v1.0, lib_v2.0 and lib_v3.0, each one used by experiment1, experiment2 and experiment3 respectively. Now, during development of experiment4, you made incompatible changes to lib_v4 and find a bug which affects all former versions. Let’s further assume you immediately fix that bug in V4. Now you have the following alternatives:
- Don’t fix the bug in older versions. For example, if experiment1 is not “in production” any more, there is no need to fix the bug in V1.0. Likewise, if you know for sure that experiment1 is not affected by the bug, and that experiment1 is the only program using your lib, there is also no need to fix the bug in V1.0.
- Upgrade all affected experiments to your current lib_v4. This can become tedious, but with @RobertHarvey’s suggestion of using an Adapter (or of avoiding breaking changes) it may be a feasible solution.
- If upgrading the affected experiments is too much effort, consider porting the bugfix down to your older lib versions (creating lib V1.1, V2.1, V3.1).
Of course, you can mix these strategies: experiment2 and experiment3 may be easily switched to V4, while experiment1 needs to stick to lib V1. Then you will only have to port the bugfix down to V1. That leaves you with lib V1.1 in maintenance and lib V4 in active development, with no need to maintain V2 and V3 any more.
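To make the bookkeeping concrete, here is a minimal Python sketch of tracking which experiment is pinned to which library version. The `Experiment` record and the `versions_needing_maintenance` helper are names made up for this illustration, not part of any tool; the data simply replays the scenario described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    # Hypothetical record: one entry per experiment in your manifest.
    name: str
    lib_version: str
    maintained: bool = True


def versions_needing_maintenance(experiments, current_version):
    """Return older library versions still used by a maintained experiment."""
    return sorted(
        {e.lib_version for e in experiments
         if e.maintained and e.lib_version != current_version}
    )


# Situation when the bug is found: every experiment uses its own version.
before = [
    Experiment("experiment1", "1.0"),
    Experiment("experiment2", "2.0"),
    Experiment("experiment3", "3.0"),
    Experiment("experiment4", "4.0"),
]

# Mixed strategy: experiments 2 and 3 move to V4, while experiment1
# stays on the V1 line and receives the back-ported fix (V1.1).
after = [
    Experiment("experiment1", "1.1"),
    Experiment("experiment2", "4.0"),
    Experiment("experiment3", "4.0"),
    Experiment("experiment4", "4.0"),
]

print(versions_needing_maintenance(before, "4.0"))  # ['1.0', '2.0', '3.0']
print(versions_needing_maintenance(after, "4.0"))   # ['1.1']
```

In practice the manifest could just as well be a text file next to each experiment; the point is only that the mapping is written down somewhere you can query.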
What you should avoid is to have more than one version tree of your lib under “active development”. When you decide to improve an older experiment, either stick with the library version it is currently linked to, or switch to the newest library version for this older experiment.
A remark about version control: this development model maps easily to each VCS which supports tagging and maintenance branches (in other words: basic features supplied by any decent software worth the title “VCS”).
After spending the afternoon reading, it looks like git subtree is what I’m after. In this approach, I keep my library in version control with git, and each experiment goes into a separate repository. When I start an experiment, I pull in the latest version of the library with a git subtree add. Each experiment has its own version of the library. If I want to update an experiment to use a new version of the library, I can make a branch, do a git subtree pull, patch up my experiment code to work with the new interface, and merge back into master.
The great thing about git subtree is that the history of the experiment is tied to the history of the library that I am using. This is incredibly useful in exploratory research where reproducibility is paramount. For example, I might run an experiment with version 1 of a library and get a certain result. Later, after updating the experiment’s code and moving to version 2 of the library, I might re-run the experiment and find, to my surprise, that the result is different. With subtrees, if I have the commit hash that produced my original result, I can restore my experiment to the exact state it was in when I ran it originally, library and all, by simply checking out that commit.
Use an Adapter.
An adapter is a class that converts from one version of an API to another. On one side of the adapter is the original API. On the other side is the API for the newest version of your library.
You could, of course, simply make your library backwards-compatible, retaining the old API calls for the benefit of your existing experiments.
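As a rough illustration of the adapter idea, here is a small Python sketch. Everything in it is hypothetical: `MyLibV2`, its `normalize` method, and the old `norm` call are invented stand-ins for "the newest API" and "the original API", not functions from any real library.

```python
class MyLibV2:
    """Stand-in for the newest library version (hypothetical API)."""

    def normalize(self, values, *, scale=1.0):
        # Assumed V2 behaviour: scale values so they sum to `scale`.
        total = sum(values)
        return [scale * v / total for v in values]


class V1Adapter:
    """Exposes the old (hypothetical) V1 call `norm(values)` on top of V2."""

    def __init__(self, lib):
        self._lib = lib

    def norm(self, values):
        # The V1 call had no scale parameter, so delegate with the default.
        return self._lib.normalize(values, scale=1.0)


# An old experiment keeps calling the V1-style API unchanged.
mylib = V1Adapter(MyLibV2())
print(mylib.norm([1, 1, 2]))  # [0.25, 0.25, 0.5]
```

The old experiment only ever sees the adapter, so the library itself is free to change as long as the adapter is kept up to date.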
I think what you need is somewhere to store and manage versioned artifacts. As you have noticed, this is slightly different from keeping the library under source control. How to do this depends on the language: usually all language communities tend to reinvent this sort of thing. For instance I would use:
- SBT for Scala
- Maven for Java
- Bower for client-side JavaScript or CoffeeScript
- NPM for node JavaScript or CoffeeScript
All of these tools work either with a remote repository or in local mode, where packages are cached somewhere on your filesystem. Bower works with git and tags, but makes local checkouts on each project.
Since you mention Python, the closest equivalent I can think of is Pip, but I am not sure whether Pip allows you to use local packages.
I quite like how http://cocoapods.org/ does it. It can be used publicly or privately.
We have a similar issue where we are developing multiple Objective-C controls that will be used in many applications. We couldn’t make breaking changes, as this would mean possibly breaking a project we know nothing about. This seriously hampers progress / innovation.
So with cocoapods we basically have a repo that lists the names / versions and locations of all of our controls (a list of podspecs). Inside cocoapods we say, for this project use
- control1 – v0.0.1
- control4 – v0.0.7
- control5 – v0.0.2
Then cocoapods will go to each control’s repo and pull out the tag whose name matches the version number specified, e.g. “0.0.1”.
This is working quite well for us being able to use specific versions of multiple libraries in many different projects.
Not sure if cocoapods will support your platform, you might need to build something yourself or just come up with your own process, but the idea works well.
Keep your library in one git repository and your experiments in another. Use git submodules to keep track of versions of the library. It’s actually built for this…
You could keep binaries for each library with a version name in each, e.g.
mylib/thelibname-alpha.dll
mylib/thelibname-beta.dll
Your tests then reference the relevant version. If you need to patch a library, all tests using it will benefit, but other tests will be unaffected.
The reason for doing this is the same as for embedding a version string in any library – you can control precisely which one you are using, and control what is the “active” version using symbolic links if you want. Take a look in /lib on a Linux host and you’ll see exactly this arrangement being used:
$ ls -l /lib/libaudit.so.*
lrwxrwxrwx 1 root root 17 Feb 8 14:24 /lib/libaudit.so.1 -> libaudit.so.1.0.0
-rwxr-xr-x 1 root root 112224 Mar 14 2012 /lib/libaudit.so.1.0.0
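The same arrangement can be sketched in Python for your own library folder. This is only an illustration under assumptions: the file names (`mylib.so.1.0.0` and so on) and the `set_active` helper are placeholders invented here, and it relies on symbolic links being available, as they are on a Linux host.

```python
import os
import tempfile

# Work in a throwaway directory; the file names below are placeholders.
libdir = tempfile.mkdtemp()

# Two versioned copies of the same (hypothetical) library file.
for version in ("1.0.0", "1.1.0"):
    with open(os.path.join(libdir, f"mylib.so.{version}"), "w") as f:
        f.write(f"contents of version {version}\n")


def set_active(libdir, link_name, target_name):
    """Point the link at one versioned file, replacing any older link."""
    link = os.path.join(libdir, link_name)
    tmp = link + ".tmp"
    os.symlink(target_name, tmp)  # relative target, as in the /lib listing
    os.replace(tmp, link)         # swap the new link into place


set_active(libdir, "mylib.so.1", "mylib.so.1.0.0")
set_active(libdir, "mylib.so.1", "mylib.so.1.1.0")  # switch the active version

print(os.readlink(os.path.join(libdir, "mylib.so.1")))  # mylib.so.1.1.0
```

Experiments that must stay on an exact version reference the fully versioned file name; anything that should follow the “active” version references the link.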