In light of recent revelations about widespread government monitoring of data stored by online service providers, zero-knowledge services are all the rage now.
A zero-knowledge service is one where all data is stored encrypted with a key that is not stored on the server. Encryption and decryption happens entirely on the client side, and the server never sees either plaintext data or the key. As a result, the service provider is unable to decrypt and provide the data to a third party, even if it wanted to.
To give an example: SpiderOak can be viewed as a zero-knowledge version of Dropbox.
As programmers, we rely heavily on, and entrust some of our most sensitive data – our code – to, a particular class of online service providers: code hosting providers (like Bitbucket, Assembla, and so on). I am of course talking about private repositories here – the concept of zero-knowledge does not make sense for public repositories.
My questions are:
-
Are there any technological barriers to creating a zero-knowledge code hosting service? For example, is there something about the network protocols used by popular version control systems like SVN, Mercurial, or Git that would make it difficult (or impossible) to implement a scheme where the data being communicated between the client and the server is encrypted with a key the server does not know?
-
Are there any zero-knowledge code hosting services in existence today?
18
You can encrypt each line separately. If you can afford to leak your file names, approximate line lengths, and the line numbers on which changes occur, you can use something like this:
https://github.com/ysangkok/line-encryptor
As each line is encrypted separately (but with the same key), the uploaded changes will, as usual, involve only the relevant lines.
If that is not convenient enough, you could keep two Git repositories: one with plaintext and one with ciphertext. When you commit in the (local) plaintext repository, a commit hook could take the diff, run it through the line encryptor referenced above, and apply the result to the ciphertext repository. The ciphertext repository's changes would then be committed and uploaded.
The line encryptor above is SCM-agnostic, but it can read unified diff files (of plaintext), encrypt the changes, and apply them to the ciphertext. This makes it usable with any SCM that can generate a unified diff (like Git).
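To illustrate the general idea (this is my own sketch, not the linked tool's actual code): encrypt each line with a keystream derived from the key and a deterministic per-line salt, so identical plaintext lines always produce identical ciphertext lines and a diff of the ciphertext stays aligned with a diff of the plaintext. The cipher here is a toy SHA-256 construction for illustration only – real code should use a vetted cryptographic library.

```python
import base64
import hashlib
import hmac

def _keystream(key: bytes, salt: bytes, length: int) -> bytes:
    # Expand key+salt into a keystream using SHA-256 in counter mode.
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + salt + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def encrypt_line(key: bytes, line: str) -> str:
    data = line.encode("utf-8")
    # Deterministic salt derived from the line itself: identical plaintext
    # lines map to identical ciphertext lines, so unchanged lines stay
    # unchanged in the uploaded diff (this is also why equality of lines leaks).
    salt = hmac.new(key, data, hashlib.sha256).digest()[:8]
    ct = bytes(a ^ b for a, b in zip(data, _keystream(key, salt, len(data))))
    return base64.b64encode(salt + ct).decode("ascii")

def decrypt_line(key: bytes, token: str) -> str:
    raw = base64.b64decode(token)
    salt, ct = raw[:8], raw[8:]
    pt = bytes(a ^ b for a, b in zip(ct, _keystream(key, salt, len(ct))))
    return pt.decode("utf-8")
```

Because the plaintext-to-ciphertext mapping per line is deterministic, unchanged lines produce byte-identical ciphertext, which is exactly what keeps the uploaded diffs small.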
8
I don’t think there are any barriers. Consider SVN: what gets sent to the server for storage is the delta between the previous and current versions of your code – so if you change one line, just that line gets sent to the server. The server then ‘blindly’ stores the delta without inspecting the data itself. If you encrypted the delta and sent that instead, there would be no impact on the server; in fact, you wouldn’t need to modify the server at all.
There are other bits that might matter, such as metadata properties that are not easily encryptable – MIME type, for example – but others could be encrypted, e.g. comments in the history log, as long as you know you have to decrypt them on the client to view them. I’m not sure whether the directory structure would be visible; I think it would not be, due to the way SVN stores directories, but it’s possible I’m wrong. That might not matter to you if the contents themselves are secure, however.
This would mean you couldn’t have a website with the various code-viewing features: no server-side repository browser or log viewer, no code diffs, no online code-review tools.
Something like this already exists, up to a point. Mozy stores your data encrypted with a private key (you can use your own, and they make noises about “if you lose your own key, too bad, we can’t restore your data for you”, but that’s more targeted at the common user). Mozy also stores a history of your files, so you can retrieve previous versions. Where it falls down is that uploads happen on a regular schedule rather than at check-in when you want, and I believe it discards old versions when you run out of storage space. But the concept is there; they could modify their existing system to provide secure source control.
2
I hate to do one of those “this isn’t quite going to answer your question” answers, but I can think of two ready-made solutions that should address these worries.
-
Host a private Git server yourself, and put that server on a VPN to which you give your team members access. All communication to and from the server would be encrypted, and you could of course also encrypt the server’s storage at the OS level.
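As a sketch of the setup (the user, VPN address, and repository path here are placeholders):

```shell
# On the server, reachable only over the VPN:
ssh git@10.8.0.1 'git init --bare /srv/git/project.git'

# On each developer's machine:
git clone git@10.8.0.1:/srv/git/project.git
```

With SSH as the transport, all traffic is encrypted in transit even before the VPN layer; note, though, that unlike a true zero-knowledge service, the server operator can still read the repository contents.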
-
BitSync should do the trick as well. Everything would be encrypted, in a huge network available from anywhere. It might actually be a really good application of all this BitCoin/BitMessage/BitSync technology.
Lastly, the folks over at https://security.stackexchange.com/ might have some more insight.
2
As I understand it, the way `git pull` works is that the server sends you a pack file containing all the objects that you want but do not currently have, and vice versa for `git push`.
I think you couldn’t do it like this directly (because the server has to understand the objects). What you could do instead is have the server work only with a series of encrypted pack files.
To `pull`, you download all the pack files that were added since your last pull, decrypt them, and apply them to your Git repo. To `push`, you first have to `pull`, so that you know the state of the server. If there are no conflicts, you create a pack file with your changes, encrypt it, and upload it.
With this approach, you would end up with a large number of tiny pack files, which would be quite inefficient. To fix that, you could download a series of pack files, decrypt them, combine them into one pack file, encrypt it, and upload it to the server, marked as a replacement for that series.
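The pull/push cycle above can be sketched as follows, with an in-memory list standing in for the server's append-only pack store and a toy XOR keystream standing in for real encryption (both are assumptions of this example; real pack files would come from `git pack-objects`, and a real client would use an authenticated cipher):

```python
import hashlib

server_packs: list[bytes] = []   # append-only encrypted packs on the "server"
last_seen = 0                    # index of the last pack this client applied

def _toy_cipher(key: bytes, nonce: int, data: bytes) -> bytes:
    # XOR with a SHA-256 keystream; symmetric, so it also decrypts.
    # Illustration only -- use a vetted authenticated cipher in practice.
    stream, ctr = b"", 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce.to_bytes(8, "big")
                                 + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def push(key: bytes, pack: bytes) -> None:
    # Encrypt client-side; the server just appends the ciphertext blindly.
    nonce = len(server_packs)
    server_packs.append(_toy_cipher(key, nonce, pack))

def pull(key: bytes) -> list[bytes]:
    # Fetch and decrypt every pack added since our last pull.
    global last_seen
    new = [_toy_cipher(key, i, p)
           for i, p in enumerate(server_packs[last_seen:], start=last_seen)]
    last_seen = len(server_packs)
    return new
```

After a `pull`, a real client would feed each decrypted pack to `git unpack-objects` (or `git index-pack`) to apply it to the local repository.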