The first answer to an old but recently active question linked to a video that explains how Google’s repository works.
One interesting point is that everything is built from source, without relying on prebuilt binaries. This helps avoid issues with dependencies becoming obsolete while still being used by other projects, a problem I have indeed encountered a lot.
How is this technically possible? If I tried the same thing at my company, even allowing for the huge gap in scale between my company’s codebase and Google’s, it wouldn’t work, for three reasons:
- The IDE (Visual Studio) would quickly become unresponsive; it already struggles with small solutions containing, say, 50 projects.
- Any static analysis would be overwhelmed by the size of the whole codebase. For example, code metrics or static checking of code contracts would hardly be possible (code contracts alone would probably take days or weeks).
- With continuous integration, compilation would take a huge amount of time and would overwhelm the build servers as soon as a project with many dependencies was modified, since a large tree of dependent projects would have to be recompiled.
How can a small company circumvent those issues and be able to:
- Use the IDE without being affected by poor performance,
- Compile the code after each commit without overloading the build server, even when a change requires a large part of the codebase to be recompiled?
You are assuming a traditional build process, and Google’s process is anything but traditional. There’s a series of articles in the Engineering Tools blog that explain their process in some detail, elaborating on the 2010 presentation: Tools for Continuous Integration at Google Scale:
- Build in the Cloud: Accessing Source Code
- Build in the Cloud: How the Build System works
- Build in the Cloud: Distributing Build Steps
- Testing at the speed and scale of Google
To summarise, they use a custom distributed build system that allows for a very high degree of parallelism and automation, taking full advantage of their existing infrastructure. It also relies heavily on caching, with a 90% overall cache hit rate.
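To make those two ideas concrete, here is a minimal sketch (in Python, with entirely made-up action names; it is not Google’s actual system) of a build driver that treats compilation as a graph of actions, runs independent actions in parallel, and keys a cache on the content hash of each action’s command and inputs, so an unchanged action is never re-executed:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Toy dependency graph: each "action" has inputs (other actions) and a command.
# All names here are made up for illustration.
ACTIONS = {
    "base.o":  {"deps": [],                "cmd": "compile base.c"},
    "net.o":   {"deps": ["base.o"],        "cmd": "compile net.c"},
    "ui.o":    {"deps": ["base.o"],        "cmd": "compile ui.c"},
    "app.bin": {"deps": ["net.o", "ui.o"], "cmd": "link app"},
}

CACHE = {}  # content hash -> output; a real system would use a shared, persistent store

def action_key(name, dep_outputs):
    """Cache key = hash of the command plus the hashes of everything it consumes."""
    h = hashlib.sha256(ACTIONS[name]["cmd"].encode())
    for out in dep_outputs:
        h.update(out.encode())
    return h.hexdigest()

def run_action(name, dep_outputs):
    key = action_key(name, dep_outputs)
    if key in CACHE:                                  # cache hit: no work at all
        return CACHE[key]
    result = f"output-of({ACTIONS[name]['cmd']})"     # stand-in for the real compile/link
    CACHE[key] = result
    return result

def build(targets):
    """Execute the action graph wave by wave; each wave runs in parallel."""
    done, pending = {}, set()

    def collect(name):
        if name in done or name in pending:
            return
        pending.add(name)
        for dep in ACTIONS[name]["deps"]:
            collect(dep)

    for target in targets:
        collect(target)

    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [n for n in pending
                     if all(d in done for d in ACTIONS[n]["deps"])]
            futures = {n: pool.submit(run_action, n,
                                      [done[d] for d in ACTIONS[n]["deps"]])
                       for n in ready}
            for n, f in futures.items():
                done[n] = f.result()
                pending.remove(n)
    return done

if __name__ == "__main__":
    build(["app.bin"])          # cold build: every action runs
    print(build(["app.bin"]))   # warm build: everything comes from the cache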
But how can you apply all this in your company? The first step is distributing compilation, and for that you’d need:
- A cloud
- A distributed compiler
- A compiler cache
In a gcc development environment, setting up a compile farm is relatively easy: distcc takes care of distribution and ccache takes care of caching, and they work beautifully together. I don’t know of any similar tools for Microsoft’s ecosystem (I’m assuming you are using a Microsoft language, given your choice of IDE), but I do know that MSBuild can build projects in parallel, taking advantage of multi-core CPUs. It’s not really a compile farm, but it’s certainly a step in the right direction.
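For intuition about the distcc half of that pairing: distcc preprocesses each translation unit locally, so the result is self-contained and can be compiled on any machine without access to your headers. Here is a small sketch of that flow in Python, assuming gcc is installed, with the “remote” worker simulated by a plain function call:

```python
import os
import subprocess
import tempfile

def preprocess_locally(source_path: str) -> bytes:
    # gcc -E expands all #includes and macros, so the output no longer
    # depends on local headers and can be compiled anywhere.
    return subprocess.run(["gcc", "-E", source_path],
                          check=True, capture_output=True).stdout

def compile_on_worker(preprocessed: bytes, object_path: str) -> None:
    # In real distcc this runs on another host; here it is just a function call.
    # -x cpp-output tells gcc the stdin stream is already-preprocessed C.
    subprocess.run(["gcc", "-x", "cpp-output", "-c", "-", "-o", object_path],
                   input=preprocessed, check=True)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "hello.c")
        with open(src, "w") as f:
            f.write('#include <stdio.h>\nint main(void){puts("hi");return 0;}\n')
        obj = os.path.join(tmp, "hello.o")
        compile_on_worker(preprocess_locally(src), obj)
        print("compiled", os.path.getsize(obj), "bytes of object code")
```

Roughly speaking, ccache sits in front of the compile step: it hashes the preprocessed text and the compiler flags, and on a hit returns the stored object file instead of compiling at all.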
- You don’t have to build all 2000 projects to use just the ones you need (see the sketch after this list).
- The Go language was specifically designed to alleviate this problem by making compile times very fast; it was one of the reasons Google created it.
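The first bullet is just a graph problem: the set of projects you must build is the transitive closure of the dependencies of the targets you actually need, which is usually a small fraction of the repository. A toy illustration with a hypothetical project graph:

```python
# Hypothetical project graph: project -> projects it depends on.
DEPS = {
    "billing":   ["core", "auth"],
    "auth":      ["core"],
    "core":      [],
    "analytics": ["core"],
    "frontend":  ["billing", "auth"],
    # ... imagine ~2000 entries here
}

def build_closure(targets):
    """Return every project that must be built to produce `targets`."""
    needed, stack = set(), list(targets)
    while stack:
        project = stack.pop()
        if project not in needed:
            needed.add(project)
            stack.extend(DEPS[project])
    return needed

print(sorted(build_closure(["frontend"])))
# ['auth', 'billing', 'core', 'frontend'] -- 'analytics' is never touched
```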
That said, I’d be wary of “just-in-time building,” unless the code that is being deployed company-wide has been verified as part of a (more or less) formal release cycle, and is not just a random nightly build.
Having 5000 developers accessing 2000 projects that are all in a continuous state of flux sounds like a recipe for disaster, and Google hires very smart people, so I’m quite certain that’s not what is actually happening.