In an open source project, developers with no experience in the codebase sometimes attempt to “learn the code” on their own.
There is a recurring trap I’ve seen, where new developers try to:
understand everything before doing anything.
These tend to be developers who have only ever worked on their own projects, or at least on smaller codebases, where this may work out OK.
My impression is that they try to read the code a bit like it’s a book, hoping that somewhere near the end they will have an “aha” moment where they understand it and can start writing new code and being productive.
(I’m simplifying a bit here).
This always seems to end badly.
Rather than questioning their approach, they complain the code is confusing and needs more comments.
(Whether or not to include comments is a big topic. For the purposes of this discussion we can assume the code is reasonably well-commented.)
Not to suggest the code is perfect either, but some developers manage to learn it and become productive.
To me the approach is flawed to begin with, but I’m asking because this is such a common assumption/mistake.
When developers have already tried and failed to enter a large codebase by reading over the code:
What are some better alternative approaches to suggest?
Note on the “volunteers for open source” aspect (added for clarification)
It’s come up in replies to this topic that the part about volunteers isn’t important and that any new developer on a large codebase would run into these problems.
While this may be true in some cases, there is a difference: volunteers aren’t employed by the organization, so they can do what they like, how they like. They may do a significant amount of work on their own without asking for guidance, and they are free to ignore all advice too. Even if their work is rejected, for example, they may continue to develop it, start a fork, etc.
This typically isn’t the case for a developer employed to work on closed source software.
19
From my experience diving into large, unfamiliar code bases, I would say the key thing is to understand how the program is split up, and then focus only on the piece you’re changing. This is just as true of codebases I work on every day as it is of codebases I’ve never seen before.
If the program is well-designed, the modules/components/pieces/whatever will have clear boundaries between them, and clear black-box expectations for each other’s behavior, which make it very easy to work within one of them with little to no detailed knowledge of the others. Even programs with questionable design usually have plenty of modularization, but merely fail to make it obvious where the boundaries between modules are (typically through what you might call “excess coupling”). If for some bizarre reason nobody can tell the new guy how the program is organized, he may be able to very slowly piece it together by looking at main() or stepping through things with a debugger, while resisting the temptation to dig too deep into any one function or class.
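As a rough illustration of getting the outline of a codebase without digging into any one function, the sketch below lists a package's submodules and the one-line summary of each public name. It uses Python's standard `json` package purely as a stand-in for "the large codebase"; the `outline` helper is a hypothetical name invented for this example, not something from the question or from any project.

```python
import inspect
import json      # stand-in for "the large, unfamiliar codebase"
import pkgutil


def outline(package):
    """Return (submodule names, {public name: first docstring line}).

    Only the boundaries are inspected: which pieces exist and what each
    public entry point claims to do. No function bodies are read.
    """
    submodules = sorted(m.name for m in pkgutil.iter_modules(package.__path__))
    public = {}
    for name in getattr(package, "__all__", []):
        doc = inspect.getdoc(getattr(package, name)) or ""
        public[name] = doc.splitlines()[0] if doc else ""
    return submodules, public


submodules, public = outline(json)
print("submodules:", submodules)
for name, summary in public.items():
    print(f"{name}: {summary}")
```

The point is the habit rather than the tool: get the list of pieces and their stated contracts first, then pick one piece to read properly.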
The new developer instinct that they must “understand everything before doing anything” is actually true within a single module. For someone relatively new, an appropriate “module” may be a single .cpp file implementing one class, or it might be 5-10 files implementing some thread pool scheduling black magic. It’s only the “cross-cutting changes” affecting several modules that truly need someone familiar with the whole codebase.
Edit: ideasman42 has clarified that “volunteer”ness is relevant to the question because volunteers on open-source projects don’t get tutorials/walkthroughs like employees normally do. To that, I would say: if it’s a remotely decent open-source project, there should be lots of public, browsable discussions about past and present issues, which ought to provide lots of high-level insight into how the code is structured (and why it’s structured that particular way). Platforms like GitHub make this exceptionally straightforward.
2
After writing the original version of this answer (see remark below), I came to realize that instead of analyzing the mental gap present in new developers, I should have explored the single most important thing new developers need to do to close that gap. Here is my assertion.
The best thing new developers in this situation can do is to receive comprehensive training on using every piece and bit of functionality that the software has to offer.
Assuming the new developer has no obvious inadequacy in understanding code, what other inadequacy might that person have? It is the user’s perspective that the new developer is lacking.
Put another way: suppose the new developer needs to somehow cause a specific piece of code inside the library to be executed, no matter what. How does the developer do that? By doing this-and-that in the application, which internally triggers this-and-that piece of code.
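To make the link between a user-level action and the code it triggers concrete, here is a minimal sketch using Python's `sys.settrace`. The toy application (`export_command`, `_serialize`, `_save`) and the `trace_calls` helper are hypothetical names made up for illustration; a real project would more likely use a debugger, logging, or a coverage tool for the same purpose.

```python
import sys


# Hypothetical toy application; all names are illustrative only.
def _serialize(items):
    return ",".join(sorted(items))


def _save(text, store):
    store.append(text)


def export_command(items, store):
    """User-facing action: 'export' the current items."""
    _save(_serialize(items), store)


def trace_calls(action, *args):
    """Run a user-level action and record each Python function it enters."""
    called = []

    def tracer(frame, event, arg):
        if event == "call":
            called.append(frame.f_code.co_name)
        return None  # no per-line tracing needed

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        action(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier tracer
    return called


store = []
calls = trace_calls(export_command, {"b", "a"}, store)
print(calls)
```

Knowing which user action reaches which internal function is exactly the knowledge a user already has implicitly and a code reader lacks.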
This is the current onboarding principle at quite a few leading software companies. In fact, anyone reading “trendy” industry websites should have seen this approach mentioned. There is nothing remarkable or ingenious about it. It’s just hindsight.
Bonus: an analogy.
To help hit the nail on the head, here is one:
How do you learn to write a novel? Do you begin by memorizing the dictionary?
Trying to read and fully understand the source code is somewhat like learning vocabulary with the help of a dictionary. However, without actually using the newly learned vocabulary, the meanings and definitions will probably mean very little to you, and your command of those words will soon fade away.
To truly learn something, one has to learn it from the various perspectives of how people actually use it.
To a lesser extent, being able to execute a single modularized piece of code, taken out of the library, in a test environment is also a bonus. It helps a new developer confirm their understanding of how that piece of code executes.
This need is partially fulfilled by having a unit testing suite. However, it should be pointed out that a new developer will also need a “messy experimentation environment” for testing out both production-ready and novel code. Thus, the new developer will typically need both.
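A minimal sketch of what "both" might look like side by side, assuming a small function has been lifted out of the library for study. The function `wrap_index` and the `scratch` helper are hypothetical and exist only for this example.

```python
import unittest


# Hypothetical piece of library code lifted out for study.
def wrap_index(i, length):
    """Map any integer index onto the range [0, length)."""
    return i % length


class WrapIndexTest(unittest.TestCase):
    """The tidy side: unit tests pin down behaviour already understood."""

    def test_in_range(self):
        self.assertEqual(wrap_index(2, 5), 2)

    def test_negative(self):
        self.assertEqual(wrap_index(-1, 5), 4)


def scratch():
    """The messy side: throw odd inputs at the code and look at the results."""
    return {i: wrap_index(i, 3) for i in (-4, -1, 0, 3, 7)}


print(scratch())
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WrapIndexTest)
result = unittest.TextTestRunner().run(suite)
```

Anything surprising that turns up in the scratch run is a candidate for a new unit test once it is understood.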
Below is my earlier answer (now deprecated), kept here for reference.
Overall view
When discussing this issue, one must make a distinction between amateur usage and production usage of an open-source library.
The distinction does not lie in whether the programmer-user is hired, paid or making revenue.
The distinction lies in whether the programmer-user is intimately aware of the tens or hundreds of different use cases that the library currently supports.
The problem is that if the programmer-user lacks this awareness, the contributed code changes are quite likely to be rejected for failing to handle some of those use cases.
Any failure of even a single use case, caused by any code change, is called a “regression”.
This is discouraging, but it shouldn’t be seen as a signal to stop learning. One must accept the learning curve and press on.
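As a small illustration of how a change that looks harmless can break a use case the contributor was not aware of, here is a sketch with a made-up table of use cases. Everything in it (`USE_CASES`, `title_original`, `title_changed`, `regressions`) is hypothetical and not taken from any real library.

```python
# Hypothetical use cases the library must keep supporting: (input, expected).
USE_CASES = [
    ("hello world", "Hello World"),
    ("", ""),
    ("the API docs", "The API Docs"),  # easy to overlook
]


def title_original(text):
    """Existing behaviour: upper-case only the first letter of each word."""
    return " ".join(w[:1].upper() + w[1:] for w in text.split(" "))


def title_changed(text):
    """A contributor's 'simplification' that passes the obvious case."""
    return text.title()  # also lower-cases the rest of every word


def regressions(func):
    """Return (input, expected, actual) for every use case func breaks."""
    return [(inp, exp, func(inp)) for inp, exp in USE_CASES if func(inp) != exp]


print(regressions(title_original))
print(regressions(title_changed))
```

The changed version passes the case its author had in mind and fails one they did not know existed, which is the situation described above.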
Opportunities to contribute locally-scoped or modular changes
Sometimes, it is possible to identify some locally-scoped improvements in an open-source library: small enough that one does not need the aforementioned broad awareness to begin working on them.
Sometimes, a library is designed (architecturally) in a way that new functionality can be “bolted on” elegantly. If that functionality turns out to have quality issues, the maintainers can trivially take it out again.
When those opportunities arise, it is possible for new programmers to contribute.
Identifying such opportunities is not an easy task. Sometimes, it takes an expert member of the maintainers to do it.
The whole premise of, say, Google Summer of Code (GSoC), as well as similar initiatives from other companies or organizations, is based on having an army of mentors (the expert members among maintainers) to identify such opportunities, and for the financial sponsor (Google) to reimburse both the mentors and the fresh programmers in education to work together.
Opportunities that arise due to a “bullet-tracing” event
This situation is actually the most common one in commercial usage of open-source libraries.
A software company makes use of open-source libraries for production purposes (those that are revenue-generating, or necessary for the company or organization’s day-to-day operations).
One day, a programmer identifies a defect in the library. The programmer has professional knowledge of the trade, and has been steadily gaining knowledge of that library through everyday use.
The programmer conducts an intense effort (in terms of time, money as in wages, technical support contacts, and bug tracking database searches) to identify the root cause of the defect, and to try to come up with an acceptable code fix for that.
Now, because of this intensive collaborative effort, the programmer may be able to make a contribution to this open-source library, despite not being well aware of other unrelated aspects of the library’s functionality or use cases.
Thus, this is an exception to the overall view mentioned in the beginning.
“Handholding”, “onboarding”, and other activities that can bring new developers up to speed.
Handholding refers to project mentors patiently giving (sometimes prescribing, or even spoon-feeding) baby steps to a programmer who is completely new to the library, or to the application domain.
Onboarding refers to giving someone an organizational overview of the library, assuming that:
- The person has some knowledge of the application domain, and is capable of learning more of it through their own effort.
- The person has some knowledge of the software architecture, and is capable of learning more of it.
- The person already has good programming skills and does not need any handholding in that respect.
- The person will be using that library every day, therefore it is expected that learning progress will be fast.
In a commercial setting, only programmers who are eligible for “onboarding” will be given consideration for joining a software project. Programmers who need “handholding” will never be given a chance.
The caveat: most libraries’ developer documentation is written to be adequate for onboarding, but not for handholding.
Examples of handholding documentation are:
- OpenCV Tutorials
- HIPR2 – Image Processing Learning Resources,
though this is not a software library but rather educational material.
Compare that to the onboarding documentation of comparable libraries:
- OpenCV – How to contribute
The reality is that unless your library has become prominently featured in STEM education as part of a national STEM initiative, it is unlikely that many books and beginner tutorials will be developed for it.
- Search for “OpenCV” on Amazon
The same can be said of MATLAB, R, Octave, etc. (Though MATLAB is a proprietary software package, and thus one would not find open-source contribution opportunities there.)
Doctrine (organization of software projects), and indoctrination.
It appears that traditional open-source libraries (those with roots in the 1990s or earlier) do not use handholding or onboarding. Instead, they maintain an authoritative document saying what the library should do.
I call this indoctrination. Neither the library code nor the maintainers are perfect; but everything is moving towards perfection according to the doctrine.
Examples of open-source library “doctrines”:
- zlib
- libtiff
This is jokingly referred to as “RTFM” (*), but amid the hundreds of pages of documentation, most open-source libraries will have a single prominently featured page of inerrancy. Sometimes it is simply called “README” (without a file extension) at the root of the package. This is the doctrine of that library.
It lays out much of the “specification” or “behavior contract” for the library.
(*) For decency, let’s say RTFM is an acronym for “read the founders’ manuscripts.”
A final irrelevant rant – to any online people elsewhere who said college doesn’t seem that critically important.
The economic importance of a proper college education should become apparent, if one manages to read through all of the above.
It saves a fresh graduate from being in need of handholding for an opportunity, period.
Otherwise, the fresh graduate would have trouble finding the first employment.
Mandatory disclaimer: not directly employed by any companies named above.