My company (let’s call them Acme Technology) has a library of approximately one thousand source files that originally came from its Acme Labs research group, was incubated in a development group for a couple of years, and has more recently been provided to a handful of customers under non-disclosure. Acme is getting ready to release perhaps 75% of the code to the open source community. The other 25% would be released later, but for now, is either not ready for customer use or contains code related to future innovations they need to keep out of the hands of competitors.
The code is presently written with #ifdefs that let the same code base work with the pre-production platforms that will be available to university researchers and a much wider range of commercial customers once it goes open source, while still being available for experimentation, prototyping, and forward-compatibility testing with the future platform. Keeping a single code base is considered essential for the economics (and sanity) of my group, who would have a tough time maintaining two copies in parallel.
Files in our current base look something like this:
> // Copyright 2012 (C) Acme Technology, All Rights Reserved.
> // Very large, often varied and restrictive copyright license in English and French,
> // sometimes also embedded in make files and shell scripts with varied
> // comment styles.
>
>
> ... Usual header stuff...
>
> void initTechnologyLibrary() {
> nuiInterface(on);
> #ifdef UNDER_RESEARCH
> holographicVisualization(on);
> #endif
> }
And we would like to convert them to something like:
> // GPL Copyright (C) Acme Technology Labs 2012, Some rights reserved.
> // Acme appreciates your interest in its technology, please contact [email protected]
> // for technical support, and www.acme.com/emergingTech for updates and RSS feed.
>
> ... Usual header stuff...
>
> void initTechnologyLibrary() {
> nuiInterface(on);
> }
Is there a tool, parse library, or popular script that can replace the copyright and strip out not just #ifdefs, but variations like #if defined(UNDER_RESEARCH), etc.?
The code is presently in Git and would likely be hosted somewhere that uses Git. Would there be a way to safely link repositories together so we can efficiently reintegrate our improvements with the open source versions? Advice about other pitfalls is welcome.
2
It seems like it wouldn’t be too difficult to write a script to parse the preprocessor directives, compare them to a list of defined constants (UNDER_RESEARCH, FUTURE_DEVELOPMENT, etc.) and, if the directive can be evaluated to false given what’s defined, remove everything up to the next #endif.
In Python, I’d do something like:
import os

src_dir = 'src/'
# True marks a switch whose #ifdef block is stripped from the open-source copy
switches = {'UNDER_RESEARCH': True, 'OPEN_SOURCE': False}
new_header = """// GPL Copyright (C) Acme Technology Labs 2012, Some rights reserved.
// Acme appreciates your interest in its technology, please contact [email protected]
// for technical support, and www.acme.com/emergingTech for updates and RSS feed.
"""

for fn in os.listdir(src_dir):
    with open(src_dir + fn, 'r') as infile:
        contents = infile.read().splitlines()
    with open(src_dir + fn + '-open-source', 'w') as outfile:
        in_header = True
        skipping = False
        for line in contents:
            # remove original header (leading blank and // comment lines)
            if in_header and (line.strip() == "" or line.strip().startswith('//')):
                continue
            elif in_header:
                in_header = False
                outfile.write(new_header)
            # skip between ifdef directives (nested #ifdefs are not handled)
            if skipping:
                if line.strip() == "#endif":
                    skipping = False
                continue
            # check
            if line.strip().startswith("#ifdef"):
                # parse #ifdef (maybe should be more elegant)
                # this assumes a form of "#ifdef SWITCH" and nothing else
                if switches.get(line.strip().split()[1]):
                    skipping = True
                    continue
                # checking for other forms of directives is left as an exercise
            # got this far, nothing special - echo the line
            outfile.write(line)
            outfile.write('\n')
I’m sure there are more elegant ways to do it, but this is quick and dirty and seems to get the job done.
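The "other forms of directives" left as an exercise could probably be covered with a regular expression plus a stack to track nesting. What follows is only a minimal sketch under stated assumptions, not a full preprocessor: the name strip_switches and the regexes are hypothetical, it recognises just "#ifdef X", "#if defined(X)" and "#if defined X" with nothing else on the line, and it does not handle #else or #elif, so blocks that use those should be reviewed by hand.

```python
import re

# Hypothetical helper patterns: a single-switch opener, any #if-family line, and #endif.
OPEN_RE = re.compile(
    r'^\s*#\s*(?:ifdef\s+(\w+)|if\s+defined\b\s*\(?\s*(\w+)\s*\)?)\s*$')
ANY_IF_RE = re.compile(r'^\s*#\s*if')   # matches #if, #ifdef and #ifndef
ENDIF_RE = re.compile(r'^\s*#\s*endif\b')


def strip_switches(lines, undefined):
    """Drop blocks guarded by a switch in `undefined`; keep everything else."""
    out = []
    stack = []  # one entry per open #if: True if that block is being stripped
    for line in lines:
        if ANY_IF_RE.match(line):
            m = OPEN_RE.match(line)
            name = (m.group(1) or m.group(2)) if m else None
            strip = name in undefined
            inside_stripped = any(stack)
            stack.append(strip)
            if strip or inside_stripped:
                continue  # drop the directive itself
        elif ENDIF_RE.match(line):
            if stack:
                was_stripped = stack.pop()
                if was_stripped or any(stack):
                    continue  # drop the matching #endif
        elif any(stack):
            continue  # ordinary line inside a stripped block
        out.append(line)
    return out
```

Called as strip_switches(source_lines, {'UNDER_RESEARCH'}), this removes the guarded block together with its #ifdef/#endif pair, and leaves blocks guarded by any other symbol (including their directives) untouched.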
1
I was thinking about passing your code through the preprocessor to only expand macros, thus outputting only the interesting part in the #ifdefs.
Something like this should work:
gcc -E yourfile.c
But:
- You’ll lose all comments. You can use -CC to (kind of) preserve them, but then you’ll still have to strip off the old copyright notice.
- #includes are expanded too, so you’ll end up with a big file containing all the content of the included header files.
- You’ll lose “standard” macros.
There might be a way to limit which macros are expanded; however, my suggestion here is to split things up instead of doing (potentially hazardous) processing on the files. (By the way, how would you plan to maintain them afterwards, e.g. reintroduce code from the open-source version into your closed source?)
That is, try putting the code you want to open-source in external libraries as much as possible, then use them as you would any other library, integrating with other “custom” closed-source libraries.
It might take a bit longer at first to figure out how to restructure things, but it’s definitely the right way to accomplish this.
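If you do want to experiment with the preprocessor route first, it could be scripted roughly as below. This is only a sketch: the function names are hypothetical, and the -P flag (which suppresses gcc's line markers in the output) and the explicit -U/-D flags are my additions to the plain "gcc -E" invocation above. The caveats listed still apply: #includes and macros are expanded in the output.

```python
import subprocess


def gcc_preprocess_command(path, undefine=(), define=()):
    """Build a gcc command line that only runs the preprocessor on `path`."""
    # -E: stop after preprocessing; -CC: keep comments; -P: no line markers
    cmd = ["gcc", "-E", "-CC", "-P"]
    cmd += ["-U" + name for name in undefine]
    cmd += ["-D" + name for name in define]
    cmd.append(path)
    return cmd


def preprocess(path, **kwargs):
    """Run gcc (assumed to be on PATH) and return the preprocessed text."""
    result = subprocess.run(gcc_preprocess_command(path, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, preprocess('yourfile.c', undefine=['UNDER_RESEARCH']) would return the expanded source with the UNDER_RESEARCH blocks left out.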
1
I have a solution, but it will require a little work.
pypreprocessor is a library that provides a pure C-style preprocessor for Python that can also be used as a GPP (General Purpose Pre-Processor) for other types of source code.
Here’s a basic example:
from pypreprocessor import pypreprocessor
pypreprocessor.input = 'input_file.c'
pypreprocessor.output = 'output_file.c'
pypreprocessor.removeMeta = True
pypreprocessor.parse()
The preprocessor is extremely simple. It makes a pass through the source and conditionally comments out source based on what is defined.
Defines can be set either through #define statements in the source or by setting them in the pypreprocessor.defines list.
Setting the input/output parameters allows you to explicitly define which files are being opened/closed, so a single preprocessor can be set up to batch process a large number of files if desired.
With the removeMeta parameter set to True, the preprocessor should automatically strip out any and all preprocessor statements, leaving only the post-processed code.
Note: Usually this wouldn’t need to be set explicitly because Python removes commented code automatically during compilation to bytecode.
I only see one edge case. Because you’re looking to preprocess C source, you may want to set the processor defines explicitly (i.e. through pypreprocessor.defines) and tell it to ignore the #define statements in the source. That should keep it from accidentally removing any constants you may use in your project’s source code. There is currently no parameter to enable this behavior, but it would be trivial to add.
Here’s a trivial example:
import sys
from pypreprocessor import pypreprocessor

# run the script in 'production' mode
if 'commercial' in sys.argv:
    pypreprocessor.defines.append('commercial')
if 'open' in sys.argv:
    pypreprocessor.defines.append('open')
pypreprocessor.removeMeta = True
pypreprocessor.parse()
Then the source:
#ifdef commercial
// Copyright 2012 (C) Acme Technology, All Rights Reserved.
// Very large, often varied and restrictive copyright license in English and French,
// sometimes also embedded in make files and shell scripts with varied
// comment styles.
#endif
#ifdef open
// GPL Copyright (C) Acme Technology Labs 2012, Some rights reserved.
// Acme appreciates your interest in its technology, please contact [email protected]
// for technical support, and www.acme.com/emergingTech for updates and RSS feed.
#endif
Note: Obviously, you’ll need to sort out a way to set the input/output files but that shouldn’t be too difficult.
Disclosure: I am the original author of pypreprocessor.
Aside: I originally wrote it as a solution to the dreaded python 2k/3x maintenance issue. My approach was, do 2 and 3 development in the same source files and just include/exclude the differences using preprocessor directives. Unfortunately, I discovered the hard way that it’s impossible to write a true pure (ie doesn’t require c) preprocessor in python because the lexer flags syntax errors in incompatible code before the preprocessor gets a chance to run. Either way, it’s still useful under a wide range of circumstances including yours.
2
It would probably be a good idea to:
1. Add comment tags like:
> // *COPYRIGHT-BEGIN-TAG*
> // Copyright 2012 (C) Acme Technology, All Rights Reserved.
> // Very large, often varied and restrictive copyright license in English and French,
> // sometimes also embedded in make files and shell scripts with varied
> // comment styles.
> // *COPYRIGHT-ENG-TAG*
> ... Usual header stuff...
>
> void initTechnologyLibrary() {
> nuiInterface(on);
> #ifdef UNDER_RESEARCH
> holographicVisualization(on);
> #endif
> }
2. Write a script for the open-source build that goes through all files and replaces the text between the COPYRIGHT-BEGIN-TAG and COPYRIGHT-ENG-TAG tags.
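A script along those lines might look like the sketch below. The function name replace_copyright is hypothetical, and the tag strings are spelled exactly as in the example above. Because it only looks for the tag text inside a line, it should work regardless of comment style (// in C, # in make files and shell scripts), as long as the replacement lines you pass in already carry the right comment prefix.

```python
BEGIN_TAG = "*COPYRIGHT-BEGIN-TAG*"
END_TAG = "*COPYRIGHT-ENG-TAG*"  # spelled as in the example above


def replace_copyright(lines, new_notice_lines):
    """Replace the lines between the two tag lines, keeping the tag lines."""
    out = []
    inside = False
    for line in lines:
        if BEGIN_TAG in line:
            out.append(line)              # keep the begin tag
            out.extend(new_notice_lines)  # insert the new notice
            inside = True
        elif END_TAG in line:
            out.append(line)              # keep the end tag
            inside = False
        elif not inside:
            out.append(line)              # ordinary line outside the notice
    return out
```

Keeping the tag lines in the output means the same script can be run again later if the notice changes.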
3
I’m not going to show you a tool to convert your codebase; plenty of answers already did that. Rather, I’m answering your comment about how to handle branches for this.
You should have 2 branches:
- Community (let’s call the open source version like this)
- Professional (let’s call the closed source version like this)
The preprocessor directives shouldn’t exist: you have two different versions, and a cleaner codebase overall.
You’re afraid of maintaining two copies in parallel? Don’t worry, you can merge!
If you’re making modifications to the community branch, just merge them into the professional branch. Git handles this really well.
This way, you keep 2 maintained copies of your codebase. And releasing one for open source is easy as pie.
0