I am implementing a C preprocessor in C…
I have three functions:
- a trigraph-replacing function
- a line-splicing function
- a comment-removing function
However, these functions work separately on files, i.e.:
The first function takes a file and replaces the trigraphs, producing temp-file1 as output.
The second function takes temp-file1 as input, splices the lines, and produces temp-file2.
The third function takes temp-file2 as input, removes the comments, and produces yet another temp-file3.
The main preprocessing tasks are then performed on temp-file3, and a .i file is produced as the final output.
Now, I have 3 options:
- use temp files
- use pipes
- instead of intermediate temp files or pipes, use strings (i.e. temp-file1, 2 and 3 would be three big in-memory strings!)
I have three doubts…
- Option 1 seems less efficient than option 2.
- Option 2 seems perfect, but will I be limited by the size of the unnamed pipe? (I have a single process, i.e. functions 1, 2 & 3 are called one after another.) What if an intermediate output is larger than the pipe's total capacity?
- Option 3: is it efficient and easy compared to the previous two?
Please tell me, which option should I choose?
Option 4 is to refactor the functions so they work on a stream and only process data as needed.
In essence, you call function 3; when it needs more data, it calls function 2, and when that needs more data, it calls function 1, which reads directly from the input file. This transforms the preprocessor from the 4-pass design you have now into a single pass.
Option 5 is concurrent processing: put a producer-consumer queue between stage 1 (producer) and stage 2 (consumer), and another between stage 2 and stage 3, which in turn produces input for the main processing.
Option 5 will allow you to reuse more of your existing code, as you can just replace all `fwrite`s with pushes and all `fread`s with polls (each blocking as the buffer fills up or empties), but you'll need to spawn a thread for each function.
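The bounded queue described above could be sketched as follows, using POSIX threads; all the names here (`byte_queue`, `q_push`, `q_poll`, `q_close`) are invented for illustration:

```c
#include <pthread.h>
#include <string.h>

#define QCAP 4096

typedef struct {
    unsigned char buf[QCAP];
    size_t head, tail, count;
    int closed;                      /* producer finished */
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} byte_queue;

void q_init(byte_queue *q) {
    memset(q, 0, sizeof *q);
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Blocks while the queue is full -- this replaces fwrite(). */
void q_push(byte_queue *q, unsigned char c) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = c;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Blocks while the queue is empty -- this replaces fread().
 * Returns -1 (like EOF) once the producer has called q_close(). */
int q_poll(byte_queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count == 0) {             /* closed and fully drained */
        pthread_mutex_unlock(&q->lock);
        return -1;
    }
    int c = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return c;
}

void q_close(byte_queue *q) {
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}
```

Each stage runs in its own thread, pushing into the queue ahead of it and polling the one behind it; the blocking push/poll pair is what gives you back-pressure without an unbounded buffer.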
Option 1 is a way of allocating memory (in this case, from the page cache), with the following caveats:
- if your temporary files aren’t on a temporary (in-memory) file system, your data will also be written pointlessly to disk
- it’s possible for other processes to read and maybe modify your temporary files
Option 2 won’t work as stated: the pipe buffer will fill, and your write will block. Pipes are only safe if they’re being read & written concurrently (whether by different processes, different threads, or suitably co-ordinated co-routines).
Option 3 is reasonable. Note that if your three functions can only shorten the file, you could simply re-write a single buffer in-place.
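To illustrate the in-place idea: because comment removal only ever shortens the text, the write pointer can never overtake the read pointer, so one buffer suffices. A minimal sketch (it deliberately ignores string and character literals, which a real pass must skip):

```c
/* Rewrite a NUL-terminated buffer in place, replacing each C comment
 * with a single space. Sketch only: string/char literals not handled. */
void remove_comments_inplace(char *s) {
    char *r = s, *w = s;                       /* read and write pointers */
    while (*r) {
        if (r[0] == '/' && r[1] == '*') {      /* block comment */
            r += 2;
            while (*r && !(r[0] == '*' && r[1] == '/'))
                r++;
            if (*r) r += 2;                    /* skip the closing * and / */
            *w++ = ' ';                        /* comment becomes one space */
        } else if (r[0] == '/' && r[1] == '/') { /* line comment */
            while (*r && *r != '\n')
                r++;                           /* keep the newline itself */
            *w++ = ' ';
        } else {
            *w++ = *r++;                       /* ordinary character */
        }
    }
    *w = '\0';
}
```

Replacing each comment with a single space matches what the standard's translation phases require, and the same read/write-pointer pattern works for line splicing and trigraph replacement, since both also only shorten the text.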
All three options have serious drawbacks.
- As indicated by @Useless in his answer, temporary files have the drawback of doing needless disk access with the risk of external entities modifying the files.
- Both options 2 and 3 limit the size of the files you can process: option 2 is limited by the internal buffer of the pipe, and option 3 by the amount of free memory you have.
I would advise considering a fourth option:
You have listed four stages in the processing:
- trigraph processing
- line splicing
- comment removal
- main preprocessing
In option 4, each stage calls the preceding stage's function to obtain characters that have been processed up to that point.
So, the main preprocessing function requests characters from the comment removal function.
The comment removal function in turn requests characters from the line splicing function. If those characters indicate the start of a comment, more characters are requested until the entire comment has been seen. Those characters are discarded and a single space is returned to the caller. Characters outside comments are returned as-is.
The line splicing and trigraph processing functions work similarly, with the trigraph function being the only one that reads a file.
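The chain described above could be sketched like this, with the main preprocessing loop simply calling `comment_getc()` until EOF. All names are invented for illustration, and the sketch deliberately ignores string and character literals (inside which comments and trigraph-like sequences must be left alone):

```c
#include <stdio.h>

static FILE *src;               /* the input file, set by the caller */

/* -- Stage 1: read from the file, translating trigraphs. -- */
static int tg_buf[2];           /* tiny pushback stack (2 is enough) */
static int tg_n = 0;

static int  tg_get(void)     { return tg_n ? tg_buf[--tg_n] : getc(src); }
static void tg_unget(int c)  { tg_buf[tg_n++] = c; }

static int trigraph_getc(void) {
    int c = tg_get();
    if (c != '?') return c;
    int c2 = tg_get();
    if (c2 != '?') { tg_unget(c2); return '?'; }
    int c3 = tg_get();
    switch (c3) {
    case '=':  return '#';   case '/':  return '\\';
    case '(':  return '[';   case ')':  return ']';
    case '<':  return '{';   case '>':  return '}';
    case '!':  return '|';   case '\'': return '^';
    case '-':  return '~';
    default:                 /* not a trigraph: emit one '?' only */
        tg_unget(c3); tg_unget('?');
        return '?';
    }
}

/* -- Stage 2: pull from stage 1, deleting backslash-newline pairs. -- */
static int splice_buf = -2;     /* -2 = no pushed-back character */

static int splice_getc(void) {
    int c;
    if (splice_buf != -2) { c = splice_buf; splice_buf = -2; }
    else c = trigraph_getc();
    while (c == '\\') {
        int c2 = trigraph_getc();
        if (c2 == '\n') { c = trigraph_getc(); continue; }
        splice_buf = c2;        /* not a splice: push back, keep '\' */
        break;
    }
    return c;
}

/* -- Stage 3: pull from stage 2, replacing each comment with a space. -- */
static int com_buf = -2;

static int comment_getc(void) {
    if (com_buf != -2) { int c = com_buf; com_buf = -2; return c; }
    int c = splice_getc();
    if (c != '/') return c;
    int c2 = splice_getc();
    if (c2 == '*') {            /* block comment: eat up to closing */
        int prev = 0;
        while ((c = splice_getc()) != EOF) {
            if (prev == '*' && c == '/') break;
            prev = c;
        }
        return ' ';
    }
    if (c2 == '/') {            /* line comment: eat up to newline */
        while ((c = splice_getc()) != EOF && c != '\n')
            ;
        com_buf = c;            /* keep the newline itself */
        return ' ';
    }
    com_buf = c2;               /* plain '/': push back what followed */
    return '/';
}
```

Because the trigraph stage sits below the splicing stage, a `??/` at the end of a line correctly becomes a backslash and then splices with the newline, in the same order as the standard's translation phases.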