I’m trying to compare two text files and compute how many lines were added and removed, basically what git diff --stat does. Bonus points for not having to store the entire file contents in memory.
The approach I currently have in mind is:
- read each line of the old file
- compute a hash (probably MD5 or SHA-1) for each line
- store the hashes in a set
- do the same for each line in the new file
- every hash from the old file set that’s missing in the new file set was removed
- every hash from the new file set that’s missing in the old file set was added
I’ll probably want to exclude empty and whitespace-only lines. There is a small issue with duplicated lines: a set records each hash only once. This can be solved either by additionally storing how often each hash appears, or by comparing the number of lines in the old and new file and adjusting the added or removed count so that the numbers add up.
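The steps above, with the "store how often a hash appears" fix for duplicates, could be sketched roughly like this. HashLineStat and its methods are hypothetical names made up for illustration; they are not part of SVNKit. The sketch assumes SHA-1 over UTF-8 bytes and keeps one signed counter per hash: +1 for each occurrence in the old file, -1 for each occurrence in the new one.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper for illustration only.
class HashLineStat {

    // Hashes one line with SHA-1 and returns the digest as a Base64 string,
    // so that it can be used as a map key.
    static String hash(String line) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(line.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns {added, removed}. Empty and whitespace-only lines are skipped.
    static int[] count(Stream<String> oldLines, Stream<String> newLines) {
        Map<String, Integer> balance = new HashMap<>();
        // +1 for every occurrence in the old file ...
        oldLines.filter(l -> !l.trim().isEmpty())
                .forEach(l -> balance.merge(hash(l), 1, Integer::sum));
        // ... and -1 for every occurrence in the new file.
        newLines.filter(l -> !l.trim().isEmpty())
                .forEach(l -> balance.merge(hash(l), -1, Integer::sum));

        int added = 0;
        int removed = 0;
        for (int n : balance.values()) {
            if (n > 0) {
                removed += n;   // more occurrences in the old file
            } else {
                added -= n;     // more occurrences in the new file
            }
        }
        return new int[] {added, removed};
    }
}
```

With real files, the two streams could come from Files.lines(path), which reads lazily, so only the map of hashes is held in memory. Note that this ignores line order: a line that merely moved counts as unchanged, which is not what git diff --stat reports.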
Do you see room for improvements or a better approach?
Edit: I’m currently using Java and the SVNKit library.
7
Since this is something diff tools do all the time, why not use a tried-and-tested diff library to do the work efficiently instead of rolling your own code?
Searching for “diff library” or “diff tools” together with the name of your language should turn up some easier options.
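To show what an order-aware diff reports, as opposed to a plain set comparison, here is a minimal sketch based on the longest common subsequence (LCS) of the two files. It is for illustration only and is no substitute for a real library, which would use a more efficient algorithm. LcsLineStat is a hypothetical name, and the sketch assumes both files fit in memory as lists of lines.

```java
import java.util.List;

// Hypothetical helper for illustration only.
class LcsLineStat {

    // Returns {added, removed}, where every line outside the longest common
    // subsequence of the two files counts as added or removed.
    static int[] count(List<String> oldLines, List<String> newLines) {
        int m = newLines.size();
        int[] prev = new int[m + 1];  // LCS lengths for the previous old line
        int[] curr = new int[m + 1];  // LCS lengths for the current old line

        for (String oldLine : oldLines) {
            for (int j = 1; j <= m; j++) {
                if (oldLine.equals(newLines.get(j - 1))) {
                    curr[j] = prev[j - 1] + 1;
                } else {
                    curr[j] = Math.max(prev[j], curr[j - 1]);
                }
            }
            // Swap the rows; the stale row is fully overwritten next round.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }

        int lcs = prev[m];
        return new int[] {m - lcs, oldLines.size() - lcs};
    }
}
```

For the old lines a, b, c and the new lines b, a, c this gives one added and one removed line, whereas a set comparison would report no change at all.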
8
There is a diffstat command (usually in the diffstat package in most Linux distros) that can help you achieve this:
svn diff | diffstat
This outputs more or less the same as git diff --stat.
If this is not an option for you, a good alternative may be to parse the output of svn diff: count the lines starting with + and -, keeping in mind that for each affected file there will also be header lines starting with +++ and ---, which should not be counted.
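A rough Java sketch of that parsing idea follows. UnifiedDiffStat is a hypothetical name. Instead of testing for the +++ and --- prefixes directly, it only counts lines that appear inside a hunk, that is, after an @@ header. That way a removed line whose own text starts with "--" is still counted. It assumes each file section begins with an "Index: " line, as in svn diff output, and it does not treat svn property-change sections specially.

```java
// Hypothetical helper for illustration only.
class UnifiedDiffStat {

    // Returns {added, removed} for unified diff text given line by line.
    static int[] count(Iterable<String> diffLines) {
        int added = 0;
        int removed = 0;
        boolean inHunk = false;

        for (String line : diffLines) {
            if (line.startsWith("@@")) {
                inHunk = true;    // hunk header: content lines follow
            } else if (line.startsWith("Index: ")) {
                inHunk = false;   // new file section: its ---/+++ headers follow
            } else if (inHunk && line.startsWith("+")) {
                added++;
            } else if (inHunk && line.startsWith("-")) {
                removed++;
            }
            // Context lines (leading space) and "\ No newline ..." are ignored.
        }
        return new int[] {added, removed};
    }
}
```

The lines could be fed from the output of an svn diff process, read through a BufferedReader, so the diff never has to be held in memory as a whole.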
0
Why not do this:
- read each line of the old file
- there’s no 2
- store the lines in a set
- do the same for each line in the new file
- every line from the old file set that’s missing in the new file set was removed
- every line from the new file set that’s missing in the old file set was added
I don’t see any need to calculate hashes. If you’re going to compare things, you might as well compare the actual lines.
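In Java, the list above might look something like this. SetLineStat is a hypothetical name, and the sketch assumes the lines have already been read into lists.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only.
class SetLineStat {

    // Returns {added, removed}, counting distinct lines only.
    static int[] count(List<String> oldLines, List<String> newLines) {
        Set<String> oldSet = new HashSet<>(oldLines);
        Set<String> newSet = new HashSet<>(newLines);

        // Lines present in the old file but missing from the new one.
        Set<String> removed = new HashSet<>(oldSet);
        removed.removeAll(newSet);

        // Lines present in the new file but missing from the old one.
        Set<String> added = new HashSet<>(newSet);
        added.removeAll(oldSet);

        return new int[] {added.size(), removed.size()};
    }
}
```

Two caveats: the sets hold every distinct line of both files in memory, and because a set collapses duplicates, deleting one of two identical lines goes unnoticed.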