Various combinations of sort, uniq, and diff allow one to search for single duplicated lines across a large codebase. This data is extremely noisy, given the prevalence of standard lines of code across modules (ifs, for loops, match statements, etc.). It would be much more useful to look for several consecutive lines repeated across 2 or more files (in a greedy way, returning the most lines repeated) to detect historical code duplication.
Are there any existing tools/scripts that would do this? The solution would, of course, not know ahead of time the length of the repeated blocks, and would need to work across many files. I would assume this is a solved problem in the compression space: maybe it would be possible to compress the directory and observe which are the largest repeated patterns?
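To make the requirement concrete, here is a rough Python sketch of the behaviour I have in mind. It is not an existing tool: the function name find_repeated_blocks, the min_lines threshold, and the in-memory files mapping (file name to list of lines) are all assumptions made for illustration. It compares whitespace-stripped lines exactly, and it would probably be too slow and memory-hungry for a really large codebase.

```python
from collections import defaultdict
from itertools import combinations


def find_repeated_blocks(files, min_lines=3):
    """Find maximal runs of consecutive lines shared by two different files.

    files: hypothetical mapping of file name -> list of lines.
    Returns (length, (name_a, start_a), (name_b, start_b)) tuples,
    longest first, with 0-based start indices.
    """
    # Strip whitespace so that re-indented copies still match.
    norm = {name: [ln.strip() for ln in lines] for name, lines in files.items()}

    # Index every window of min_lines consecutive lines by its content.
    index = defaultdict(list)
    for name, lines in norm.items():
        for i in range(len(lines) - min_lines + 1):
            window = tuple(lines[i:i + min_lines])
            if any(window):  # ignore windows made only of blank lines
                index[window].append((name, i))

    results = []
    for places in index.values():
        for (fa, ia), (fb, ib) in combinations(places, 2):
            if fa == fb:
                continue  # only report duplication across files
            a, b = norm[fa], norm[fb]
            # If the previous lines also match, a longer block that starts
            # earlier covers this one, so skip it here.
            if ia > 0 and ib > 0 and a[ia - 1] == b[ib - 1]:
                continue
            # Greedily extend the match forward while the lines agree.
            n = min_lines
            while ia + n < len(a) and ib + n < len(b) and a[ia + n] == b[ib + n]:
                n += 1
            results.append((n, (fa, ia), (fb, ib)))
    return sorted(results, key=lambda r: -r[0])
```

Calling it with the contents of every file in the tree (read with splitlines()) would list the longest shared blocks first, which is the output I am hoping an existing tool can produce more efficiently.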
Thank you in advance.