Relative Content


How to find cases of duplicated multi-line text in a large codebase

Various combinations of sort, uniq, and diff make it possible to find single duplicated lines across a large codebase. The results are extremely noisy, though, because standard lines of code (ifs, for loops, match statements, etc.) recur in every module. It would be much more useful to look for several consecutive lines repeated across two or more files, matching greedily so that the longest repeated run is returned, in order to detect historical code duplication.
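One way to picture the greedy multi-line search is the minimal Python sketch below. It is an illustration under stated assumptions, not a tuned tool: the function name `find_duplicate_blocks`, the `min_lines` threshold, and the choice to ignore blank lines and leading/trailing whitespace are all hypothetical choices made for this example. It indexes every non-blank line, then for each pair of identical lines in different files extends the match forward as far as both files agree, reporting only runs that cannot be extended backwards.

```python
from collections import defaultdict
from itertools import combinations


def find_duplicate_blocks(files, min_lines=3):
    """Return maximal runs of consecutive lines shared by two different files.

    files: dict mapping a file name to its list of lines (hypothetical input shape).
    Result: list of (name_a, start_a, name_b, start_b, length), where the starts
    are 1-based line numbers and length counts non-blank lines.
    """
    # Assumption: blank lines and surrounding whitespace are ignored, so
    # re-indented or re-spaced copies still match. Original line numbers are kept.
    norm = {
        name: [(i, line.strip()) for i, line in enumerate(lines) if line.strip()]
        for name, lines in files.items()
    }

    # Index each line's text -> every (file, position) where it occurs.
    index = defaultdict(list)
    for name, entries in norm.items():
        for pos, (_, text) in enumerate(entries):
            index[text].append((name, pos))

    results = []
    for positions in index.values():
        for (fa, ia), (fb, ib) in combinations(positions, 2):
            if fa == fb:
                continue  # only cross-file duplication is reported here
            a, b = norm[fa], norm[fb]
            # Only start at the beginning of a run: if the preceding lines
            # also match, a longer run covering this pair starts earlier.
            if ia > 0 and ib > 0 and a[ia - 1][1] == b[ib - 1][1]:
                continue
            # Greedily extend the match while both files agree.
            n = 0
            while ia + n < len(a) and ib + n < len(b) and a[ia + n][1] == b[ib + n][1]:
                n += 1
            if n >= min_lines:
                results.append((fa, a[ia][0] + 1, fb, b[ib][0] + 1, n))
    # Longest runs first.
    return sorted(results, key=lambda r: -r[4])


if __name__ == "__main__":
    demo = {
        "a.py": ["import os", "", "def f(x):", "    y = x + 1", "    return y", "print(1)"],
        "b.py": ["def f(x):", "    y = x + 1", "", "    return y", "print(2)"],
    }
    for block in find_duplicate_blocks(demo):
        print(block)
```

The `min_lines` threshold is what suppresses the single-line noise described above: a lone `else:` or closing brace matches everywhere but never forms a run long enough to report. Very common lines still generate many candidate pairs, so on a large codebase this pairwise approach would likely need a smarter index (for example, hashing windows of `min_lines` lines) to stay fast.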