Grep a row and retain only matching rows based on a specific column in bash [closed]
Closed 2 days ago.
How to find cases of duplicated multi-line text in a large codebase
Various combinations of sort, uniq, diff allow one to search for single duplicated lines across a large codebase. This data is extremely noisy, given the prevalence of standard lines of code across modules (ifs, for loops, match statements, etc.). It would be much more useful to look for several consecutive lines repeated across 2 or more files (in a greedy way, returning the most lines repeated) to detect historical code duplication.
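One possible way to approach this, sketched in Python rather than with sort/uniq/diff: index every window of a few consecutive lines, then greedily extend each cross-file match for as long as the following lines stay equal. The function name find_duplicate_blocks, the min_lines parameter, and the choice to ignore leading and trailing whitespace are all assumptions made for illustration; they are not part of the question.

```python
from collections import defaultdict
from itertools import combinations


def find_duplicate_blocks(files, min_lines=3):
    """Find maximal runs of at least min_lines equal consecutive lines in two files.

    files: dict mapping a file name to its list of lines (hypothetical input shape).
    Returns (name_a, start_a, name_b, start_b, length) tuples with 0-based starts,
    longest runs first.
    """
    # Assumption: strip whitespace so re-indented copies still count as duplicates.
    norm = {name: [ln.strip() for ln in lines] for name, lines in files.items()}

    # Index every window of min_lines consecutive lines by its content.
    index = defaultdict(list)
    for name, lines in norm.items():
        for i in range(len(lines) - min_lines + 1):
            index[tuple(lines[i:i + min_lines])].append((name, i))

    results = []
    for locations in index.values():
        for (fa, ia), (fb, ib) in combinations(locations, 2):
            if fa == fb:
                continue  # only report duplication across different files
            a, b = norm[fa], norm[fb]
            # If the preceding lines also match, a run starting earlier covers this one.
            if ia > 0 and ib > 0 and a[ia - 1] == b[ib - 1]:
                continue
            # Greedy extension: grow the run while the next lines are equal.
            length = min_lines
            while (ia + length < len(a) and ib + length < len(b)
                   and a[ia + length] == b[ib + length]):
                length += 1
            if any(a[ia:ia + length]):  # ignore runs made only of blank lines
                results.append((fa, ia, fb, ib, length))
    return sorted(results, key=lambda r: -r[4])
```

To use it on a real codebase, the files dict could be built by reading each source file with splitlines(). Raising min_lines is the main lever for cutting noise from short, common idioms. The pairwise comparison can become slow when the same window occurs in very many files.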
How to extract sequence cluster numbers containing a specific string in the ID from a cluster file in bash?
I have a cluster file of sequences clustered at 100% sequence identity, in which each cluster is denoted by a cluster number followed by the IDs of its sequences. Here’s an example of the file format: