I am looking for the most efficient algorithm for comparing large set of strings that are representing possible amino acid substitutions at various positions of a 9mer peptide.
As a toy example, the matrix M1
1 2 3
A 1 1 1
B 1 1 0
C 0 1 0
defines the set of three-letter strings which have an A, B, or C at positions 1, 2, and 3 when a “1” exists in the corresponding row in the matrix (set1).
Matrices in which one or more “1”s are replaced by “0” define subsets of set1. For example a set2 comprising the strings “ABA”, “ACA” can be defined by a matrix M2
1 2 3
A 1 0 1
B 0 1 0
C 0 1 0
As a non-mathematician, my straightforward approach was to compute set1 and set2, and then looking for members in set1 that do not have a match in set2.
However, I have become curious whether more efficient ways might exist to achieve this aim. I.e., could it be done by first performing an operation with M1 and M2 to generate a matrix M3 (or or a small set of matrices M3. M4,…) that define(s) the non-overlapping strings?
Note: In cases like the example above, a single 3×3 matrix as M3 is probably not sufficient.