I’m working on a problem involving string matching where I need to compute the similarity scores for each prefix of a string C
against another string S
. The similarity score for a prefix P
of C
and S
is defined as the product of two metrics:
1. Hit Count (H): The number of occurrences of P
in S
.
2. Plagiarism Likelihood (L): The number of times a substring is both a prefix and a suffix of P
.
Given the constraints that both C
and S
can be up to 5*10^6 characters long, I am looking for an efficient way to perform these calculations, and time limit is 3 sec (ideally aiming for a time complexity of O(|S| + |C|)
).
I initially considered using the KMP (Knuth-Morris-Pratt) algorithm for counting the occurrences of each prefix in S
, but I am unsure how to efficiently compute L
for each prefix without excessive recomputation.
Specific Questions:
1. Is there an efficient algorithm or a combination of algorithms that can help achieve this complexity for both computing H
and L
?
2. How can dynamic programming be applied to compute the L
values efficiently for each prefix?
Any insights or suggestions on how to tackle this problem would be greatly appreciated!