I’ve seen numerous questions (and answers) concerning the number of binary strings (e.g “10010” containing some binary substring (e.g “00”). I’d like to know if there is a way to generalize this:
Given a length n
and a string s
(may contain any letters A...Z
), how many different strings of length n
, that contain the substring s
at least once, are there?
These kinds of questions are often solvable with combinatorics, but I’d like to find a dynamic-programming solution (and so I posted this question here instead of mathexchange).
Personally, I’m running on cold steam here. Of course, I tried some combinatorial methods:
n = 5
s = CAR
? = unknown character
all possibilities:
CAR??
?CAR?
??CAR
This basically boils down to 26^0*26^2 + 26^1*26^1 + 26^2*26^0 = 3 * 26^2 = 2028
possibilities, which I assume is correct. Consider however the following case:
n = 7
s = CAR
? = unknown character
all possibilities:
CAR????
?CAR???
??CAR??
???CAR?
????CAR
What’s the problem? Well, there’s 3 positions that can produce duplicate results:
CARCAR?
CAR?CAR
?CARCAR
Now the number of possible strings is
(2 * 26^7) + (2 * 26^1 * 26^6) + (2 * 26^2 * 26^5) + (2 * 26^3 * 26^4) - (3 * 26)
I can’t generalize this.
Did I catch anyone’s interest? Any help is appreciated.
1
Acknowledging @Hirle’s solution, I still figured I’d post my own.
I approached this from a little different direction. After some thinking, the simplest program of all held the key:
deterministic finite automaton. This could be solved with a dfa that searched for occurences of a string, as with it I can compute
an adjacency matrix that I can later use. I decided to skip the trivialities and just copied some code off the net for the
deterministic version of Knuth–Morris–Pratt algorithm (the original one if nondeterministic, which doesn’t serve our purposes).
The below code originates from here, with little modifications:
public static int[][] dfa;
// create the DFA from a String
public static void kmp(String pat) {
// build DFA from pattern
int m = pat.length();
dfa = new int[26][m];
dfa[pat.charAt(0) - 'A'][0] = 1;
for (int X = 0, j = 1; j < m; j++) {
for (int c = 0; c < 26; c++)
dfa[c][j] = dfa[c][X]; // Copy mismatch cases.
dfa[pat.charAt(j) - 'A'][j] = j+1; // Set match case.
X = dfa[pat.charAt(j) - 'A'][X]; // Update restart state.
}
}
For the pattern ABC
this generates the following 2D-matrix, or dfa, if you will:
[1, 1, 1]
[0, 2, 0]
[0, 0, 3]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
Here, for example, the first row means that if the character seen is `A’, then depending on our current state we go to the state indicated by the row.
Let’s say the state we currently are in is the initial state (state 0). If we now saw the character ‘A’, the next state would be told to us by the first row’s first element,
which is 1. So now our state is 1. Let’s say we, again, saw the character ‘A’. Our state would remain 1, as the first row’s second element is also 1. Alright, now say we saw
the character ‘B’. Our state would proceed to 2, since in the B-row’s (the second row’s) second element is 2.
In essence, our state would be 3 only if we saw the characters ABC
in succession. The rows D-Z are all zeroes, since if we ever saw a character other than A-C, we must start
the search from the beginning (which is A, in this case).
To generalize, the matrix element [0][1]
tells us the state we should go to, if the character we saw was A
and we were in the state 1. Similarly the matrix element
[6][6]
told us our next state if the character we saw was G
, and we were in the state 6 (in the above matrix we of course don’t have the state 6, since we built it with a string of length 3).
Anyway, now we have all we need to build the adjacency matrix. Without further ado, here’s my code for that:
String s = "ABC"; // The string we built our dfa with, e.g the string we are "searching"
paths = new long[s.length() + 1][s.length() + 1];
for (int k = 0; k < 26; k++)
for (int i = 0; i < dfa[0].length; i++)
paths[i][dfa[k][i]]++;
paths[paths.length - 1][paths.length - 1] = 26;
So if we were building the adjacency matrix from a dfa that was built with the string “ABC”, the adjacency matrix would look like this:
[25, 1, 0, 0]
[24, 1, 1, 0]
[24, 1, 0, 1]
[0 , 0, 0, 26]
Here paths[i][j]
would tell us how many ways there are to get to the state j
from the state i
. For example, there are 25 ways to get to state 0 from state 0, because there is only
one way to get to some other state from state 0. Also, there are 26 ways to result in the last state, if we are in the last state.
If we had built everything up so far with the string “AA”, the adjacency matrix would look like this:
[25, 1, 0]
[25, 0, 1]
[0, 0, 26]
So, what now? We have our oh-so-mighty adjacency matrix that tells us cool stuff, what to do with it? Graph theory comes into play.
Referencing Wolfram Alpha:
the
(u,v)
th element of thek
th power of the adjacency matrix ofG
gives the number of paths of lengthk
between verticesu
andv
.
So what it is saying is, that if we raise our adjacency matrix into, say, 3rd power, then the element paths[0][k]
tells us how many paths of length 3
there are from state 0 to state k
.
I’ll prep a little more for this: If we have a string S
, and with that string we can result in the last state of the dfa, then we consider S
to be OK.
So what we want to know to get the answer to the original problem, is how many OK strings there are in bigger strings of length n
.
This is why we raise the adjacency matrix into the n
th power. Now the element paths[0][k]
, where k
is the length of the string we are searching, should tell us how many OK strings there are in a bigger string of length n
, which just
so happens to be the answer to our problem.
My code to raise a matrix into n
th power:
public static long[][] matrixToPower(long[][] a, int p) {
long[][] b = a;
for (int n = 1; n < p; n++)
a = matrixMultiplication(a, b);
return a;
}
public static long[][] matrixMultiplication(long[][] m1, long[][] m2) {
int len = m1.length;
long[][] mResult = new long[len][len];
for(int i = 0; i < len; i++) {
for(int j = 0; j < len; j++) {
for(int k = 0; k < len; k++) {
mResult[i][j] += (m1[i][k] * m2[k][j]) % M;
mResult[i][j] %= M;
}
}
}
return mResult;
}
I’m taking a modulo of size 10^9 + 7
to get reliable answers for big input.
Full code:
import java.util.*;
public class Program {
public static final int M = 1000000007;
public static int[][] dfa;
public static long[][] paths;
public static void main(String[] args){
Scanner sc = new Scanner(System.in);
int n = sc.nextInt();
String s = sc.next();
kmp(s); // generate dfa
paths = new long[s.length() + 1][s.length() + 1];
for (int k = 0; k < 26; k++)
for (int i = 0; i < dfa[0].length; i++)
paths[i][dfa[k][i]]++;
paths[paths.length - 1][paths.length - 1] = 26;
paths = matrixToPower(paths, n);
System.out.println(paths[0][s.length()]);
}
public static long[][] matrixToPower(long[][] a, int p) {
long[][] b = a;
for (int n = 1; n < p; n++)
a = matrixMultiplication(a, b);
return a;
}
public static long[][] matrixMultiplication(long[][] m1, long[][] m2) {
int len = m1.length;
long[][] mResult = new long[len][len];
for(int i = 0; i < len; i++) {
for(int j = 0; j < len; j++) {
for(int k = 0; k < len; k++) {
mResult[i][j] += (m1[i][k] * m2[k][j]) % M;
mResult[i][j] %= M;
}
}
}
return mResult;
}
// create the DFA from a String
public static void kmp(String pat) {
// build DFA from pattern
int m = pat.length();
dfa = new int[26][m];
dfa[pat.charAt(0) - 'A'][0] = 1;
for (int X = 0, j = 1; j < m; j++) {
for (int c = 0; c < 26; c++)
dfa[c][j] = dfa[c][X]; // Copy mismatch cases.
dfa[pat.charAt(j) - 'A'][j] = j+1; // Set match case.
X = dfa[pat.charAt(j) - 'A'][X]; // Update restart state.
}
}
}
1
[Edit: realized what I previously answered doesn’t work in all cases, so editing to fix it as per https://meta.stackexchange.com/questions/186689/editing-answers ]
First of all I think there is a slight mix-up in your math (the case for n=7 has some math in it for n=10)
(2 * 26^7) + (2 * 26^1 * 26^6) + (2 * 26^2 * 26^5) + (2 * 26^3 * 26^4) - (3 * 26)
should be
(2 * 26^4) + (2 * 26^1 * 26^3) + (26^2 * 26^2) - (3 * 26)
or just plain
(5*26^4)-(3*26)
Anyway, framed as a dynamic programming problem, I think you can look at it like this:
1) Place s
at some index, i
in the resulting string. There are n - length(s) + 1
possibilities.
2) For each possibility, we simply need to count the possibilities for filling in the remaining indexes. As you have noticed, we run into problems with duplicates. This can be avoided by only counting possibilities where s
occurs first at i
. So when we place CAR at i=0
, and get CARCAR?, we count, but when we place CAR at index i=3
we don’t count CARCAR? . If we only count possibilities where s
occurs first at i
we never count duplicates. The part of the resulting string after our s
at i
(call it the suffix) can be filled in in any way. For the part before s
at i
(call it the prefix) we need to take care not to insert occurrences of s
.
2a) What we can do here is count all possibilities of filling in the prefix and then subtract the possibilities where s
is in. Counting this later part comes down to our original problem: getting the “Number of strings (of length n) containing a specific substring” only for a smaller length (now the length of our prefix, instead of the length of the entire string). So now we have a reduction of our problem to a smaller problem and hence a dynamic solution.
This works for strings like CAR, but not for strings that can overlap with themselves (place LOL at index 3 of a string of length 7. Then we can create ?LOLOL?, containing 2 occurrences of LOL, where the first isn’t (completely) in the prefix, because it overlaps).
2b) This can be fixed by considering a slightly broader problem. We don’t count the number of arbitrary strings that we can make containing some substring, but we count the number of strings of a specific form (that might already have characters fixed at some indices) that we can make containing the substring. When we do this, we can do the same thing as before. Again, we place s
at some index, i
in the form string (now there aren’t necessarily n - length(s) + 1
possibilities to place s
, since some indices in our form might have fixed characters on them and s
might not fit everywhere). We still count all possibilities of filling in the prefix. But then not subtract the possibilities where s
is in the prefix, but where s
is in the prefix plus the s
we put at index i
save the last letter. So, when we have ???LOL?, we not only see that we need to subtract because we can create LOLLOL?, but since we check for possibilities to fill in LOL in the string ???LO, we also see that we need to subtract because we can create ?LOLOL?. Now, since we considered a broader problem, that can account for forms that might already have characters fixed at some indices, this again is a reduction of our problem to the same problem of smaller size.
Quick code would look like:
import java.util.*;
public class StringsInStrings {
//use arrays of CustomCharacter as strings/forms, use nulls when a index is not yet filled in
enum CustomCharacter {
A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
};
public static void main(String[] args) {
CustomCharacter[] string = new CustomCharacter[] { null, null, null, null, null, null, null };
CustomCharacter[] substring = new CustomCharacter[] { CustomCharacter.L, CustomCharacter.O, CustomCharacter.L };
System.out.println(countPossibilities(string, substring));
}
static int countPossibilities(CustomCharacter[] form, CustomCharacter[] substring) {
int possibilities = 0;
//put our substring at any index in the form
for (int index = 0; index <= form.length - substring.length; index++) {
//see if the substring fits at that index
if (fits(form, substring, index)) {
//count the possibilities for filling in the prefix
CustomCharacter[] prefix = Arrays.copyOfRange(form, 0, index);
int prefixPossibilities = (int) Math.pow(CustomCharacter.values().length, countNulls(prefix));
//count the possibilities for filling in the suffix
CustomCharacter[] suffix = Arrays.copyOfRange(form, index+substring.length, form.length);
int suffixPossibilities = (int) Math.pow(CustomCharacter.values().length, countNulls(suffix));
//count the possibilities where we fill the prefix such that our index is not the first occurence of the substring anymore
//these need to be subtracted
CustomCharacter[] reducedForm = Arrays.copyOfRange(insert(form,substring,index), 0, index + substring.length - 1);
int invalidPrefixPossibilities = countPossibilities(reducedForm, substring);
possibilities += (prefixPossibilities - invalidPrefixPossibilities) * suffixPossibilities;
}
}
return possibilities;
}
private static boolean fits(CustomCharacter[] form, CustomCharacter[] substring, int index) {
boolean result = true;
for (int subStrIndex = 0; subStrIndex < substring.length; subStrIndex++) {
if (!(form[index + subStrIndex] == null || form[index + subStrIndex] == substring[subStrIndex])) {
result = false;
}
}
return result;
}
private static int countNulls(CustomCharacter[] arr) {
int result = 0;
for (int i = 0; i < arr.length; i++) {
if (arr[i] == null) {
result++;
}
}
return result;
}
private static CustomCharacter[] insert(CustomCharacter[] form, CustomCharacter[] substring, int index) {
CustomCharacter[] result = Arrays.copyOf(form, form.length);
for (int i = 0; i < substring.length; i++) {
result[index + i] = substring[i];
}
return result;
}
}
So yeah, fun problem. I think this is a correct solution, maybe there is an easier one?
[One more edit: version with memoization and BigInteger. Making it more efficient and making use of the efficiency by having it support large examples… as discussed in the comments]
import java.math.BigInteger;
import java.util.*;
public class StringsInStringsCached {
//use arrays of CustomCharacter as strings/forms, use nulls when a index is not yet filled in
enum CustomCharacter {
A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
};
static HashMap<List<CustomCharacter>, BigInteger> cache;
public static void main(String[] args) {
List<CustomCharacter> string = customCharListWithNulls(100);
List<CustomCharacter> substring = customCharListFromString("ABCDABD");
cache = new HashMap<List<CustomCharacter>, BigInteger>();
System.out.println(countPossibilities(string, substring));
}
static BigInteger countPossibilities(List<CustomCharacter> form, List<CustomCharacter> substring) {
if(!cache.containsKey(form)){
BigInteger possibilities = BigInteger.ZERO;
//put our substring at any index in the form
for (int index = 0; index <= form.size() - substring.size(); index++) {
//see if the substring fits at that index
if (fits(form, substring, index)) {
//count the possibilities for filling in the prefix
List<CustomCharacter> prefix = copyOfRange(form, 0, index);
BigInteger.valueOf(CustomCharacter.values().length).pow(countNulls(prefix));
BigInteger.valueOf(countNulls(prefix));
BigInteger prefixPossibilities = BigInteger.valueOf(CustomCharacter.values().length).pow(countNulls(prefix));
//count the possibilities for filling in the suffix
List<CustomCharacter> suffix = copyOfRange(form, index+substring.size(), form.size());
BigInteger suffixPossibilities = BigInteger.valueOf(CustomCharacter.values().length).pow(countNulls(suffix));
//count the possibilities where we fill the prefix such that our index is not the first occurence of the substring anymore
//these need to be subtracted
List<CustomCharacter> reducedForm = copyOfRange(insert(form,substring,index), 0, index + substring.size() - 1);
BigInteger invalidPrefixPossibilities = countPossibilities(reducedForm, substring);
possibilities = possibilities.add((prefixPossibilities.subtract(invalidPrefixPossibilities)).multiply(suffixPossibilities));
}
}
cache.put(form, possibilities);
}
return cache.get(form);
}
private static boolean fits(List<CustomCharacter> form, List<CustomCharacter> substring, int index) {
boolean result = true;
for (int subStrIndex = 0; subStrIndex < substring.size(); subStrIndex++) {
if (!(form.get(index + subStrIndex) == null || form.get(index + subStrIndex) == substring.get(subStrIndex))) {
result = false;
}
}
return result;
}
private static int countNulls(List<CustomCharacter> l) {
int result = 0;
for (int i = 0; i < l.size(); i++) {
if (l.get(i) == null) {
result++;
}
}
return result;
}
private static List<CustomCharacter> insert(List<CustomCharacter> form, List<CustomCharacter> substring, int index) {
List<CustomCharacter> result = new ArrayList<CustomCharacter>(form);
for (int i = 0; i < substring.size(); i++) {
result.set(index + i, substring.get(i));
}
return result;
}
private static List<CustomCharacter> copyOfRange(List<CustomCharacter> l, int from, int to){
List<CustomCharacter> result = new ArrayList<CustomCharacter>();
for(int i = from; i < to; i++){
result.add(l.get(i));
}
return result;
}
private static List<CustomCharacter> customCharListWithNulls(int size) {
List<CustomCharacter> result = new ArrayList<CustomCharacter>();
for(int i = 0; i < size; i++){
result.add(null);
}
return result;
}
private static List<CustomCharacter> customCharListFromString(String s) {
List<CustomCharacter> result = new ArrayList<CustomCharacter>();
for(char c : s.toCharArray()){
result.add(CustomCharacter.valueOf(String.valueOf(c)));
}
return result;
}
}
4
I have an idea to solve it like a dynamic programming problem.
Let’s see a symmetric problem: How many strings doesn’t contain a specific substring T?
Assume the required string’s length is n, and one of a valid string is S. ( n == S.length() )
We define a 2d array dp[n][T.length()],
Using dp[i,j] to represent “How many property combination we have, when the suffix of substring s[0…i] is the same as the prefix of T, in which have the the longest length(which is variable ‘j’) of the suffix/prefix? “.
In other word: for the constant i, find out max{len} in “S[i-len+1, i]==T[0…len]”, and dp[i,len] means the number of the combination at this state.
For example:
T=abcabc and S[0…i]:
j=6 means that “the 6 chars in the end of S” is abcabc. (S[i-5...I] == abcabc
)
j=3 means that “the 6 chars in the end of S” is ###abc, and ###!="abc"
.
So dp[i, j=3] doesn’t contain dp[i, j=6]
And we have the code here
For performance, we use KMP algorithm to preprocess the string T and generate a ‘nxt’ array.
T = '*'+T; // to avoid -1 in index of array dp. (dp[i,-1])
geneNext();
dp[0][0]=25; dp[0][1]=1;
for (int i=0; i<n; i++) {
for (int j=0; j<T.size()-1; j++) {
//cout<<"dp["<<i<<","<<j<<"] = "<<dp[i][j]<<endl;
for (int ch=0; ch<26; ch++) {//here we assume S[i+1]=ch.
int tmp;// tmp is the new variable 'j' for S[0...i+1].
for (tmp=j+1; tmp>0; tmp = nxt[tmp])
if (T[tmp] == ch+'a')
break;
dp[i+1][tmp] = dp[i+1][tmp]+dp[i][j];
}
}
}
int ans = 0;
for (int i=0; i<T.size()-1; i++)
ans = ans + dp[n-1][i];
cout<<ans<<endl; // ans is my symmetric problem.
cout<<26^n - ans<<endl; // the answer for poster's problem.