Number of strings containing a specific substring

I’ve seen numerous questions (and answers) concerning the number of binary strings (e.g “10010” containing some binary substring (e.g “00”). I’d like to know if there is a way to generalize this:

Given a length n and a string s (may contain any letters A...Z), how many different strings of length n, that contain the substring s at least once, are there?

These kinds of questions are often solvable with combinatorics, but I’d like to find a dynamic-programming solution (and so I posted this question here instead of mathexchange).

Personally, I’m running on cold steam here. Of course, I tried some combinatorial methods:

n = 5
s = CAR
? = unknown character

all possibilities:

CAR??
?CAR?
??CAR

This basically boils down to 26^0*26^2 + 26^1*26^1 + 26^2*26^0 = 3 * 26^2 = 2028 possibilities, which I assume is correct. Consider however the following case:

n = 7
s = CAR
? = unknown character

all possibilities:

CAR????
?CAR???
??CAR??
???CAR?
????CAR

What’s the problem? Well, there’s 3 positions that can produce duplicate results:

CARCAR?
CAR?CAR
?CARCAR

Now the number of possible strings is

(2 * 26^7) + (2 * 26^1 * 26^6) + (2 * 26^2 * 26^5) + (2 * 26^3 * 26^4) - (3 * 26)

I can’t generalize this.

Did I catch anyone’s interest? Any help is appreciated.

1

Acknowledging @Hirle’s solution, I still figured I’d post my own.

I approached this from a little different direction. After some thinking, the simplest program of all held the key:
deterministic finite automaton. This could be solved with a dfa that searched for occurences of a string, as with it I can compute
an adjacency matrix that I can later use. I decided to skip the trivialities and just copied some code off the net for the
deterministic version of Knuth–Morris–Pratt algorithm (the original one if nondeterministic, which doesn’t serve our purposes).

The below code originates from here, with little modifications:

public static int[][] dfa;

// create the DFA from a String
public static void kmp(String pat) {

    // build DFA from pattern
    int m = pat.length();
    dfa = new int[26][m]; 
    dfa[pat.charAt(0) - 'A'][0] = 1; 
    for (int X = 0, j = 1; j < m; j++) {
        for (int c = 0; c < 26; c++) 
            dfa[c][j] = dfa[c][X];          // Copy mismatch cases. 

        dfa[pat.charAt(j) - 'A'][j] = j+1;  // Set match case. 
        X = dfa[pat.charAt(j) - 'A'][X];    // Update restart state. 
    } 
}

For the pattern ABC this generates the following 2D-matrix, or dfa, if you will:

[1, 1, 1]
[0, 2, 0]
[0, 0, 3]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]
[0, 0, 0]

Here, for example, the first row means that if the character seen is `A’, then depending on our current state we go to the state indicated by the row.

Let’s say the state we currently are in is the initial state (state 0). If we now saw the character ‘A’, the next state would be told to us by the first row’s first element,
which is 1. So now our state is 1. Let’s say we, again, saw the character ‘A’. Our state would remain 1, as the first row’s second element is also 1. Alright, now say we saw
the character ‘B’. Our state would proceed to 2, since in the B-row’s (the second row’s) second element is 2.

In essence, our state would be 3 only if we saw the characters ABC in succession. The rows D-Z are all zeroes, since if we ever saw a character other than A-C, we must start
the search from the beginning (which is A, in this case).

To generalize, the matrix element [0][1] tells us the state we should go to, if the character we saw was A and we were in the state 1. Similarly the matrix element
[6][6] told us our next state if the character we saw was G, and we were in the state 6 (in the above matrix we of course don’t have the state 6, since we built it with a string of length 3).

Anyway, now we have all we need to build the adjacency matrix. Without further ado, here’s my code for that:

String s = "ABC"; // The string we built our dfa with, e.g the string we are "searching"

paths = new long[s.length() + 1][s.length() + 1];
for (int k = 0; k < 26; k++)
    for (int i = 0; i < dfa[0].length; i++)
        paths[i][dfa[k][i]]++;

paths[paths.length - 1][paths.length - 1] = 26;

So if we were building the adjacency matrix from a dfa that was built with the string “ABC”, the adjacency matrix would look like this:

[25, 1, 0, 0]
[24, 1, 1, 0]
[24, 1, 0, 1]
[0 , 0, 0, 26]

Here paths[i][j] would tell us how many ways there are to get to the state j from the state i. For example, there are 25 ways to get to state 0 from state 0, because there is only
one way to get to some other state from state 0. Also, there are 26 ways to result in the last state, if we are in the last state.

If we had built everything up so far with the string “AA”, the adjacency matrix would look like this:

[25, 1, 0]
[25, 0, 1]
[0, 0, 26]

So, what now? We have our oh-so-mighty adjacency matrix that tells us cool stuff, what to do with it? Graph theory comes into play.

Referencing Wolfram Alpha:

the (u,v)th element of the kth power of the adjacency matrix of G gives the number of paths of length k between vertices u and v.

So what it is saying is, that if we raise our adjacency matrix into, say, 3rd power, then the element paths[0][k] tells us how many paths of length 3
there are from state 0 to state k.

I’ll prep a little more for this: If we have a string S, and with that string we can result in the last state of the dfa, then we consider S to be OK.
So what we want to know to get the answer to the original problem, is how many OK strings there are in bigger strings of length n.

This is why we raise the adjacency matrix into the nth power. Now the element paths[0][k], where k is the length of the string we are searching, should tell us how many OK strings there are in a bigger string of length n, which just
so happens to be the answer to our problem.

My code to raise a matrix into nth power:

public static long[][] matrixToPower(long[][] a, int p) {
    long[][] b = a;
    for (int n = 1; n < p; n++)
        a = matrixMultiplication(a, b);
    return a;
}

public static long[][] matrixMultiplication(long[][] m1, long[][] m2) {
    int len = m1.length;
    long[][] mResult = new long[len][len];
    for(int i = 0; i < len; i++) {
        for(int j = 0; j < len; j++) {
            for(int k = 0; k < len; k++) {
                mResult[i][j] += (m1[i][k] * m2[k][j]) % M;
                mResult[i][j] %= M;
            }
        }
    }
    return mResult;
}

I’m taking a modulo of size 10^9 + 7 to get reliable answers for big input.

Full code:

import java.util.*;

public class Program {

    public static final int M = 1000000007;

    public static int[][] dfa;
    public static long[][] paths;

    public static void main(String[] args){
        Scanner sc = new Scanner(System.in);
        int n = sc.nextInt();
        String s = sc.next();

        kmp(s); // generate dfa

        paths = new long[s.length() + 1][s.length() + 1];
        for (int k = 0; k < 26; k++)
            for (int i = 0; i < dfa[0].length; i++)
                paths[i][dfa[k][i]]++;            
        paths[paths.length - 1][paths.length - 1] = 26;

        paths = matrixToPower(paths, n);

        System.out.println(paths[0][s.length()]);
    }

    public static long[][] matrixToPower(long[][] a, int p) {
        long[][] b = a;
        for (int n = 1; n < p; n++)
            a = matrixMultiplication(a, b);
        return a;
    }

    public static long[][] matrixMultiplication(long[][] m1, long[][] m2) {
        int len = m1.length;
        long[][] mResult = new long[len][len];
        for(int i = 0; i < len; i++) {
            for(int j = 0; j < len; j++) {
                for(int k = 0; k < len; k++) {
                    mResult[i][j] += (m1[i][k] * m2[k][j]) % M;
                    mResult[i][j] %= M;
                }
            }
        }
        return mResult;
    }

    // create the DFA from a String
    public static void kmp(String pat) {
        // build DFA from pattern
        int m = pat.length();
        dfa = new int[26][m]; 
        dfa[pat.charAt(0) - 'A'][0] = 1; 
        for (int X = 0, j = 1; j < m; j++) {
            for (int c = 0; c < 26; c++) 
                dfa[c][j] = dfa[c][X];         // Copy mismatch cases. 
            dfa[pat.charAt(j) - 'A'][j] = j+1; // Set match case. 
            X = dfa[pat.charAt(j) - 'A'][X];   // Update restart state. 
        } 
    }

}

1

[Edit: realized what I previously answered doesn’t work in all cases, so editing to fix it as per https://meta.stackexchange.com/questions/186689/editing-answers ]

First of all I think there is a slight mix-up in your math (the case for n=7 has some math in it for n=10)

(2 * 26^7) + (2 * 26^1 * 26^6) + (2 * 26^2 * 26^5) + (2 * 26^3 * 26^4) - (3 * 26)

should be

(2 * 26^4) + (2 * 26^1 * 26^3) + (26^2 * 26^2)  - (3 * 26)

or just plain

(5*26^4)-(3*26)

Anyway, framed as a dynamic programming problem, I think you can look at it like this:

1) Place s at some index, i in the resulting string. There are n - length(s) + 1 possibilities.

2) For each possibility, we simply need to count the possibilities for filling in the remaining indexes. As you have noticed, we run into problems with duplicates. This can be avoided by only counting possibilities where s occurs first at i. So when we place CAR at i=0, and get CARCAR?, we count, but when we place CAR at index i=3 we don’t count CARCAR? . If we only count possibilities where s occurs first at i we never count duplicates. The part of the resulting string after our s at i (call it the suffix) can be filled in in any way. For the part before s at i (call it the prefix) we need to take care not to insert occurrences of s.

2a) What we can do here is count all possibilities of filling in the prefix and then subtract the possibilities where s is in. Counting this later part comes down to our original problem: getting the “Number of strings (of length n) containing a specific substring” only for a smaller length (now the length of our prefix, instead of the length of the entire string). So now we have a reduction of our problem to a smaller problem and hence a dynamic solution.

This works for strings like CAR, but not for strings that can overlap with themselves (place LOL at index 3 of a string of length 7. Then we can create ?LOLOL?, containing 2 occurrences of LOL, where the first isn’t (completely) in the prefix, because it overlaps).

2b) This can be fixed by considering a slightly broader problem. We don’t count the number of arbitrary strings that we can make containing some substring, but we count the number of strings of a specific form (that might already have characters fixed at some indices) that we can make containing the substring. When we do this, we can do the same thing as before. Again, we place s at some index, i in the form string (now there aren’t necessarily n - length(s) + 1 possibilities to place s, since some indices in our form might have fixed characters on them and s might not fit everywhere). We still count all possibilities of filling in the prefix. But then not subtract the possibilities where s is in the prefix, but where s is in the prefix plus the s we put at index i save the last letter. So, when we have ???LOL?, we not only see that we need to subtract because we can create LOLLOL?, but since we check for possibilities to fill in LOL in the string ???LO, we also see that we need to subtract because we can create ?LOLOL?. Now, since we considered a broader problem, that can account for forms that might already have characters fixed at some indices, this again is a reduction of our problem to the same problem of smaller size.

Quick code would look like:

import java.util.*;

public class StringsInStrings {
    //use arrays of CustomCharacter as strings/forms, use nulls when a index is not yet filled in
    enum CustomCharacter {
        A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
    };

    public static void main(String[] args) {
        CustomCharacter[] string = new CustomCharacter[] { null, null, null, null, null, null, null };
        CustomCharacter[] substring = new CustomCharacter[] { CustomCharacter.L, CustomCharacter.O, CustomCharacter.L };
        System.out.println(countPossibilities(string, substring));
    }

    static int countPossibilities(CustomCharacter[] form, CustomCharacter[] substring) {
        int possibilities = 0;
        //put our substring at any index in the form
        for (int index = 0; index <= form.length - substring.length; index++) {
            //see if the substring fits at that index
            if (fits(form, substring, index)) {
                //count the possibilities for filling in the prefix
                CustomCharacter[] prefix = Arrays.copyOfRange(form, 0, index);
                int prefixPossibilities = (int) Math.pow(CustomCharacter.values().length, countNulls(prefix));
                //count the possibilities for filling in the suffix
                CustomCharacter[] suffix = Arrays.copyOfRange(form, index+substring.length, form.length);
                int suffixPossibilities = (int) Math.pow(CustomCharacter.values().length, countNulls(suffix));

                //count the possibilities where we fill the prefix such that our index is not the first occurence of the substring anymore
                //these need to be subtracted
                CustomCharacter[] reducedForm = Arrays.copyOfRange(insert(form,substring,index), 0, index + substring.length - 1);
                int invalidPrefixPossibilities = countPossibilities(reducedForm, substring);


                possibilities += (prefixPossibilities - invalidPrefixPossibilities) * suffixPossibilities;
            }

        }

        return possibilities;
    }

    private static boolean fits(CustomCharacter[] form, CustomCharacter[] substring, int index) {
        boolean result = true;
        for (int subStrIndex = 0; subStrIndex < substring.length; subStrIndex++) {
            if (!(form[index + subStrIndex] == null || form[index + subStrIndex] == substring[subStrIndex])) {
                result = false;
            }
        }
        return result;
    }

    private static int countNulls(CustomCharacter[] arr) {
        int result = 0;
        for (int i = 0; i < arr.length; i++) {
            if (arr[i] == null) {
                result++;
            }
        }
        return result;
    }

    private static CustomCharacter[] insert(CustomCharacter[] form, CustomCharacter[] substring, int index) {
        CustomCharacter[] result = Arrays.copyOf(form, form.length);
        for (int i = 0; i < substring.length; i++) {
            result[index + i] = substring[i];
        }
        return result;
    } 
}

So yeah, fun problem. I think this is a correct solution, maybe there is an easier one?

[One more edit: version with memoization and BigInteger. Making it more efficient and making use of the efficiency by having it support large examples… as discussed in the comments]

import java.math.BigInteger;
import java.util.*;

public class StringsInStringsCached {
    //use arrays of CustomCharacter as strings/forms, use nulls when a index is not yet filled in
    enum CustomCharacter {
        A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
    };

    static HashMap<List<CustomCharacter>, BigInteger> cache;

    public static void main(String[] args) {
        List<CustomCharacter> string = customCharListWithNulls(100);
        List<CustomCharacter> substring = customCharListFromString("ABCDABD");

        cache = new HashMap<List<CustomCharacter>, BigInteger>();

        System.out.println(countPossibilities(string, substring));
    }

    static BigInteger countPossibilities(List<CustomCharacter> form, List<CustomCharacter> substring) {
        if(!cache.containsKey(form)){
            BigInteger possibilities = BigInteger.ZERO;
            //put our substring at any index in the form
            for (int index = 0; index <= form.size() - substring.size(); index++) {
                //see if the substring fits at that index
                if (fits(form, substring, index)) {
                    //count the possibilities for filling in the prefix
                    List<CustomCharacter> prefix = copyOfRange(form, 0, index);
                    BigInteger.valueOf(CustomCharacter.values().length).pow(countNulls(prefix));
                    BigInteger.valueOf(countNulls(prefix));

                    BigInteger prefixPossibilities = BigInteger.valueOf(CustomCharacter.values().length).pow(countNulls(prefix));
                    //count the possibilities for filling in the suffix
                    List<CustomCharacter> suffix = copyOfRange(form, index+substring.size(), form.size());
                    BigInteger suffixPossibilities = BigInteger.valueOf(CustomCharacter.values().length).pow(countNulls(suffix));

                    //count the possibilities where we fill the prefix such that our index is not the first occurence of the substring anymore
                    //these need to be subtracted
                    List<CustomCharacter> reducedForm = copyOfRange(insert(form,substring,index), 0, index + substring.size() - 1);
                    BigInteger invalidPrefixPossibilities = countPossibilities(reducedForm, substring);

                    possibilities = possibilities.add((prefixPossibilities.subtract(invalidPrefixPossibilities)).multiply(suffixPossibilities));
                }

            }

            cache.put(form, possibilities);
        }
        return cache.get(form);
    }

    private static boolean fits(List<CustomCharacter> form, List<CustomCharacter> substring, int index) {
        boolean result = true;
        for (int subStrIndex = 0; subStrIndex < substring.size(); subStrIndex++) {
            if (!(form.get(index + subStrIndex) == null || form.get(index + subStrIndex) == substring.get(subStrIndex))) {
                result = false;
            }
        }
        return result;
    }

    private static int countNulls(List<CustomCharacter> l) {
        int result = 0;
        for (int i = 0; i < l.size(); i++) {
            if (l.get(i) == null) {
                result++;
            }
        }
        return result;
    }

    private static List<CustomCharacter> insert(List<CustomCharacter> form, List<CustomCharacter> substring, int index) {
        List<CustomCharacter> result = new ArrayList<CustomCharacter>(form);
        for (int i = 0; i < substring.size(); i++) {
            result.set(index + i, substring.get(i));
        }
        return result;
    } 

    private static List<CustomCharacter> copyOfRange(List<CustomCharacter> l, int from, int to){
        List<CustomCharacter> result = new ArrayList<CustomCharacter>();
        for(int i = from; i < to; i++){
            result.add(l.get(i));
        }
        return result;
    }

    private static List<CustomCharacter> customCharListWithNulls(int size) {
        List<CustomCharacter> result = new ArrayList<CustomCharacter>();
        for(int i = 0; i < size; i++){
            result.add(null);
        }
        return result;
    }

    private static List<CustomCharacter> customCharListFromString(String s) {
        List<CustomCharacter> result = new ArrayList<CustomCharacter>();
        for(char c : s.toCharArray()){
            result.add(CustomCharacter.valueOf(String.valueOf(c)));
        }

        return result;
    }
}

4

I have an idea to solve it like a dynamic programming problem.

Let’s see a symmetric problem: How many strings doesn’t contain a specific substring T?

Assume the required string’s length is n, and one of a valid string is S. ( n == S.length() )

We define a 2d array dp[n][T.length()],
Using dp[i,j] to represent “How many property combination we have, when the suffix of substring s[0…i] is the same as the prefix of T, in which have the the longest length(which is variable ‘j’) of the suffix/prefix? “.

In other word: for the constant i, find out max{len} in “S[i-len+1, i]==T[0…len]”, and dp[i,len] means the number of the combination at this state.

For example:
T=abcabc and S[0…i]:
j=6 means that “the 6 chars in the end of S” is abcabc. (S[i-5...I] == abcabc)
j=3 means that “the 6 chars in the end of S” is ###abc, and ###!="abc".
So dp[i, j=3] doesn’t contain dp[i, j=6]

And we have the code here

For performance, we use KMP algorithm to preprocess the string T and generate a ‘nxt’ array.

T  = '*'+T; // to avoid -1 in index of array dp. (dp[i,-1])
geneNext();
dp[0][0]=25; dp[0][1]=1;
for (int i=0; i<n; i++) {
    for (int j=0; j<T.size()-1; j++) {
        //cout<<"dp["<<i<<","<<j<<"] = "<<dp[i][j]<<endl;
        for (int ch=0; ch<26; ch++) {//here we assume S[i+1]=ch.
            int tmp;// tmp is the new variable 'j' for S[0...i+1].
            for (tmp=j+1; tmp>0; tmp = nxt[tmp])
                if (T[tmp] == ch+'a')
                    break;
            dp[i+1][tmp] = dp[i+1][tmp]+dp[i][j];
        }
    }
}
int ans = 0;
for (int i=0; i<T.size()-1; i++)
    ans = ans + dp[n-1][i];
cout<<ans<<endl; // ans is my symmetric problem.
cout<<26^n - ans<<endl; // the answer for poster's problem.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị
Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa
Thiết kế website Thiết kế website Thiết kế website Cách kháng tài khoản quảng cáo Mua bán Fanpage Facebook Dịch vụ SEO Tổ chức sinh nhật