A lot of this post was inspired by the papers of John Noonan & Doron Zeilberger and Zhuang (you can also checkoout the original paper, but it's a bit harder to follow).

Intro

Suppose we want to find the number of strings of length n (sourced from some given alphabet V), which don't contain any of a given set of strings B as a substring. Is there a fast way to do this?

The most basic case is excluding a string of a single character, in which case there are n|V|1 total strings. But past single character strings, reasoning becomes a bit more difficult. It's always true (and we will show) that the total number of strings follows a linear recurrence and so calculating the first few results using DP and using Berlekamp Massey will give a fast way, though we will show a way to compute a generating function directly.

A Derivation

Let's first define the weight WR of some word w=w1wnV of length n, and RZ+. We will define it using the set of variables x[w] for all wV of length R or less as follows:

WR(w)=k=1nm=kmin(k+R,n)x[wkwm]

Note some factors may appear more than once, for example:

W2(HELL)=x[H]x[E]x[L]2x[HE]x[EL]x[LL]

Now, we define the generating function over x[w] where w has length R as:

ΦR=wVWR(w)

Our strategy will be to perform substitutions on ΦR in order to recover the generating functions we want. For example the mapping:

(1)x[w]{0,if wB, e.g. w is a string we want to excludex,if w is a single character string1,otherwise

Will give us the generating function anxn where an is the number of words of length n not containing any wB as a substring. We'll denote this generating function by fB(x).

Computing ΦR

Let's define:

Suff(w)={wV:w' ends in w}

Now, all words in V must either be of length less than R or end in some string of length R. Define:

ΦR,w=wSuff(w)WR(w)

Then ΦR is the sum of ΦR,w for all words of length R plus the sum of WR(w) for all words of length less than R. Next, we see that our set of ΦR,w form a set of simultaneous equations:

ΦR,w1wR=WR(w)+(i1x[wiwr])cVΦR,cw1wR1

Each equation says in essence, if a word w ends in w, then it must either be w itself, else we can drop the last character of w and we are left another word with a suffix of length R. And so calculating ΦR reduces to solving these equations. We can also glean from this that ΦR is rational in its variables which implies as we stated in our introduction, fB(x) is the GF of a linear recurrence.

Note, making the substitution (1) prior to solving the system simplifies computing fB(x). We'll denote (1) applied to ΦR,w as ΦR,w(x).

Example

Consider the binary string of length n not containing the substring 111. We see (making our substitution ahead of time):

Φ3,000(x)=x3+x(Φ3,100(x)+Φ3,000(x))Φ3,001(x)=x3+x(Φ3,100(x)+Φ3,000(x))Φ3,010(x)=x3+x(Φ3,101(x)+Φ3,001(x))Φ3,011(x)=x3+x(Φ3,101(x)+Φ3,001(x))Φ3,100(x)=x3+x(Φ3,110(x)+Φ3,010(x))Φ3,101(x)=x3+x(Φ3,110(x)+Φ3,010(x))Φ3,110(x)=x3+x(Φ3,111(x)+Φ3,011(x))Φ3,111(x)=x(Φ3,111(x)+Φ3,011(x))

Solving, we find that:

Φ3,000(x)=Φ3,001(x)=Φ3,010(x)=Φ3,011(x)=x5+x4+x3x3+x2+x1Φ3,100(x)=Φ3,101(x)=Φ3,110(x)=x4+x3x3+x2+x1Φ3,111(x)=0

And thus:

f{111}(x)=1+2x+4x2+4x5+6x4+7x31x3x2x=x2+x+11x3x2x

AKA the (shifted) Tribonacci numbers.

The Goulden-Jackson Cluster Method

For large alphabets and R the method above will result in a lot of computational effort; our system alone will be of size |V|R. We'll introduce the Goulden-Jackson Cluster method as a means of reducing our work.

In this section we'll add a couple of restrictions on our set of bad words B. Firstly, no bad word should appear as a substring of any other bad word - the bigger bad word can be removed from B for the same end result. Secondly, all bB should be of length at least two. If this is not true, we can equivalently remove vB from our alphabet V.

Clusters

Given w and a set of words B, we define a marked word as a pair (w,{(b1,i1),(b2,i2)(bl,il):wikwik+length(bk)1=bkB}). For example, for B={HE,EL,LO}, the following is a marked word:

(HELLO,{(HE,1),(EL,2),(LO,4)})

And we define a cluster as a nonempty marked word for which every letter in w belongs to at least one bad word, and neighbouring bad words appearing in w always overlap:

(HEL,{(HE,1),(EL,2)})

Note, every subword of B in w needn't be included in the marked word, for example:

(HELLO,{(HE,1)})

Is a completely valid marked word.

Given B, we'll define CB(w) as the set of all clusters on w (exercise: find w,B such that this set has size greater than one), MB as the set of all marked words, and CB as the set of all clusters.

The concatenation of two marked words m1, m2 will be denoted m1m2, and defined how you would expect.

A Formula

First of all, we'll give an equivalent definition of fB(x):

fB(x)=wLBxlength(w)

Where LB is the set of all words in V, not containing any word in B as a substring. We'll focus on calculating fB(x) from here on, but other substitutions on ΦR act similarly (and are in examples).

Further define the auxiliary generating functions:

FB(x,t)=(w,S)=mMBxlength(w)t|S|CB(x,t)=(w,S)=cCBxlength(w)t|S|

And define Q(m=(w,S))=xlength(w)t|S| for brevity (it should be clear that Q(m1m2)=Q(m1)Q(m2)). Next, we see that every marked word m=(w,S) either ends in a character not present in any bad word in S, or otherwise the last character is part of the last bad word in S (which itself must be part of a cluster):

MB={e}{mc:mMB, cCB}{mv:mMB, vV}

FB(x,t)=1+mMBcCBQ(mc)+mMBvVQ(mv)=1+mMBcCBQ(m)Q(c)+mMBvVQ(m)Q(v)=1+FB(x,t)CB(x,t)+|V|x(FB(x,t))=11|V|xCB(x,t)

Where e is the (unique) empty marked word; note also the union is disjoint. We also wave hands a bit for vV, these always correspond to exactly one marked word given all elements of B have length greater than one.

Thus, calculating FB(x,b) reduces to calculating CB(x,t). We can group clusters according to their last bad word b. For some cluster c=(w,S), the cluster must then either consist solely of b (which implies w=b), else we can remove b along with some suffix of w to produce a smaller cluster.

For each bB let CB[b] denote the set of clusters ending in b, with CB[b](x,t) defined similarly. Then CB[b](x,t) form a SLE, for example for B={HELE,ELEM}, we have:

CB[ELEM](x,t)=CB[HELE](x,t)xt+CB[HELE](x,t)x3t+x4tCB[HELE](x,t)=x4t

Which results in:

CB(x,t)=x4t+x4t(xt+x3t+1)FB(x,t)=1(126x)(x4t+x4t(xt+x3t+1))

Now, recovering fB(x) from FB(x,t) is equivalent to substituting t=1 (exercise!), resulting in:

fB(x)=11x7x52x426x

Sample Sage implementation:

import string


def goulden_jackson(bad_words, alphabet=string.ascii_uppercase):
    s, gfvs = var("s"), {w: var(f"G_{w}") for w in bad_words}
    eqns = []
    for end_word in bad_words:
        eq = -s^(len(end_word))
        for i in range(1, len(end_word) + 1):
            sub = end_word[:i]
            for source_word in bad_words:
                if source_word.endswith(sub):
                    eq += -s^(len(end_word) - len(sub))*gfvs[source_word]
        eqns.append(eq == 0)

    soln = solve(eqns, *gfvs.values())
    CB = sum(eq.right() for eq in (soln[0] if len(bad_words) > 1 else soln))
    G = 1 / (1 - len(alphabet)*s - CB)
    return G.numerator() / G.denominator()

Our overlap checking is not optimised, we could do better with a suffix tree when bad_words is large. We could also further exploit symmetry, e.g. we must always have CB[abb](x,t)=CB[cbb](x,t).

Examples

PGF for the First Occurrence of a Binary String

For some binary string w=w1wl, let G(x)=n=1pnxn where pn is defined as the probability that the first occurrence of the string w in a random infinite binary string starts at n.

Then the number of binary strings of length n where the first occurrence of w occurs at the last l characters is given by the number of binary strings of length n which do contain w as a substring, minus the the number of binary strings of length n1 which contain w as a substring:

(2n[xn]f{w}(x))(2n1[xn1]f{w}(x))

Adjusting by l since we want the character where w starts:

pn=(2n+l[xn+l]f{w}(x))(2n+l1[xn+l1]f{w}(x))2n+l1

Summing and multiplying through by x to account for moving from 0 to 1 indexing:

G(x)=2x(1x)112xf{w}(x)2lxl|x=x2=x(1x2)11xf{w}(x2)xl

A Weighted Penney's Game

Consider a game between two players which consists of a set of rounds consisting of tosses of an unfair coin (say heads = p). Player 1 wins the round if the result is heads, and player 2 similarly for tails. A player wins the game if they reach k consecutive round wins. What is the probability player 1 wins? Markov chains may be a bit cleaner for this example, but we'll show how GJ can be applied anyway!

We will first calculate the number of strings of length n containing a given string w1, and not containing a second string w2 (denote this criteria C1). This corresponds to a sequence of tosses containing k Hs in a row, but not k Ts. Then we see how this allows us to calculate number of strings of length n containing a given string w1 only as a suffix, and not containing a second string w2 (C2), which will give us our result.

Let Sn consist of the set of strings satisfying C1. Then if Vn is the set of strings of length n not containing w2, and Un is the set of strings of length n not containing w1 or w2, what is Wn=VnUn?

We'll show by the classic subset argument Wn=Sn. Suppose w is a target string, e.g. it contains w1 and not w2. Then wVn since it doesn't contain w2, and also wUn since it contains w1, which implies SnWn. Similarly, if wVn then w does not contain w2, and wUn implies w must contain w1 since we now it doesn't contain w2; thus WnSn.

Since UnVn we must have:

|Sn|=|Vn||Un|

For brevity let xn=|Xn|. Since vn,un are just substring exclusion problems, we can use our methods to calculate sn. But now how do we calculate the number of substrings of length n which don't contain w2, and contain w1 only as a suffix? Like in the previous example, we may be tempted to say "subtract 2sn1 from sn to account for adding H or T to any wSn1", but this is not correct since appending T to wSn1 may result in a string ending in k Ts.

Let S(x,y)=n,msn,mxnym where sn,m is the number of strings with n Hs and m Ts satisfying C1 (we know this one). Further let T(x,y)=n,mtn,mxnym where tn,m is as like sn,m, but with added condition that the string ends in T; and let Further let K(x,y)=n,mkn,mxnym where kn,m is as like sn,m, but with added condition that the string ends in k1 lots of Ts. Then the following holds:

T(x,y)=yS(x,y)yK(x,y)K(x,y)=y4S(x,y)y4T(x,y)P(x,y)=(1xy)S(x,y)+K(x,y)

Where P(x,y) is the generating function we want (e.g. satisfying C2), which results in the following:

P(x,y)=(1xy+yyk1yk1yk)S(x,y)

Now the probability of player 1 winning is just P(p,(1p)).

Questions

  1. Is the following equivalent to our definition of a cluster: "Define (w,S) as a cluster if it cannot be decomposed as the concatenation (defined how you would expect) of two nonempty marked words"?
  2. How may we find the generating function of the number of words of length n in which every letter must be contained in a bad word?