Counting Strings and the Goulden-Jackson Cluster Method
A lot of this post was inspired by the papers of John Noonan & Doron Zeilberger and Zhuang (you can also checkoout the original paper, but it's a bit harder to follow).
Intro
Suppose we want to find the number of strings of length \( n \) (sourced from some given alphabet \( V \)), which don't contain any of a given set of strings \( B \) as a substring. Is there a fast way to do this?
The most basic case is excluding a string of a single character, in which case there are \( n^{\left|V\right| - 1} \) total strings. But past single character strings, reasoning becomes a bit more difficult. It's always true (and we will show) that the total number of strings follows a linear recurrence and so calculating the first few results using DP and using Berlekamp Massey will give a fast way, though we will show a way to compute a generating function directly.
A Derivation
Let's first define the weight \( W_R \) of some word \( w = w_1 \dots w_n \in V^* \) of length \( n \), and \( R \in \mathbb{Z}^+ \). We will define it using the set of variables \( x\left[w'\right] \) for all \( w' \in V^* \) of length \( R \) or less as follows:
\[ W_R(w) = \prod_{k = 1}^n \prod_{m = k}^{\min(k + R, n)} x\left[w_k \dots w_m\right] \]
Note some factors may appear more than once, for example:
\[ W_2(HELL) = x\left[H\right]x\left[E\right]x\left[L\right]^2x\left[HE\right]x\left[EL\right]x\left[LL\right] \]
Now, we define the generating function over \( x[w] \) where \( w \) has length \( \le R \) as:
\[ \Phi_R = \sum_{w \in V^*} W_R(w) \]
Our strategy will be to perform substitutions on \( \Phi_R \) in order to recover the generating functions we want. For example the mapping:
\begin{equation} x[w] \mapsto \left\{ \begin{array}{ll} 0, & \text{if } w \in B \text{, e.g. w is a string we want to exclude}\\ x, & \text{if } w \text{ is a single character string}\\ 1, & \text{otherwise} \end{array}\right. \end{equation}Will give us the generating function \( \sum a_n x^n \) where \( a_n \) is the number of words of length \( n \) not containing any \( w \in B \) as a substring. We'll denote this generating function by \( f_B(x) \).
Computing \( \Phi_R \)
Let's define:
\[ Suff(w) = \{ w' \in V^* : \text{w' ends in w} \} \]
Now, all words in \( V^* \) must either be of length less than \( R \) or end in some string of length \( R \). Define:
\[ \Phi_{R, w} = \sum_{w' \in Suff(w)} W_R(w') \]
Then \( \Phi_R \) is the sum of \( \Phi_{R, w} \) for all words of length \( R \) plus the sum of \( W_R(w) \) for all words of length less than \( R \). Next, we see that our set of \( \Phi_{R, w} \) form a set of simultaneous equations:
\[ \Phi_{R, w_1 \dots w_R} = W_R(w) + \left(\prod_{i \ge 1} x\left[w_i \dots w_r \right] \right) \sum_{c \in V} \Phi_{R, cw_1 \dots w_{R - 1}} \]
Each equation says in essence, if a word \( w' \) ends in \( w \), then it must either be \( w \) itself, else we can drop the last character of \( w' \) and we are left another word with a suffix of length \( R \). And so calculating \( \Phi_R \) reduces to solving these equations. We can also glean from this that \( \Phi_R \) is rational in its variables which implies as we stated in our introduction, \( f_B(x) \) is the GF of a linear recurrence.
Note, making the substitution (1) prior to solving the system simplifies computing \( f_B(x) \). We'll denote (1) applied to \( \Phi_{R, w} \) as \( \Phi_{R, w}(x) \).
Example
Consider the binary string of length \( n \) not containing the substring \( 111 \). We see (making our substitution ahead of time):
\begin{align*} \Phi_{3, 000}(x) &= x^3 + x \left(\Phi_{3, 100}(x) + \Phi_{3, 000}(x) \right)\\ \Phi_{3, 001}(x) &= x^3 + x \left(\Phi_{3, 100}(x) + \Phi_{3, 000}(x) \right)\\ \Phi_{3, 010}(x) &= x^3 + x \left(\Phi_{3, 101}(x) + \Phi_{3, 001}(x) \right)\\ \Phi_{3, 011}(x) &= x^3 + x \left(\Phi_{3, 101}(x) + \Phi_{3, 001}(x) \right)\\ \Phi_{3, 100}(x) &= x^3 + x \left(\Phi_{3, 110}(x) + \Phi_{3, 010}(x) \right)\\ \Phi_{3, 101}(x) &= x^3 + x \left(\Phi_{3, 110}(x) + \Phi_{3, 010}(x) \right)\\ \Phi_{3, 110}(x) &= x^3 + x \left(\Phi_{3, 111}(x) + \Phi_{3, 011}(x) \right)\\ \Phi_{3, 111}(x) &= x \left(\Phi_{3, 111}(x) + \Phi_{3, 011}(x) \right)\\ \end{align*}Solving, we find that:
\begin{align*} \Phi_{3, 000}(x) &= \Phi_{3, 001}(x) = \Phi_{3, 010}(x) = \Phi_{3, 011}(x) = -\frac{x^5 + x^4 + x^3}{x^3 + x^2 + x - 1}\\ \Phi_{3, 100}(x) &= \Phi_{3, 101}(x) = \Phi_{3, 110}(x) = -\frac{x^4 + x^3}{x^3 + x^2 + x - 1}\\ \Phi_{3, 111}(x) &= 0 \end{align*}And thus:
\begin{align*} f_{\{111\}}(x) &= 1 + 2x + 4x^2 + \frac{4x^5 + 6x^4 + 7x^3}{1 - x^3 - x^2 - x}\\ &= \frac{x^2 + x + 1}{1 - x^3 - x^2 - x}\\ \end{align*}AKA the (shifted) Tribonacci numbers.
The Goulden-Jackson Cluster Method
For large alphabets and \( R \) the method above will result in a lot of computational effort; our system alone will be of size \( \left|V\right|^R \). We'll introduce the Goulden-Jackson Cluster method as a means of reducing our work.
In this section we'll add a couple of restrictions on our set of bad words \( B \). Firstly, no bad word should appear as a substring of any other bad word - the bigger bad word can be removed from \( B \) for the same end result. Secondly, all \( b \in B \) should be of length at least two. If this is not true, we can equivalently remove \( v \in B \) from our alphabet \( V \).
Clusters
Given \( w \) and a set of words \( B \), we define a marked word as a pair \( (w, \{ (b_1, i_1), (b_2, i_2) \dots (b_l, i_l) : w_{i_k} \dots w_{i_k + length(b_k) - 1} = b_k \in B \}) \). For example, for \( B = \{HE, EL, LO \} \), the following is a marked word:
\[ (HELLO, \{ (HE, 1), (EL, 2), (LO, 4) \}) \]
And we define a cluster as a nonempty marked word for which every letter in \( w \) belongs to at least one bad word, and neighbouring bad words appearing in \( w \) always overlap:
\[ (HEL, \{ (HE, 1), (EL, 2) \}) \]
Note, every subword of \( B \) in \( w \) needn't be included in the marked word, for example:
\[ (HELLO, \{ (HE, 1) \}) \]
Is a completely valid marked word.
Given \( B \), we'll define \( C_B(w) \) as the set of all clusters on \( w \) (exercise: find \( w, B \) such that this set has size greater than one), \( M_B \) as the set of all marked words, and \( C_B \) as the set of all clusters.
The concatenation of two marked words \( m_1, \ m_2 \) will be denoted \( m_1m_2 \), and defined how you would expect.
A Formula
First of all, we'll give an equivalent definition of \( f_B(x) \):
\[ f_B(x) = \sum_{w \in L_B} x^{length(w)} \]
Where \( L_B \) is the set of all words in \( V^* \), not containing any word in \( B \) as a substring. We'll focus on calculating \( f_B(x) \) from here on, but other substitutions on \( \Phi_R \) act similarly (and are in examples).
Further define the auxiliary generating functions:
\begin{align*} F_B(x, t) &= \sum_{(w, S) = m \in M_B} x^{length(w)} t^{\left|S\right|}\\ C_B(x, t) &= \sum_{(w, S) = c \in C_B} x^{length(w)} t^{\left|S\right|} \end{align*}And define \( Q(m = (w, S)) = x^{length(w)}t^{\left|S\right|} \) for brevity (it should be clear that \( Q(m_1m_2) = Q(m_1)Q(m_2) \)). Next, we see that every marked word \( m = (w, S) \) either ends in a character not present in any bad word in \( S \), or otherwise the last character is part of the last bad word in \( S \) (which itself must be part of a cluster):
\[ M_B = \{ e \} \cup \{ mc : m \in M_B, \ c \in C_B \} \cup \{ mv : m \in M_B, \ v \in V \}\\ \]
\begin{align*} \Rightarrow F_B(x, t) &= 1 + \sum_{m \in M_B} \sum_{c \in C_B} Q(mc) + \sum_{m \in M_B} \sum_{v \in V} Q(mv)\\ &= 1 + \sum_{m \in M_B} \sum_{c \in C_B} Q(m)Q(c) + \sum_{m \in M_B} \sum_{v \in V} Q(m)Q(v)\\ &= 1 + F_B(x, t)C_B(x, t) + \left|V\right|x \left(F_B(x, t)\right)\\ &= \frac{1}{1 - \left|V\right|x - C_B(x, t)} \end{align*}Where \( e \) is the (unique) empty marked word; note also the union is disjoint. We also wave hands a bit for \( v \in V \), these always correspond to exactly one marked word given all elements of \( B \) have length greater than one.
Thus, calculating \( F_B(x, b) \) reduces to calculating \( C_B(x, t) \). We can group clusters according to their last bad word \( b \). For some cluster \( c = (w, S) \), the cluster must then either consist solely of \( b \) (which implies \( w = b \)), else we can remove \( b \) along with some suffix of \( w \) to produce a smaller cluster.
For each \( b \in B \) let \( C_B[b] \) denote the set of clusters ending in \( b \), with \( C_B[b](x, t) \) defined similarly. Then \( C_B[b](x, t) \) form a SLE, for example for \( B = \{HELE, ELEM\} \), we have:
\begin{align*} C_B[ELEM](x, t) &= C_B[HELE](x, t)xt + C_B[HELE](x, t)x^3t + x^4t\\ C_B[HELE](x, t) &= x^4t \end{align*}Which results in:
\begin{align*} C_B(x, t) &= x^4t + x^4t(xt + x^3t + 1)\\ F_B(x, t) &= \frac{1}{(1 - 26x) - (x^4t + x^4t(xt + x^3t + 1))} \end{align*}Now, recovering \( f_B(x) \) from \( F_B(x, t) \) is equivalent to substituting \( t = -1 \) (exercise!), resulting in:
\[ f_B(x) = \frac{1}{1 - x^7 - x^5 - 2x^4 - 26x} \]
Sample Sage implementation:
import string def goulden_jackson(bad_words, alphabet=string.ascii_uppercase): s, gfvs = var("s"), {w: var(f"G_{w}") for w in bad_words} eqns = [] for end_word in bad_words: eq = -s^(len(end_word)) for i in range(1, len(end_word) + 1): sub = end_word[:i] for source_word in bad_words: if source_word.endswith(sub): eq += -s^(len(end_word) - len(sub))*gfvs[source_word] eqns.append(eq == 0) soln = solve(eqns, *gfvs.values()) CB = sum(eq.right() for eq in (soln[0] if len(bad_words) > 1 else soln)) G = 1 / (1 - len(alphabet)*s - CB) return G.numerator() / G.denominator()
Our overlap checking is not optimised, we could do better with a suffix tree when bad_words
is large. We could also further exploit symmetry, e.g. we must always have \( C_B[abb](x, t) = C_B[cbb](x, t) \).
Examples
PGF for the First Occurrence of a Binary String
For some binary string \( w = w_1 \dots w_l \), let \( G(x) = \sum_{n = 1} p_n x^n \) where \( p_n \) is defined as the probability that the first occurrence of the string \( w \) in a random infinite binary string starts at \( n \).
Then the number of binary strings of length \( n \) where the first occurrence of \( w \) occurs at the last \( l \) characters is given by the number of binary strings of length \( n \) which do contain \( w \) as a substring, minus the the number of binary strings of length \( n - 1 \) which contain \( w \) as a substring:
\begin{align*} (2^n - \left[x^n\right]f_{\{w\}}(x)) - (2^{n - 1} - \left[x^{n - 1}\right]f_{\{w\}}(x)) \end{align*}Adjusting by \( l \) since we want the character where \( w \) starts:
\begin{align*} p_n = \frac{(2^{n + l} - \left[x^{n + l}\right]f_{\{w\}}(x)) - (2^{n + l - 1} - \left[x^{n + l - 1}\right]f_{\{w\}}(x))}{2^{n + l - 1}} \end{align*}Summing and multiplying through by \( x \) to account for moving from \( 0 \) to \( 1 \) indexing:
\begin{align*} G(x) &= 2x(1 - x)\frac{\frac{1}{1 - 2x} - f_{\{w\}}(x)}{2^lx^l}\bigg\rvert_{x=\frac{x}{2}}\\ &= x\left(1 - \frac{x}{2}\right)\frac{\frac{1}{1 - x} - f_{\{w\}}(\frac{x}{2})}{x^l} \end{align*}A Weighted Penney's Game
Consider a game between two players which consists of a set of rounds consisting of tosses of an unfair coin (say heads = \( p \)). Player 1 wins the round if the result is heads, and player 2 similarly for tails. A player wins the game if they reach \( k \) consecutive round wins. What is the probability player 1 wins? Markov chains may be a bit cleaner for this example, but we'll show how GJ can be applied anyway!
We will first calculate the number of strings of length \( n \) containing a given string \( w_1 \), and not containing a second string \( w_2 \) (denote this criteria \( C_1 \)). This corresponds to a sequence of tosses containing \( k \) \( H \)s in a row, but not \( k \) \( T \)s. Then we see how this allows us to calculate number of strings of length \( n \) containing a given string \( w_1 \) only as a suffix, and not containing a second string \( w_2 \) (\( C_2 \)), which will give us our result.
Let \( S_n \) consist of the set of strings satisfying \( C_1 \). Then if \( V_n \) is the set of strings of length \( n \) not containing \( w_2 \), and \( U_n \) is the set of strings of length \( n \) not containing \( w_1 \) or \( w_2 \), what is \( W_n = V_n - U_n \)?
We'll show by the classic subset argument \( W_n = S_n \). Suppose \( w \) is a target string, e.g. it contains \( w_1 \) and not \( w_2 \). Then \( w \in V_n \) since it doesn't contain \( w_2 \), and also \( w \not \in U_n \) since it contains \( w_1 \), which implies \( S_n \subseteq W_n \). Similarly, if \( w \in V_n\) then \( w \) does not contain \( w_2 \), and \( w \not \in U_n \) implies \( w \) must contain \( w_1 \) since we now it doesn't contain \( w_2 \); thus \( W_n \subseteq S_n \).
Since \( U_n \subseteq V_n \) we must have:
\[ \left|S_n\right| = \left|V_n\right| - \left|U_n\right| \]
For brevity let \( x_n = \left|X_n\right| \). Since \( v_n, u_n \) are just substring exclusion problems, we can use our methods to calculate \( s_n \). But now how do we calculate the number of substrings of length \( n \) which don't contain \( w_2 \), and contain \( w_1 \) only as a suffix? Like in the previous example, we may be tempted to say "subtract \( 2s_{n - 1} \) from \( s_n \) to account for adding \( H \) or \( T \) to any \( w \in S_{n - 1} \)", but this is not correct since appending \( T \) to \( w \in S_{n - 1} \) may result in a string ending in \( k\ T\)s.
Let \( S(x, y) = \sum_{n, m} s_{n, m} x^ny^m \) where \( s_{n, m} \) is the number of strings with \( n \) \( H \)s and \( m \) \( T \)s satisfying \( C_1 \) (we know this one). Further let \( T(x, y) = \sum_{n, m} t_{n, m} x^ny^m \) where \( t_{n,m} \) is as like \( s_{n, m} \), but with added condition that the string ends in \( T \); and let Further let \( K(x, y) = \sum_{n, m} k_{n, m} x^ny^m \) where \( k_{n,m} \) is as like \( s_{n, m} \), but with added condition that the string ends in \( k - 1 \) lots of \( T \)s. Then the following holds:
\begin{align*} T(x, y) &= yS(x, y) - yK(x, y)\\ K(x, y) &= y^4S(x, y) - y^4T(x, y)\\ P(x, y) &= (1 - x - y)S(x, y) + K(x, y) \end{align*}Where \( P(x, y) \) is the generating function we want (e.g. satisfying \( C_2 \)), which results in the following:
\[ P(x, y) = \left(1 - x - y + y \frac{y^{k - 1} - y^k}{1 - y^k}\right)S(x, y) \]
Now the probability of player 1 winning is just \( P(p, (1 - p)) \).
Questions
- Is the following equivalent to our definition of a cluster: "Define \( (w, S) \) as a cluster if it cannot be decomposed as the concatenation (defined how you would expect) of two nonempty marked words"?
- How may we find the generating function of the number of words of length \( n \) in which every letter must be contained in a bad word?