Counting Strings and the Goulden-Jackson Cluster Method
A lot of this post was inspired by the papers of John Noonan & Doron Zeilberger and Zhuang (you can also checkoout the original paper, but it's a bit harder to follow).
Intro
Suppose we want to find the number of strings of length
The most basic case is excluding a string of a single character, in which case there are
A Derivation
Let's first define the weight
Note some factors may appear more than once, for example:
Now, we define the generating function over
Our strategy will be to perform substitutions on
Will give us the generating function
Computing
Let's define:
Now, all words in
Then
Each equation says in essence, if a word
Note, making the substitution (1) prior to solving the system simplifies computing
Example
Consider the binary string of length
Solving, we find that:
And thus:
AKA the (shifted) Tribonacci numbers.
The Goulden-Jackson Cluster Method
For large alphabets and
In this section we'll add a couple of restrictions on our set of bad words
Clusters
Given
And we define a cluster as a nonempty marked word for which every letter in
Note, every subword of
Is a completely valid marked word.
Given
The concatenation of two marked words
A Formula
First of all, we'll give an equivalent definition of
Where
Further define the auxiliary generating functions:
And define
Where
Thus, calculating
For each
Which results in:
Now, recovering
Sample Sage implementation:
import string def goulden_jackson(bad_words, alphabet=string.ascii_uppercase): s, gfvs = var("s"), {w: var(f"G_{w}") for w in bad_words} eqns = [] for end_word in bad_words: eq = -s^(len(end_word)) for i in range(1, len(end_word) + 1): sub = end_word[:i] for source_word in bad_words: if source_word.endswith(sub): eq += -s^(len(end_word) - len(sub))*gfvs[source_word] eqns.append(eq == 0) soln = solve(eqns, *gfvs.values()) CB = sum(eq.right() for eq in (soln[0] if len(bad_words) > 1 else soln)) G = 1 / (1 - len(alphabet)*s - CB) return G.numerator() / G.denominator()
Our overlap checking is not optimised, we could do better with a suffix tree when bad_words
is large. We could also further exploit symmetry, e.g. we must always have
Examples
PGF for the First Occurrence of a Binary String
For some binary string
Then the number of binary strings of length
Adjusting by
Summing and multiplying through by
A Weighted Penney's Game
Consider a game between two players which consists of a set of rounds consisting of tosses of an unfair coin (say heads =
We will first calculate the number of strings of length
Let
We'll show by the classic subset argument
Since
For brevity let
Let
Where
Now the probability of player 1 winning is just
Questions
- Is the following equivalent to our definition of a cluster: "Define
as a cluster if it cannot be decomposed as the concatenation (defined how you would expect) of two nonempty marked words"? - How may we find the generating function of the number of words of length
in which every letter must be contained in a bad word?