I always thought that the Chomsky-Schützenberger Theorem is an easy consequence of the fact that for every CFG G, the strings generated by G are precisely the yields of the parse trees of G. Until recently, I did not have a good understanding of just how easy a consequence it is.
The theorem is usually stated as follows:
Theorem (Chomsky-Schützenberger). For every context-free language L, there exist a positive integer n, a homomorphism h, and a regular set R such that
L = h(Dn ∩ R),
where Dn is the Dyck language over the set Δn of n pairs of parentheses.
In fact, the theorem can be strengthened to require that the set R be local in the sense that there are sets J1, J2 ⊆ Δn and I ⊆ Δn2 such that
R = (J1 Δn* ∩ Δn* J2) − Δn* (Δn2 − I) Δn*.
This strengthening is often not mentioned in textbooks; it is explicitly stated in Chomsky's (1963) "Formal properties of grammars". Another fact mentioned by Chomsky (1963) is that Dn may be replaced with the set D′n of Dyck primes, which are elements of Dn whose first and last symbols form a matching pair of parentheses. (Also, the homomorphism h is in fact alphabetic in the sense that it maps each symbol in Δn either to a symbol or to the empty string, but this is always clear from the proof.)
I had thought that the set Dn ∩ R in the statement of the theorem can just be taken to be an alternative representation of the set PT(G) of parse trees of any CFG G for L. This is essentially correct, but not literally true. Chomsky expressed the qualification as restrictions on the rules of G (see below).
Proofs of the theorem given in some textbooks (e.g., Salomaa 1972 and Berstel 1979) fail to make any connection between Dn ∩ R and PT(G), thus bypassing a main concern for Chomsky. The proof in Kozen's (1997) Automata and Computability—perhaps the clearest textbook treatment of the theorem—makes a clear connection between the two sets without explicitly mentioning it. (Another easy line of approach to the theorem, via pushdown automata (e.g., Harrison 1978) only makes an indirect connection between the two sets, through PDA computations.)
Here's how I came to view the connection. Recall that a set L of trees over some finite alphabet Γ of labels is local if there are a set J ⊆ Γ and a finite set I ⊆ Γ+ such that any tree T is in L if and only if it satisfies the following conditions:
- the label of the root is in J, and
- for every node, the string obtained by concatenating its label with the labels of its children is in I.
As is well-known, the set
PT(
G) is local for any CFG
G.
Let τ be the obvious translation that maps a tree to its labeled bracketing representation—that is to say, if T is a tree with root label σ and immediate subtrees T1, …, Tn, then τ(T) = [σ τ(T1) … τ(Tn) ]σ. If T is a tree over Γ with |Γ| = n, then τ(T) is an element of D′n, where each pair of parentheses in Δn corresponds to a symbol in Γ. Let us call a set L of trees over Γ super-local if τ(L) = D′n ∩ R for some local set R ⊆ Δn*. It is clear that a set L of trees is super-local if there are sets J1, J2 ⊆ Γ, and I1, I2, I3 ⊆ Γ2 such that for every tree T, T is in L if and only if it satisfies the following conditions:
- the label of the root is in J1,
- the label of every leaf is in J2,
- if a node labeled by Y is the first child of a node labeled by X, then XY ∈ I1,
- if a node labeled by Y is the last child of a node labeled by X, then YX ∈ I2, and
- if a node labeled by X and a node labeled by Y are adjacent siblings (in this order), then XY ∈ I3.
Clearly, any super-local set of trees of bounded out-degree must be local, but it is easy to see that there are local sets, including sets of the form PT(G), that are not super-local. The set of "rule trees" of a CFG, where a rule, rather than a nonterminal, labels each node, need not be super-local either.
The strengthened form of the Chomsky-Schützenberger Theorem is an immediate consequence of the following easy lemma:
Lemma. If L is a local set of trees over Γ, then there exist a super-local set L′ over Γ′ and a relabeling g: Γ′ → Γ such that g maps L′ onto L.
The idea is to add to the label of each node information about the labels of all its siblings. This construction can be seen to underly Kozen's (1997) proof of the Chomsky-Schützenberger Theorem. While a "rule tree" of a CFG labels each node by the rule expanding it, the tree used by Kozen puts information about the rule on the children of the node expanded by the rule. It is in fact sufficient for each child to be labeled by the right-hand side of the rule expanding its parent (plus its position among its siblings).
Proof. Let J ⊆ Γ, I ⊆ Γ+ be the finite sets witnessing the locality of L. Let
Γ′ = { [X]1 | X ∈ J } ∪ { [α]i | X ∈ Γ, Xα ∈ I, 1 ≤ i ≤ |α| },
and set
J1 = { [X]1 | X ∈ J },
J2 = { [X1 … Xl]i | Xi ∈ I },
I1 = { [X1 … Xl]i [Y1 … Ym]1 | Xi Y1 … Ym ∈ I, m ≥ 1 },
I2 = { [Y1 … Ym]m [X1 … Xl]i | Xi Y1 … Ym ∈ I, m ≥ 1 },
I3 = { [X1 … Xl]i [X1 … Xl]i+1 | 1 ≤ i < l}.
Let L′ be the super-local set determined by J1, J2, I1, I2, I3, and let
g([X1 … Xl]i) = Xi.
Now it is clear that g gives a one-one correspondence between the trees in L′ and the trees in L.
Note that since the conditions given by I1, I2, I3 in the above proof are not independent, the definition of either I1 or I2 (but not both) may be changed to drop the part that says XiY1 … Ym ∈ I, without thereby affecting the resulting super-local set L′. If we wish to have τ(L′) = Dn ∩ R, rather than τ(L′) = D′n ∩ R, that's possible too by changing the label of the root of every tree to a special symbol that only appears at the root.
In the above proof, each node label in a tree from L is changed to the sequence of original labels of all its siblings, including itself, plus its position among its siblings. Clearly, there are more economical ways of adding information to each node label to create a super-local set. For instance, we can use labels of the form Xi … Xm consisting of the node's original label followed by the original labels of all its right siblings.
Let us call a tree language super-regular if it can be expressed in the form τ-1(D′m ∩ R) with R regular. Every super-regular tree language is regular, since it can be recognized by a "left-to-right" one-visit tree-walking automaton (cf. Okhotin, Salomaa, and Domaratzki 2002). The converse is of course not true, and the super-regular tree languages don't even contain all the local tree languages. Here is an example of a Chomsky normal form CFG G such that PT(G) is not super-regular. This shows that the set Dn ∩ R in the Chomsky-Schützenberger Theorem, even in its weak form, cannot in general be taken to be τ(PT(G)), unless the grammar G satisfies some special conditions.
S → S A
S → B S
S → a
A → a
B → a
That the set of parse trees of this grammar is not super-regular can be seen as follows. Consider strings of the form
(*) [S x1 … [S xk [S [a ]a ]S yk ]S … y1 ]S, with xi ∈ { [A [a ]a ]A, ε } and yi ∈ { [B [a ]a ]B, ε } (i = 1, … k).
Any string of the form (*) is in τ(PT(G)) if and only if for all i = 1, … k, either
xi = [A [a ]a ]A and yi = ε
or
xi = ε and yi = [B [a ]a ]B.
This means that the 2k strings of the form
(**) [S x1 … [S xk, with xi ∈ { [A [a ]a ]A, ε } (i = 1, … k)
are such that no two of them stand in the Myhill-Nerode right-congruence relation ≡τ(PT(G)) associated with τ(PT(G)).
Now let D′4 be the set of Dyck primes over { [S,]S,[A,]A,[B,]B,[a,]a }, and R be any regular set of strings. Clearly, all strings of the form (**) stand in the relation ≡D′4 to each other. Since R is regular, ≡R has only finitely many equivalence classes. So if we choose large enough k, there are distinct strings w1, w2 of the form (**) that stand in the relation ≡R. Since for any string languages L1 and L2, ≡L1 ∩ L2 is a coarser equivalence relation than ≡L1 ∩ ≡L2, the strings w1, w2 must stand in the relation ≡D′4 ∩ R. Therefore, we cannot have D′4 ∩ R = τ(PT(G)) for any regular set R. Since τ(PT(G)) = D′4 ∩ R is implied by τ(PT(G)) = D4 ∩ R, we cannot have the latter, either, for any regular R.
Update January 18, 2012: A similar proof appears in Libkin 2006.
When is PT(G) super-regular? Chomsky's (1963) proposed sufficient condition was that G be a modified normal grammar in the sense that G is in Chomsky normal form and contains no pair of rules
A → B C
D → C E
for any nonterminals A, B, C, D, E. However, Chomsky's proof that (in our words) PT(G) is super-regular for any modified normal grammar G does not work. Adding a further condition that says that if
A → B C
A → D E
F → B E
are rules of the grammar, then so is
A → B E
guarantees that PT(G) is super-local. I have so far been unable to tell whether there is any modified normal grammar G for which PT(G) is not super-regular.