Models of Computation(4) - Context-free Grammars (CFG)

12 minute read

Limitations of Regular Expressions

Regular expressions are useful for basic pattern matching. However, they cannot recognize the set of all Python programs nor all propositional formulas!

Grammars in a nutshell

  • A grammar is a set of rules which can be used to derive strings, and therefore languages. The rules are a recursive description of the strings.
  • Grammars naturally describe the hierarchical structure of most programming languages.
  • Grammars also form the basis for translating between different representations of programs.

Context-free grammars

Context-free grammars(CFG) is a mechanism for describing languages.

Program Syntax

statements: statement+
statement: compound_stmt  \mid  simple_stmt

Our Syntax

$S \rightarrow TS$

$T \rightarrow c \mid d$

Let us look at an example.

$S \rightarrow 0S1$

$S \rightarrow T$

$T \rightarrow a$

To derive a string:

  1. Write down the start variable. It is the variable on the LHS of the top rule, unless specified otherwise.
  2. Find a variable that is written down and a rule that starts with that variable. Replace the written down variable with the RHS of that rule.
  3. Repeat step 2 until no variables remain.

NOTE: $S, T$ are variables and $0, 1, a$ are terminals.

\[\begin{align} S &\Rightarrow 0S1 \\ &\Rightarrow 0T1 \\ &\Rightarrow 0a1 \end{align}\] \[\begin{align} S &\Rightarrow 0S1 \\ &\Rightarrow 00S11 \\ &\Rightarrow 000S111 \\ &\Rightarrow 000T111 \\ &\Rightarrow 000a111 \end{align}\]

These can be examples of string derivation.

Anatomy of a grammar

  • Variables (or Non-terminals)
    • These get rewritten to generate strings
  • Start variable
  • Alphabet
    • Consisting of terminal symbols making up strings in the language being generated
  • Rules of the form $A \rightarrow u$ where $A$ is a variable and $u$ is a string of variables and alphabet symbols.

A variable can have many rules:

\[S \rightarrow 0S1 \\ s \rightarrow T\]

which can be written together with the ‘or’ operator,

$ S \rightarrow 0S1 \mid T$

Example.

  • Variable $S, T$
  • Alphabet symbols $0, 1, a$
  • Start variable $S$
  • Rules: \(S \rightarrow 0S1 \mid T \\ T \rightarrow a\)

$S \Rightarrow 0S1 \Rightarrow 00S11 \Rightarrow 00T11 \Rightarrow 00a11$
We can derive the string $00a11$.

Another Example.

  • Variables $E$
  • Alphabet symbols $a, b, c, -$
  • Start variable $E$
  • Rules: $ E \rightarrow E - E \mid a \mid b \mid c$

We can derive the following strings.

\[E \Rightarrow a \Rightarrow b \Rightarrow c\] \[\begin{aligned} E &\Rightarrow E - E \\ &\Rightarrow a - E \\ &\Rightarrow a - a \end{aligned}\] \[\begin{aligned} E &\Rightarrow E - E \\ &\Rightarrow E - E - E \\ &\Rightarrow E - E - a \\ &\Rightarrow b - E - a \\ &\Rightarrow b - c - a \end{aligned}\]

Notation

  • $A, B, C, \ldots \text{ and } S$ are variables.
  • $S$ is the start variable.
  • $a, b, c, \ldots$ are terminals.
  • $u, v, w, \ldots$ are strings of terminals and variables. This can include the empty string $\epsilon$.
  • Often we just write the rules.
  • The start variable is usually $S$, or the first one listed.

Formal Definition: Syntax

Definition A context-free grammar (CFG) $G$ consists of four itmes $(V, \Sigma, R, S)$ where:

  • $V$ is a finite set of variables.
  • $\sigma$ is a finite set, disjoint from $V$, of terminals.
  • $R$ is a finite set of rules in the form $A \rightarrow v$ where $A \in V$ and $v$ is a string (possibly empty) over the alphabet $V \cup \Sigma$.
  • $S \in V$ is a special variable called the start variable.

Formal Definition: Semantics

Definition A derivation is a sequence of steps where variables are replaced by the right-hand side of a rule.

  • One step is written $\Rightarrow$
    • This is the “Rightarrow” symbol, and in this context is read “yields”
    • eg. $0S1 \Rightarrow 00S11$ is read “$0S1$ yields $00S11$”.
  • Zero or more steps is written $\Rightarrow^{*}$.
    • This is read “derives”
    • eg. “$S$ derives $000S111$”.
  • The language generated by $G$ is the set $L(G)$ of strings over the terminal alphabet $\Sigma$ that are derived from the start variable $S$.
\[L(G) = \{u \in \Sigma^{*} : S \Rightarrow^{*} u\}\]

More Examples

Which natural language description applies to the language generated by the following grammar?

\[S \rightarrow 0S1 \mid T \\ T \rightarrow a\]

=> All strings over alphabet ${0, 1, a}$ of the form $0^na1^n \text{ where } n \geq 0$. Note that it is not $0^{}a1^{}$ as this could generate $0a11$ which is not desired.

Write a natural language description of the language of the following grammar: $E \rightarrow E + E \mid 0 \mid 1$

=> All binary strings over the alphabet ${0, 1}$

Some thoughts on designing CFGs

  1. Think recursively!
    • How can a string in the language be built from smaller strings in the language?
    • Make sure you cover all cases.
  2. Variables generate substrings with similar properties.
    • Think of the variables as storing information, or as having meaning.

Designing CFGs

Q. Design a grammar that generates the language of binary strings that are palindromes.

Think recursively

  1. Base case: $0, 1, \epsilon$ are palindromes.
  2. Recursive case: if $u$ is a palindrome, then $0u0 \text{ and } 1u1$ are palindromes.
    • Why are there no other cases?

NOTE: $\epsilon \notin \Sigma$ - it is the empty string. $\Sigma$ is the alphabet.

So, here is a grammar:

\[S \rightarrow 0 \mid 1 \mid \epsilon \\ S \rightarrow 0S0 \mid 1S1\]

So this is a grammar $G$ and the language of $G$ is, \(L(G) = \{u \in \{0, 1\}^{*} : u \text{ is a palindrome}\}\)

Q. Design a grammar that generates the language of binary strings with the same number of 0’s and 1’s.

  1. Base case: $\epsilon$ has the same number of 0’s and 1’s. i.e. none.
  2. Recursive case: if $u, v$ have the same number of 0’s and 1’s, then so do $0u1v \text{ and } 1u0v$.
    • Why are there no other cases?

Here is a grammar:

\[S \rightarrow \epsilon \\ S \rightarrow 0S1S \mid 1S0S\]

Q. Design a grammar that generates the language of binary strings of the form $0^n1^m0^n \text{ for } n, m \geq 0$.

Variables generate substrings with similar properties.

Let $V = {S, X}$. We already know how to generate the outermost 0’s on both ends from previous examples. But we need the extra variable $X$ for generating the substring of 1’s. i.e. $X$ generates the language $L(1^{*})$.

So, here is a grammar:

\[S \rightarrow 0S0 \mid X \\ X \rightarrow 1X \mid \epsilon\]

NOTE: The variable $X$ generates the language $L(1^{*})$.

i.e. \(X \Rightarrow 1X \\ \Rightarrow 11X \\ \Rightarrow 111X \\ \Rightarrow 111\epsilon = 111\)

Language of a CFG

Consider giving a grammar for the set of strings over alphabet symbols ‘(‘ and ‘)’ in which the parentheses are well-balanced.

So, $\Sigma = {(, )}$

This is probably the single most important example of a CFG because arbitrary expressions, programming languages usually require balanced parantheses.

Context-Free Languages

Definition A language is context-free if it is generated by a CFG.

Easy Facts

  • The union of two CFL is also context-free. Why? Say a grammar $G_{1}$ and $G_{2}$ each generate some language. Then, we can find a grammar $G$ the generates another language, i.e. the union, $L(G_{1}) \cup L(G_{2}) = L(G)$.

How? Just add a new rule $S \rightarrow S_{1} \mid S_{2} \text{ where } S_{i}$ is the start symbol of grammar $i$.

  • The concatenation of two CFL is also context-free.

Why? Just add a new rule $S \rightarrow S_{1}S_{2}$.

  • The star(closure) of a CFL is also context-free.

Why? Just add a new rule $S \rightarrow SS_{1} \mid \epsilon$.

Examples

  • Show that $L(a^{*} \cup b^{*})$ is context-free.

Note that this is the language $L(a^{*}) \cup L(b^{*})$.

Also, note that $L(a^{*}) = L(a)^{*}$.

A grammar for $L(a)$ is simply,

\[S \rightarrow a\]

for the star, we use the fact from above,

\[S \rightarrow SS_{1} \mid \epsilon \\ S_{1} \rightarrow a\]

which simplifies to,

\[S \rightarrow Sa \mid \epsilon\]

So a grammar for $L(a^{*}) \text{ and } L(b^{*})$:

\[S_{1} \rightarrow S_{1}a \mid \epsilon \\ S_{2} \rightarrow S_{2}b \mid \epsilon\]

So here is a grammar for $L(a^{*}) \cup L(b^{*})$:

\[S \rightarrow S_{1} \mid S_{2} \\ S_{1} \rightarrow S_{1}a \mid \epsilon \\ S_{2} \rightarrow S_{2}b \mid \epsilon\]
  • Show that $L(a^{*}b^{*})$ is context-free.

Note that this is the language $L(a^{*})L(b^{*})$.

Similar to above, here is a grammar for $L(a^{*})L(b^{*})$.

\[S \rightarrow S_{1}S_{2} \\ S_{1} \rightarrow S_{1}a \mid \epsilon \\ S_{2} \rightarrow S_{2}b \mid \epsilon\]
  • Show that $L((aa \mid bb)^{*})$ is context-free.

Note that this is the language $L(aa \mid bb)^{*}$.

Here is a grammar for $L(aa \mid bb)$:

$S_{1} \rightarrow aa \mid bb$

So here is a grammar for $L(aa \mid bb)^{*}$:

\[S \rightarrow S_{1}S \mid \epsilon \\ S_{1} \rightarrow aa \mid bb\]

NOTE that CFGs are not allowed to mention the star operation(*) nor strictly speaking, the operation | (this is just a shorthand for us). i.e. \(S \rightarrow SS_{1} \mid \epsilon\) is just a short way of writing \(S \rightarrow SS_{1} \\ S \rightarrow \epsilon\)

Parsing

The problem of parsing is determining how the grammar derives a given string.

A parse tree (or derivation tree) is a tree labelled by variables and terminal symbols of the CFG.

  • The root is labeled by the start variable.
  • Each interior node is labeled by a variable.
  • Each leaf node is labeled by a terminal or $\epsilon$.
  • The children of a node labeled $X$ are labeled by the RHS of a rule $X \rightarrow x$, in order.

For example, the parse tree for $0011 \text{ in } S \rightarrow 0S1 \mid 01$ is as follows:

–ADD IMAGE–

A traversal of the leaf nodes retrieves the string.

The parse tree gives the “meaning” of a string.

–ADD IMAGE–

\[S \rightarrow S - S S \rightarrow x \mid y \mid z\]

This parse tree says that the expression means $x - (y - z) \text{ not } ((x - y) - z)$. The string doesn’t tell you the meaning, but it is captured in the parse tree.

In this case, $V = {S}$ and $\Sigma = {x, y, z -}$ The parse tree gets split up into three subtrees, and if we look at the string of each leaf nodes of the subtrees, we can get $x - (y - z)$. Note that ‘-‘ is considered just a symbol normally in CFG but if it was used in context of arithemetics, this would have the ‘minus’ meaning where this meaning is captured in the parse tree.

Natural Language Processing (NLP)

\[\begin{align} & \langle Sentence \rangle \rightarrow \langle NounPhrase \rangle\;\langle VerbPhrase \rangle \\ & \langle NounPhrase \rangle \rightarrow \langle ComplexNoun \rangle \\ & \langle NounPhrase \rangle \rightarrow \langle ComplexNoun \rangle\;\langle PrepPhrase \rangle \\ & \langle VerbPhrase \rangle \rightarrow \langle ComplexVerb \rangle \mid \langle ComplexVerb \rangle\;\langle PrepPhrase \rangle \\ & \langle PrepPhrase \rangle \rightarrow \langle Prep \rangle\;\langle ComplexNoun \rangle \\ & \langle ComplexNoun \rangle \rightarrow \langle Article \rangle\;\langle Noun \rangle \\ & \langle ComplexVerb \rangle \rightarrow \langle Verb \rangle \mid \langle Verb \rangle\;\langle NounPhrase \rangle \\ & \langle Article \rangle \rightarrow \text{a} \mid \text{the} \\ & \langle Noun \rangle \rightarrow \text{boy} \mid \text{dog} \mid \text{ball} \mid \text{stick} \\ & \langle Verb \rangle \rightarrow \text{walks} \mid \text{runs} \\ & \langle Prep \rangle \rightarrow \text{with} \end{align}\]
  • Terminals are the lower-case English alphabet.
  • For variables, we used $\langle Noun \rangle$, etc, for readability.

Now comes the ambiguity though… The string “the boy walks the dog with the stick” can be derived in this grammar. But it has (at least) two parse-trees depending on who has the stick.

First parse tree:

–ADD IMAGE–

In this case, the dog has the stick.

Second parse tree:

–ADD IMAGE–

In this case, the boy has the stick.

Ambiguous grammars and strings

Definition A string is ambiguous on a given grammar if it has at least two different parse trees. A grammar is ambiguous if it derives an ambiguous string.

So, we can say that the previous grammar is ambiguous.

Is there a way to see if a string is ambiguous without drawing parse trees?

  • A derivation is called leftmost if it always derives the leftmost variable first.
  • Each parse tree corresponds to one leftmost derivation.
  • So, a string is ambiguous if it has at least two leftmost derivations.
  • The same two statements hold with rightmost instead of leftmost.

So, “The boy walks the dog with a stick” has two leftmost derivations.

\[\begin{align} \langle Sentence \rangle &\Rightarrow \text{the boy } \langle VerbPhrase \rangle \\ &\Rightarrow \text{the boy } \langle Verb \rangle\;\langle NounPhrase \rangle \\ &\Rightarrow \text{the boy chases the dog with a stick} \end{align}\] \[\begin{align} \langle Sentence \rangle &\Rightarrow \text{the boy } \langle VerbPhrase \rangle \\ &\Rightarrow \text{the boy } \langle ComplexVerb \rangle\;\langle PrepPhrase \rangle \\ &\Rightarrow \text{the boy chases the dog with a stick} \end{align}\]

Consider the following example. Is the grammar ambiguous?

\[\begin{align} E &\rightarrow E - E \\ E &\rightarrow a \mid b \mid c \end{align}\]

Rightmost derivations of $a - b - c$:

\[\begin{align} E &\Rightarrow E - E \\ &\Rightarrow E - c \\ &\Rightarrow E - E - c \\ &\Rightarrow E - b - c \\ &\Rightarrow a - b - c \end{align}\] \[\begin{align} E &\Rightarrow E - E \\ &\Rightarrow E - E - E \\ &\Rightarrow E - E - c \\ &\Rightarrow E - b - c \\ &\Rightarrow a - b - c \end{align}\]

We have found two rightmost derivations, so the string $a - b - c$ is ambiguous, and thus grammar is also ambiguous.

The respective parse trees are:

–ADD IMAGE–

–ADD IMAGE–

One derives $(a - b) - c$ and the other derives $a - (b - c)$.

NOTE: Try deriving the string $a - b$. There is only one rightmost derivation, hence it is a non-ambiguous string.

Now, suppose we want $a - b - c$ to always mean $(a - b) - c$. Introduce a new non-terminal symbol $T$:

\[\begin{align} E &\rightarrow E - T \mid T \\ T &\rightarrow a \mid b \mid c \end{align}\]

Now, the only rightmost derivation of $a - b - c$ is:

\[\begin{align} E &\Rightarrow E - T \\ &\Rightarrow E - c \\ &\Rightarrow E - T - c \\ &\Rightarrow E - b - c \\ &\Rightarrow T - b - c \\ &\Rightarrow a - b - c \end{align}\]

–ADD IMAGE–

Why are they called “context-free”?

The Chomsky Hierarchy consists of 4 classes of grammars, depending on the type of production rules that they allow.

Type 0 (recursively enumerable) $z \rightarrow v$

Type 1 (context-sensitive) $uAv \rightarrow uzv$

Type 2 (context-free) $A \rightarrow z$

Type 3 (regular) $A \rightarrow aB \text{ and } A \rightarrow a$

given that $u, v, z$ are string of variables and terminals and $z$ is not empty.

–ADD IMAGE–

Leave a comment