Get this book -> Problems on Array: For Interviews and Competitive Programming

It is much easier to work with context-free grammars if the given CFG is its normal form. In this article we discuss the *Chomsky Normal Form*.

### Table of contents.

- Introduction.
- Chomsky Normal Form.
- Summary.
- References.

## Introduction.

Normally it is much easier to work with a context-free grammar if the given context-free grammar is in a normal form.

Normal forms include:

*Chomsky Normal Form*, whereby productions are in the form of*A â†’ BC*or*A â†’ a*, where*A*,*B*and*C*are variables and*a*is a terminal symbol.*Greibach Normal Form*, where productions are in the form of*A â†’ aÎ±*, where*Î± âˆˆ V**and*A âˆˆ V*.

## Chomsky Normal Form.

Rules in a context free grammar *G = (V,Î£, R, S)* are of the form:

*A â†’ w* where *A* is a variable and *w* is a string over the alphabet *V âˆª Î£*.

We will show that every context-free grammar *G* can be converted into a context-free grammar *Gâ€²* such that *L(G) =L(Gâ€²)* and rules of *Gâ€²* are of restricted form as defined below.

**Definition 1:** A context-free grammar *G = (V,Î£, R, S)* is said to be in *chomsky normal form* if every rule in *R* has one of the forms:

*A â†’ BC*where*A*,*B*and*C*are elements of*V*,*B â‰ S*, and*C â‰ S*.*A â†’ a*where*A*is an element of*V*and*a*is also an element of*Î£*.*S â†’ Ïµ*where*S*is the start variable.

Now, we should convince ourselves that for such a grammar *G*, *R* contains the rule number *3* iff *Ïµ âˆˆ L(G)*

**Theorem 1:** Let *Î£* be an alphabet and *L âŠ† Î£** be a context-free language. Then there exists a context-free grammar in chomsky normal for whose language is *L*.

**Proof:** *L* is a context-free language and there exists a grammar *G = (V,Î£, R, S)*, such that *L(G) = L*. Now we transform *G* to a grammar in chomsky normal form in five steps:

- We eliminate the start variable from the right side of the rules by defining
*G1= (V1,Î£, R1, S1),*where*S1*is the start variable,*V1 = V âˆª {S1}*and*R1 = R âˆª {S1 â†’ S}*.

This grammar will have the following property:

- The start variable
*S1*doesn't occur on the right side of any rule in*R1* - L(G1) =L(G)

- An
*Ïµ-rule*is a rule in the form of*A â†’ Ïµ*where*A*is a variable that is not equal to the start variable.

In this step we eliminate all*Ïµ-rules*from*G1*

Now we consider rules one by one. Let*A â†’ Ïµ*be an example of a rule where*A âˆˆ V1*and*A â‰ S1*. We change G1 as follows:

- First we remove the rule
*A â†’ â‰*from the current set*R1*. - Then, for each rule in the current set R1 in the form of:

(a).*B â†’ A*, add rule*B â†’ â‰*, unless the rule has been deleted from*R1*. In this way we replace the two-step derivation*B â‡’ A â‡’ â‰*by a one-step derivation*B â‡’ â‰*

(b).*B â†’ uAv*, where*u*and*v*are strings, both of which are not empty. We add the rule*B â†’uv to R1*now observe that we have replaced a two-step derivation*B â‡’ uAv â‡’ uv*with a single-step derivation*B â‡’ uv*.

(c).*B â†’ uAvAw*, where*u*,*v*and*w*are all strings. We add the rules*B â†’ uvw*,*B â†’ uAvw*, and*B â†’ uvAw*to*R1*. If*u = v = w = Ïµ*and the rule*B â†’ Ïµ*has been deleted from*R1*, we don't add the rule*B â†’ Ïµ*

(d). We shall treat all rules where*A*occurs multiple times on the right-hand side in a similar way.

This process is repeated until all *Ïµ-rules* are eliminated. Let R2 be the set of rules after all *Ïµ-rules* are eliminated. We define *G2 = (V2,Î£, R2, S2)* where *V2 = V1* and *S2 = S1*.

The properties of this grammar are:

- It's start variable
*S2*doesn't occur on the right-hand side of any rule in*R2*. *R2*doesn't contain any*Ïµ-rule*but may contain*S2 â†’ Ïµ*rule.*L(G2) = L(G1) = L(G)*

- A
*unit-rule*is in the form*A â†’ B*where*A*and*B*are variables. In this step we eliminate all unit rules from*G2*. To do this we consider all rules .

Let*A â†’ B*be an example of a rule where*A*and*B*are elements of*V2*. Knowing that*B â‰ S2*we change*G2*as follows:

- First we remove rule
*A â†’ B*from the current*R2*set. - For each rule in
*R2*set in the form of*B â†’ u*where*u âˆˆ(V2 âˆª Î£)**, we add the rule*A â†’ u*to the current set*R2*unless it is a unit-rule that has been eliminated. This way, we replace a two-step derivation*A â‡’ B â‡’ u*with a*one-step*derivation*A â‡’ u*.

The process is repeated until all unit-rules are eliminated.

Let*R3*be thes set of rules after all the rules have been eliminated. We therefore define*G3 = (V3,Î£, R3, S3)*, where*V3 = V2*and*S3 = S2*.

The above grammar has the following properties: - The start variable
*S3*doesn't occur on the right side of any rule in*R3*. *R3*doesn't contain any*Ïµ-rule*but may contain*S3 â†’ Ïµ*rule.*R3*doesn't contain any*unit-rule**L(G3) = L(G2) =L(G1) = L(G)*

- This step involves eliminating all rules with more than two symbols on the right side. For each rule in the set
*R3*in the form of*A â†’ u1u2, ..., uk,*, where k â‰¥ 3 and each*${\mathrm{u}}_{\mathrm{i}}$*is an element of*V3 âˆª Î£*we change*G3*as follows:

- First we remove the rule
*A â†’ u1 u2, ..., ${\mathrm{u}}_{\mathrm{i}}$*from the current set*R3*. - Next we add the following rules to set
*R3*.

*A â†’ u1A1
A1 â†’ u2A2
A2 â†’ u3A3
.
.
.
Akâˆ’3 â†’ ukâˆ’2Akâˆ’2
Akâˆ’2 â†’ ukâˆ’1uk*

Where *${\mathrm{A}}_{1}$, ${\mathrm{A}}_{2}$, ..., ${\mathrm{A}}_{\mathrm{k-2}}$* are new variables added to the current set *V3*.

In this way we replace the one-step derivation *A â‡’ u1 u2, ..., ${\mathrm{u}}_{\mathrm{i}}$* by the *(k-1)-step derivation*.

*$A\xe2\u2021\u2019{u}_{1}{A}_{1}\xe2\u2021\u2019{u}_{1}{u}_{2}{A}_{2}\xe2\u2021\u2019...\xe2\u2021\u2019{u}_{1}{u}_{2}...{u}_{k-2}{A}_{k-2}\xe2\u2021\u2019{u}_{1}{u}_{2}...{u}_{k}$*

Let *R4* and *V4* be the set of rules and variables respectively, when we eliminate all rules with more than two symbols on the right. We define *G4 = (V4,Î£, R4, S4)*, where *S4 = S3*.

The above grammar has the following properties:

- The start variable
*S4*doesn't occur on the right side of any rule in*R4*. *R4*doesn't contain any*Ïµ-rule*but mat contain*S3 â†’ Ïµ*rule.*R4*doesn't contain any unit-rule.*R4*doesn't contain any rule with more than two symbols on the right hand side.*L(G4) = L(G3) = L(G2) = L(G1) = L(G)*

- Now we eliminate all rules with the form
*A â†’ ${u}_{1}{u}_{2}$*where*u1*and*u2*are not both variables.

For each rule in the current set*R4*in the form of*A â†’ ${u}_{1}{u}_{2}$*where*u1*and*u2*are not both contained in*V4*we modify*G3*as follows:

- If
*u1 âˆˆ Î£*and*u2 âˆˆ V4*, we replace the rule*A â†’ ${u}_{1}{u}_{2}$*in the current*R4*set by the two rules*A â†’ ${U}_{1}{u}_{2}$*and*${U}_{1}\xe2\u2020\u2019{u}_{1}$*where*${U}_{1}$*is a new variable added to the current set*V4*. In this way we replace the one-step derivation*A â‡’ u1u2*with a two-step derivation*A â‡’ U1u2 â‡’ u1u2*. - If
*u1 âˆˆ V4*and*u2 âˆˆ Î£*, we replace the rule*A â†’ u1u2*in the current set*R4*by the two rules*A â†’ ${u}_{1}{U}_{2}$*and*${u}_{2}\xe2\u2020\u2019{u}_{2}$*where*${U}_{2}$*is a new variable added to the current*V4*set. This way we replace a one-step derivation*A â‡’ u1u2*with a two-step derivation*A â‡’ u1U2 â‡’ u1u2* - If
*u1 âˆˆ Î£*,*u2 âˆˆ Î£*, and*u1 â‰ u2*, we replace*A â†’ ${u}_{1}{u}_{2}$*in the current*R4*set by the three rules,*A â†’ ${U}_{1}$${U}_{2}$, ${U}_{1}$ â†’ ${u}_{2}$*, and*${U}_{2}$ â†’ ${u}_{2}$*where U1 and U2 are new variables added to the current*V4*set. In this way we replace the one-step derivation*A â‡’ u1u2*by a three-step derivation*A â‡’ U1U2 â‡’ u1U2 â‡’ u1u2*. - If
*u1 âˆˆ Î£*,*u2 âˆˆ Î£*, and*u1 = u2*, we replace the rule*A â†’ u1u2 = u1u1*in the current*R4*set by the two rules*A â†’ ${U}_{1}$${U}_{1}$*and*${U}_{1}$ â†’ ${u}_{1}$*. where ${U}_{1}$ is a new variable added to the current*V4*set. In this way we replace a one-step derivation*A â‡’ u1u2 = u1u1*with a three-step derivation*A â‡’ U1U1 â‡’ u1U1 â‡’ u1u1*.

Let *R5* be the set of rules and *V5* the set of variables after the above step is completed. We define *G5 = (V5,Î£, R5, S5)* where *S5 = S4*.

The above grammar has the following properties:

- The start variable
*S5*doesn't occur on the right side of any rule in*R5*. *R5*doesn't have any*Ïµ-rule*but may have*S5 â†’ Ïµ*rule.*R5*doesn't have any unit-rule.*R5*doesn't have any rule with more than two symbols on the right side.*R5*doesn't have any rule in the form of*$A\xe2\u2020\u2019{u}_{1}{u}_{2}$*where*${u}_{1}$*and*${u}_{2}$*are not both*V5*variables.*L(G5) = L(G4) = L(G3) = L(G2) = L(G1) = L(G)*.

And since the *G5* grammar is in *Chomsky normal form* we have completed our proof.

## Summary.

It is easier to work with CFGs if a given CFG is in its normal form, following this we have discussed the *Chomsky normal form*.

*Chomsky Normal Form* is a form whereby productions are in the form of *A â†’ BC* or *A â†’ a*, where *A*, *B* and *C* are variables and *a* is a terminal symbol.