Open-Source Internship opportunity by OpenGenus for programmers. Apply now.
It is much easier to work with context-free grammars if the given CFG is its normal form. In this article we discuss the Chomsky Normal Form.
Table of contents.
- Introduction.
- Chomsky Normal Form.
- Summary.
- References.
Introduction.
Normally it is much easier to work with a context-free grammar if the given context-free grammar is in a normal form.
Normal forms include:
- Chomsky Normal Form, whereby productions are in the form of A → BC or A → a, where A, B and C are variables and a is a terminal symbol.
- Greibach Normal Form, where productions are in the form of A → aα, where α ∈ V* and A ∈ V.
Chomsky Normal Form.
Rules in a context free grammar G = (V,Σ, R, S) are of the form:
A → w where A is a variable and w is a string over the alphabet V ∪ Σ.
We will show that every context-free grammar G can be converted into a context-free grammar G′ such that L(G) =L(G′) and rules of G′ are of restricted form as defined below.
Definition 1: A context-free grammar G = (V,Σ, R, S) is said to be in chomsky normal form if every rule in R has one of the forms:
- A → BC where A, B and C are elements of V, B ≠S, and C ≠S.
- A → a where A is an element of V and a is also an element of Σ.
- S → ϵ where S is the start variable.
Now, we should convince ourselves that for such a grammar G, R contains the rule number 3 iff ϵ ∈ L(G)
Theorem 1: Let Σ be an alphabet and L ⊆ Σ* be a context-free language. Then there exists a context-free grammar in chomsky normal for whose language is L.
Proof: L is a context-free language and there exists a grammar G = (V,Σ, R, S), such that L(G) = L. Now we transform G to a grammar in chomsky normal form in five steps:
- We eliminate the start variable from the right side of the rules by defining G1= (V1,Σ, R1, S1), where S1 is the start variable, V1 = V ∪ {S1} and R1 = R ∪ {S1 → S}.
This grammar will have the following property:
- The start variable S1 doesn't occur on the right side of any rule in R1
- L(G1) =L(G)
- An ϵ-rule is a rule in the form of A → ϵ where A is a variable that is not equal to the start variable.
In this step we eliminate all ϵ-rules from G1
Now we consider rules one by one. Let A → ϵ be an example of a rule where A ∈ V1 and A ≠S1. We change G1 as follows:
- First we remove the rule A → ≠from the current set R1.
- Then, for each rule in the current set R1 in the form of:
(a). B → A, add rule B → ≠, unless the rule has been deleted from R1. In this way we replace the two-step derivation B ⇒ A ⇒ ≠by a one-step derivation B ⇒ â‰
(b). B → uAv, where u and v are strings, both of which are not empty. We add the rule B →uv to R1 now observe that we have replaced a two-step derivation B ⇒ uAv ⇒ uv with a single-step derivation B ⇒ uv.
(c). B → uAvAw, where u, v and w are all strings. We add the rules B → uvw, B → uAvw, and B → uvAw to R1. If u = v = w = ϵ and the rule B → ϵ has been deleted from R1, we don't add the rule B → ϵ
(d). We shall treat all rules where A occurs multiple times on the right-hand side in a similar way.
This process is repeated until all ϵ-rules are eliminated. Let R2 be the set of rules after all ϵ-rules are eliminated. We define G2 = (V2,Σ, R2, S2) where V2 = V1 and S2 = S1.
The properties of this grammar are:
- It's start variable S2 doesn't occur on the right-hand side of any rule in R2.
- R2 doesn't contain any ϵ-rule but may contain S2 → ϵ rule.
- L(G2) = L(G1) = L(G)
- A unit-rule is in the form A → B where A and B are variables. In this step we eliminate all unit rules from G2. To do this we consider all rules .
Let A → B be an example of a rule where A and B are elements of V2. Knowing that B ≠S2 we change G2 as follows:
- First we remove rule A → B from the current R2 set.
- For each rule in R2 set in the form of B → u where u ∈(V2 ∪ Σ)*, we add the rule A → u to the current set R2 unless it is a unit-rule that has been eliminated. This way, we replace a two-step derivation A ⇒ B ⇒ u with a one-step derivation A ⇒ u.
The process is repeated until all unit-rules are eliminated.
Let R3 be thes set of rules after all the rules have been eliminated. We therefore define G3 = (V3,Σ, R3, S3), where V3 = V2 and S3 = S2.
The above grammar has the following properties: - The start variable S3 doesn't occur on the right side of any rule in R3.
- R3 doesn't contain any ϵ-rule but may contain S3 → ϵ rule.
- R3 doesn't contain any unit-rule
- L(G3) = L(G2) =L(G1) = L(G)
- This step involves eliminating all rules with more than two symbols on the right side. For each rule in the set R3 in the form of A → u1u2, ..., uk,, where k ≥ 3 and each is an element of V3 ∪ Σ we change G3 as follows:
- First we remove the rule A → u1 u2, ..., from the current set R3.
- Next we add the following rules to set R3.
A → u1A1
A1 → u2A2
A2 → u3A3
.
.
.
Ak−3 → uk−2Ak−2
Ak−2 → uk−1uk
Where , , ..., are new variables added to the current set V3.
In this way we replace the one-step derivation A ⇒ u1 u2, ..., by the (k-1)-step derivation.
Let R4 and V4 be the set of rules and variables respectively, when we eliminate all rules with more than two symbols on the right. We define G4 = (V4,Σ, R4, S4), where S4 = S3.
The above grammar has the following properties:
- The start variable S4 doesn't occur on the right side of any rule in R4.
- R4 doesn't contain any ϵ-rule but mat contain S3 → ϵ rule.
- R4 doesn't contain any unit-rule.
- R4 doesn't contain any rule with more than two symbols on the right hand side.
- L(G4) = L(G3) = L(G2) = L(G1) = L(G)
- Now we eliminate all rules with the form A → where u1 and u2 are not both variables.
For each rule in the current set R4 in the form of A → where u1 and u2 are not both contained in V4 we modify G3 as follows:
- If u1 ∈ Σ and u2 ∈ V4, we replace the rule A → in the current R4 set by the two rules A → and where is a new variable added to the current set V4. In this way we replace the one-step derivation A ⇒ u1u2 with a two-step derivation A ⇒ U1u2 ⇒ u1u2.
- If u1 ∈ V4 and u2 ∈ Σ, we replace the rule A → u1u2 in the current set R4 by the two rules A → and where is a new variable added to the current V4 set. This way we replace a one-step derivation A ⇒ u1u2 with a two-step derivation A ⇒ u1U2 ⇒ u1u2
- If u1 ∈ Σ, u2 ∈ Σ, and u1 ≠u2, we replace A → in the current R4 set by the three rules, A → , → , and → where U1 and U2 are new variables added to the current V4 set. In this way we replace the one-step derivation A ⇒ u1u2 by a three-step derivation A ⇒ U1U2 ⇒ u1U2 ⇒ u1u2.
- If u1 ∈ Σ, u2 ∈ Σ, and u1 = u2, we replace the rule A → u1u2 = u1u1 in the current R4 set by the two rules A → and → . where is a new variable added to the current V4 set. In this way we replace a one-step derivation A ⇒ u1u2 = u1u1 with a three-step derivation A ⇒ U1U1 ⇒ u1U1 ⇒ u1u1.
Let R5 be the set of rules and V5 the set of variables after the above step is completed. We define G5 = (V5,Σ, R5, S5) where S5 = S4.
The above grammar has the following properties:
- The start variable S5 doesn't occur on the right side of any rule in R5.
- R5 doesn't have any ϵ-rule but may have S5 → ϵ rule.
- R5 doesn't have any unit-rule.
- R5 doesn't have any rule with more than two symbols on the right side.
- R5 doesn't have any rule in the form of where and are not both V5 variables.
- L(G5) = L(G4) = L(G3) = L(G2) = L(G1) = L(G).
And since the G5 grammar is in Chomsky normal form we have completed our proof.
Summary.
It is easier to work with CFGs if a given CFG is in its normal form, following this we have discussed the Chomsky normal form.
Chomsky Normal Form is a form whereby productions are in the form of A → BC or A → a, where A, B and C are variables and a is a terminal symbol.