×

Search anything:

# Regular Expression in Theory of Computation

#### Theory of Computation

Get this book -> Problems on Array: For Interviews and Competitive Programming

Regular expressions are used to describe languages, in this article we discuss regular expressions and show that every language can be described by a regular expression.

1. Introduction.
2. Regular Expressions.
3. Equivalence of regular expressions and regular languages.
4. Summary.
5. References.

## Introduction.

Regular expressions are a means to describe languages, they define a finite pattern of strings or symbols.
Each pattern will match a set of strings and therefore they serve as names for a set of strings.

## Regular Expressions.

To get a idea about what regular expressions involve, let's go through a few examples:

Example 1:
Consider the expression (0 âˆª 1)01*.
The language that is described by the above expression is the set of all binary strings,

• that start with 0 or 1 as indicated by 0 âˆª 1,
• for which the second symbol is 0,
• that end with zero or more 1s as indicated by 1*.

A sample language is such as the following;
{00, 001, 0011, 00111, ..., 10, 101, 1011, 10111, ...}

Example 2:
The expression 1*01*01* defines the language:
{w : w contains exactly two 0s}

Example 3: The expression 1011 âˆª 0 describes the language {1011, 0}.

Example 4: The expression ((0 âˆª 1)(0 âˆª 1))* describes the language {w : the length of w is even}

Example 5: The expression (0 âˆª 1)*1011(0 âˆª 1)* describes the language {w : 1011 is a substring of w}

Example 6: The expression (0 âˆª 1)*0(0 âˆª 1)*0(0 âˆª 1)* describes the language {w : w contains at least two 0s}

Definition 1: Let Î£ be a non-empty alphabet.

1. Ïµ is a regular expression.
2. âˆ… is a regular expression.
3. For each a âˆˆ Î£, a is a regular expression.
4. If R1 and R2 are regular expressions, then R1 âˆª R2 is also a regular expression.
5. If R1 and R2 are regular expressions, then R1R2 is also a regular expression.
6. If R, then R* is also a regular expression.

From the above listing, 1-3 are regarded as the building blocks of regular expressions while 3-6 define rules used to combine regular expressions into new/larger regular expressions.

An example: We claim that (0 âˆª 1)*101(0 âˆª 1)* is a regular expression where the alphabet Î£ is {0, 1}.
Proof: We have to show that the expression can be built using the rules defined above.

• By rule 3, 0 and 1 are regular expressions.
• By rule 4, 0 âˆª 1 is a regular expression since both 0 and 1 are regular expressions.
• By rule 5, 10 and 101 are both regular expressions.
• By rule 5, (0 âˆª 1)*101 is a regular expression since (0âˆª1)* and 101 are also regular expressions.
• By rule 5, (0 âˆª 1)*101(0 âˆª 1)* is a regular expression since (0âˆª1)*101 and (0âˆª1)* are both regular expressions.
• By rule 6, (0 âˆª 1)* is a regular expression.

Now we define the language that is described by the above regular expressions.

1. Ïµ describes the language {Ïµ}.
2. âˆ… describes the language {âˆ…}.
3. For each a âˆˆ Î£, a describes the language {a}.
4. Let R1 and R2 be regular expressions and L1 and L2 be the languages described by these expressions respectively. The regular expressions R1 âˆª R2 describes the language L1 âˆª L2.
5. Also assume the above for the languages L1 and L2 and the expressions R1 and R2, The regular expression R1R2 also describes the language L1L2.
6. Let R be a regular expression and L be the language described by it. The regular expression R* describes the language L*.

Example 1:
Regular expression 0 âˆª Ïµ describes the language {0, Ïµ} whereas the expression 1* describes the language {Ïµ, 1, 11, 111, ..., } and therefore, the regular expression (0 âˆª Ïµ)1* describes the language {0, 01, 011, 0111, ..., Ïµ, 1, 11, 111, ...}.

Example 2: Regular expression (0 âˆª Ïµ)(1 âˆª Ïµ) describes the language {01, 0, 1, Ïµ}.

Definition 2: Let R1 and R2 be regular expressions and L1 and L2 be languages described by the expressions, then if L1 = L2, that is, R1 and R1 describe the same language, then we can write R1 = R2.
Therefore, even if (0 âˆª Ïµ)1* and 01* âˆª 1* are different regular expressions, we can write,
(0 âˆª Ïµ)1* = 01* âˆª 1*
because they describe the same language.

Theorem: Let R1, R2, R3 be regular expressions. The following identities hold.

1. R1âˆ… = âˆ…R1 = âˆ….
2. R1Ïµ = ÏµR1 = R1.
3. R1 âˆª âˆ… = âˆ… âˆª R1 = R1.
4. R1 âˆª R1 = R1.
5. R1 âˆª R2 = R2 âˆª R1.
6. R1(R2 âˆª R3) = R1R2 âˆª R1R3.
7. (R1 âˆª R2)R3 = R1R3 âˆª R2R3.
8. R1(R2R3) = (R1R2)R3.
9. âˆ…* = Ïµ.
10. Ïµ* = Ïµ.
11. (Ïµ âˆª R1)* = R1*.
12. (Ïµ âˆª R1)(Ïµ âˆª R1)* = R1*.
13. R1*(Ïµ âˆª R1) = (Ïµ âˆª R1)R1*=R1*.
14. R1*R2 âˆª R2 = R1*R2.
15. R1(R2R1)* = (R1R2)*R1.
16. (R1 âˆª R2)*= (R1*R2)*R1* = (R2*R1)*R2*.

We shall not present proof for the above identities however we assume they are verified.
For example: We can verify the identity (0 âˆª Ïµ)1* = 01* âˆª 1* as follows,

• By identity 7: (0 âˆª Ïµ)1* = 01* âˆª Ïµ1*
• By identity 2: (0 âˆª Ïµ)1* = 01* âˆª 1*

## Equivalence of regular expressions and regular languages.

Theorem: Let L be a language, then L is regular iff there exists a regular expression describing L.

The proof of this theorem entails two parts:

1. First we will prove that every regular expression describes a regular language.
2. Second, we prove that every DFA M can be converted to a regular expression describing a language L(M).

## 1. Every regular expression describes a regular language

Let R be an arbitrary regular expression over the alphabet Î£.
We need to prove that the language that is described by R is a regular language.
We prove using induction on the structure of R, that is, by induction on the way R is built using the rules we defined earlier(definition 1).

Base case 1: Assuming R = Ïµ, then R describes the language {Ïµ} and to prove that it is regular we shall construct a NFA M = (Q,Î£, Î´, q, F) that accepts this language.

We obtain this NFA by defining
Q = {q}, q the start state, F = {q} and Î´(q, a) = âˆ… for all a âˆˆ Î£Ïµ
State diagram of M:

Base case 2: Assuming R = âˆ…, then R describes the language âˆ… and to prove that it is regular, we construct an NFA M = (Q,Î£, Î´, q, F) that accepts it. We obtain this NFA by defining Q = {q}, q the start state, F = âˆ…, and Î´(q, a) = âˆ… for all a âˆˆ Î£Ïµ.
The state diagram of M:

Base case 3: Let a âˆˆ Î£, assume R = a, then R describes the language {a} and to prove that it is regular we construct a NFA M= (Q,Î£, Î´, q1, F) that accepts it. The NFA is obtained by defining Q = {q1, q2}, q1 is the start state, F = {q2}, and
Î´(q1, a) = {q2},
Î´(q1, b) = âˆ… for all b âˆˆ Î£Ïµ \ {a},
Î´(q2, b) = âˆ… for all b âˆˆ Î£Ïµ.
State diagram of M is shown below.

Induction step 1 of base case 1: Assuming R = R1 âˆª R2 where R1 and R2 are regular expressions. Let L1 and L2 be regular languages described by R1 and R2 respectively, then R describes the language L1 âˆª L2 which is also regular by Theorem 1.1 in the prerequisite article.

Induction step of base case 2: Assuming that R = R1R2 where R1 and R2 are regular expressions. Let L1 and L2 be two regular languages described by the regular expressions respectively, then R describes the language L1L2 which by Theorem 4 in the prerequisite article is regular.

Induction step of base case 3: Assuming that R = (R1)* where R1 is regular expression. Let L1 be the language described by R1 and assuming that L1 is regular, then R describes the language (L1)* which by Theorem 5 in the prerequisite article is regular.

At this point we conclude the proof of the claim that every regular expression describes a regular language.

An example:
Consider the regular expression: (ab âˆª a)*, where {a, b} is the alphabet. We construct a NFA that accepts the language described by the regular expression to prove that the regular expression describes a regular language.
Lets first build a regular expression:

• First we take the regular expressions a and b and combine them to a regular expression ab.
• Next, we take the regular expressions ab and a and combine them into regular expression ab âˆª a.
• We take regular expression ab âˆª a and transform it into the regular expression (ab âˆª a)*.

Now to construct an NFA:

• We construct NFA M1 that accepts the language described by regular expression a:

• We construct NFA M2 that accepts the language described by regular expression b:

• Apply the construction from the proof of Theorem 4(prerequisite article) to M1 and M2 which gives NFA M3 that accepts the language described by the regular expression ab:

• Apply the construction from proof of Theorem 1.1:(prerequisite article) to M1 and M3 which gives NFA M4 that accepts the language described by regular expression ab âˆª a:

• Apply construction from proof of Theorem 5(prerequisite article) to M4 which gives NFA M5 that accepts language described by regular expression (ab âˆª a)*

## 2. Converting a DFA to a regular expression.

We will prove that every DFA M can be converted into a regular expression describing a language L(M).
First we solve recurrence relations involving languages.

Solving recurrence relations.
Let Î£ be an alphabet, B and C be known languages in Î£* such that Ïµ âˆ‰ B and L be an unknown language such that L = BL âˆª C.
Let's try to express L in terms of B and C hence solving it.

To start consider an arbitrary string u in language L. Now to determine how u looks like. Since u âˆˆ L and L = BL âˆª C, we know u is a string in BL âˆª C and therefore we have the following possibilities:

1. u is an element of C.
2. u is an element of BL, if this is the case, then there are strings b âˆˆ B and v âˆˆ L such that u = bv. Since Ïµ âˆ‰ B, we have b â‰  Ïµ and thus, |v| < |u|. Since v is a string in L which is equal to BL âˆª C, v is a string in BL âˆª C and we have the following possibilities for v.
• v is an element of C, if this is the case, u = bv where b âˆˆ B and v âˆˆ C and therefore u âˆˆ BC.
• v is an element of BL, if this is the case, there are strings bâ€² âˆˆ B and w âˆˆ L such that v = bâ€² w. Since Ïµ âˆ‰ B we have bâ€² â‰  Ïµ, therefore, |w| < |v|. Since w is a string in L, there are two further possibilities.
** w is an element of C, if this is the case, u = bbâ€² w, where b, bbâ€² âˆˆ B and w âˆˆ C, therefore u âˆˆ BBC.
** w is an element of BL, if this is the case, there are strings bâ€²â€² âˆˆ B and x âˆˆ L such that w = bâ€²â€²x. Since â‰  Ïµ B we have bâ€²â€² â‰  Ïµ and therefore *|x| < |w|. Since x is a string in L, x is also a string in BL âˆª C and we have another two further possibilities for x.
*** x is an element in C, if this is the case, u = bbâ€² bâ€²â€² x where b, bâ€², bâ€²â€² âˆˆ B and x âˆˆ C and therefore u âˆˆ BBBC.
*** x is an element of BL and so on...

From the above we are convinced that any string u in L can be written as a concatenation of zero or more strings in B followed by one string in C.

Lemma 1: Let Î£ be an alphabet and B, C, L, be languages in Î£* such that Ïµ â‰  B and L = BL âˆª C, then L = B*C.

Proof:
We first show that B*C âŠ† L. Let u be an arbitrary string B*C. Then u is the concatenation of k strings of B for some k â‰¥ 0 followed by one string of C.

The base case: k = 0, u is a string in C, therefore u is a string in BL âˆª C. Since L = BL âˆª C, u is a string in L.

Inductive step: k â‰¥ 1, we write u = vwc where v is a string in B, w is the concatenation of k-1 strings of B and c is a string of C.
We define y = wc and observe that y is the concatenation of k-1 strings of B followed by a single string of C.
By induction, string y is an element of L, therefore u = vy where v is a string in B and y a string in L.
This shows that u is a string in BL and therefore u is a string in BL âˆª C and since L = BL âˆª C if follows that u is a string in L.
We have completed the proof that B*C âŠ† L.

To show that L âŠ† B*C, let u be an arbitrary string in L and l its length. By induction we will prove that u is a string in B*C.

Base case: l = 0, u = Ïµ and since u âˆˆ L and L = BL âˆª C, u is a string in BL âˆª C. Since Ïµ â‰  B, u cannot be a string in BL and hence u must be a string in C.
Since C âŠ† B*C, u is a string in B*C.

Induction step: l â‰¥ 1, If u is a string in C, u is a string in B*C and we are done.

Assuming u is not a string in C. Since u âˆˆ L and L = BL âˆª C, u is a string in BL. Therefore there are strings b âˆˆ B and v âˆˆ L such that u = bv.
Since Ïµ â‰  B, the length of b is greater or equal to 1 and therefore the length of v is less than the length of u.
By induction, v is a string in B*C and hence u = bv where b âˆˆ B and v âˆˆ B*C which shows u âˆˆ B(B*C). Since B(B*C) âŠ† B*C, it shows that u âˆˆ B*C.

This lemma holds for any language B that doesn't contains an empty string Ïµ.

Converting DFA to regular expression.
We use lemma 1 to prove that every DFA can be converted to a regular expression.
Let M = (Q,Î£, Î´, q, F) be an arbitrary deterministic finite automaton.
We will show that there exists a regular expression that describes a language L(M).

For each state r âˆˆ Q we define:
Lr={w âˆˆ Î£*: the path in the state diagram of M that starts in state r and corresponds to w ends in state F}. That is, Lr is the language accepted by M , if r was the start state.
Now we show that Lr can be described by a regular expression, Since L(M) = Lq, it proves that L(M) can be described by a regular expression.

The idea is to set up equations for the languages Lr which we then solve using lemma 1.
We claim that

Let w be a string Lr, then the path P in the state diagram of M starting in state r and corresponding to w ends in state F.
Since r âˆ‰ F, this path contains at least one edge.
Let râ€² be the following state after the first state, then râ€² = Î´(r, b) for some symbol b âˆˆ Î£.
Therefore, b is the first symbol of w.
We write w = bv where v is the remaining part of w.
Then the path Pâ€² = P \ {r} in the state diagram M starting in state râ€² and corresponding to v ends in state F.
Thus, v âˆˆ Lrâ€² = LÎ´(r, b).
Hence:

Conversely, let w be a string in , then there is a symbol b âˆˆ Î£ and a string v âˆˆ such that w = bv.
Let Pâ€² be a path in the state diagram of M starting in state Î´(r, b) and corresponding to v.
Since v âˆˆ , this path ends in a state F.
Let P be the path in the state diagram of M starting in r and follows the edge to Î´(r, b), then follows Pâ€².
This path corresponds to w and end in a state F and therefore w âˆˆ Lr.
We complete the proof of the claim.

In a similar fashion we can prove that:

In this case we have a set of equations in the unknowns Lr for r âˆˆ Q.
The total number of equations is equal to the number of unknowns(size of Q).
To obtain the regular expression of L(M) = Lq) we solve the equations using lemma 1, for this we have to convince ourselves that these equations have a solution for any DFA.

Lemma 2: Let n â‰¥ 1 be an integer and for 1 â‰¤ i â‰¤ n and 1 â‰¤ j â‰¤ n, let ${\mathrm{B}}_{\mathrm{ij}}$ and ${\mathrm{C}}_{\mathrm{i}}$ be regular expressions such that Ïµ â‰  ${\mathrm{B}}_{\mathrm{ij}}$. Let L1, L2, ..., Ln be languages satisfying:

Then L1 is expressed as a regular expression by involving regular expressions ${\mathrm{B}}_{\mathrm{ij}}$ and ${\mathrm{C}}_{\mathrm{i}}$.

Proof: We prove this lemma by an induction on n.
Base case: n = 1, therefore, we have
L1 = ${\mathrm{B}}_{\mathrm{11}}$ ${\mathrm{L}}_{1}$ âˆª ${\mathrm{C}}_{1}$

Since Ïµ âˆ‰ ${\mathrm{B}}_{\mathrm{11}}$, by lemma 1 that L1 = ${{\mathrm{B}}^{*}}_{11}$${\mathrm{C}}_{\mathrm{1}}$, this proves the base case.

Induction step: Let n â‰¥ 2, we assume that the lemma is true for n-1, we have:
${\mathrm{L}}_{\mathrm{n}}=\left(\underset{\mathrm{j}=1}{\overset{\mathrm{n}}{â‹ƒ}}{\mathrm{B}}_{\mathrm{nj}}{\mathrm{L}}_{\mathrm{j}}\right)âˆª{\mathrm{C}}_{\mathrm{n}}$

= ${\mathrm{B}}_{\mathrm{nn}}{\mathrm{L}}_{\mathrm{n}}âˆª\left(\underset{\mathrm{j}=1}{\overset{\mathrm{n}-1}{â‹ƒ}}{\mathrm{B}}_{\mathrm{nj}}{\mathrm{L}}_{\mathrm{j}}\right)âˆª{\mathrm{C}}_{\mathrm{n}}$

Since Ïµ âˆ‰ ${\mathrm{B}}_{\mathrm{nn}}$, then from lemma 1:

${\mathrm{L}}_{\mathrm{n}}={\mathrm{B}}_{\mathrm{nn}}^{*}\left(\left(\underset{\mathrm{j}=1}{\overset{\mathrm{n}-1}{â‹ƒ}}{\mathrm{B}}_{\mathrm{nj}}{\mathrm{L}}_{\mathrm{j}}\right)âˆª{\mathrm{C}}_{\mathrm{n}}\right)$

= ${\mathrm{B}}_{\mathrm{nn}}^{*}\left(\underset{\mathrm{j}=1}{\overset{\mathrm{n}-1}{â‹ƒ}}{\mathrm{B}}_{\mathrm{nj}}{\mathrm{L}}_{\mathrm{j}}\right)âˆª{\mathrm{B}}_{\mathrm{nn}}^{*}{\mathrm{C}}_{\mathrm{n}}$

= $\left(\underset{\mathrm{j}=1}{\overset{\mathrm{n}-1}{â‹ƒ}}{\mathrm{B}}_{\mathrm{nn}}^{*}{\mathrm{B}}_{\mathrm{nj}}{\mathrm{L}}_{\mathrm{j}}\right)âˆª{\mathrm{B}}_{\mathrm{nn}}^{*}{\mathrm{C}}_{\mathrm{n}}$

By substituting the equation for Ln, into equations for Li, 1 â‰¤ i â‰¤ n âˆ’ 1 we obtain the equation.
${\mathrm{L}}_{\mathrm{i}}=\left(\underset{\mathrm{j}=1}{\overset{\mathrm{n}}{â‹ƒ}}{\mathrm{B}}_{\mathrm{ij}}{\mathrm{L}}_{\mathrm{j}}\right)âˆª{\mathrm{C}}_{\mathrm{i}}$

= ${\mathrm{B}}_{\mathrm{in}}{\mathrm{L}}_{\mathrm{n}}âˆª\left(\underset{\mathrm{j}=1}{\overset{\mathrm{n}-1}{â‹ƒ}}{\mathrm{B}}_{\mathrm{ij}}{\mathrm{L}}_{\mathrm{j}}\right)âˆª{\mathrm{C}}_{\mathrm{i}}$

= $\left(\underset{\mathrm{j}=1}{\overset{\mathrm{n}-1}{â‹ƒ}}\left({\mathrm{B}}_{\mathrm{in}}{\mathrm{B}}_{\mathrm{nn}}^{*}{\mathrm{B}}_{\mathrm{nj}}âˆª{\mathrm{B}}_{\mathrm{ij}}\right){\mathrm{L}}_{\mathrm{j}}\right)âˆª{\mathrm{B}}_{\mathrm{in}}{\mathrm{B}}_{\mathrm{nn}}^{*}{\mathrm{C}}_{\mathrm{n}}âˆª{\mathrm{C}}_{\mathrm{i}}$

An therefore we have n-1 equations in L1, L2,...,Ln-1 and since $\mathrm{Ïµ}âˆ‰{\mathrm{B}}_{\mathrm{in}}{\mathrm{B}}_{\mathrm{nn}}^{*}{\mathrm{B}}_{\mathrm{nj}}âˆª{\mathrm{B}}_{\mathrm{ij}}$, it follows from the induction hypothesis that L1 can be expressed as a regular expression only involving the regular expressions ${\mathrm{B}}_{\mathrm{ij}}$ and ${\mathrm{C}}_{\mathrm{i}}$

## Summary.

Regular expressions are a means to describe languages, in this article, we proved that every regular expression describes a regular language and that every DFA can be converted to a regular expression that describes a language.

#### Erick Lumunge

Erick is a passionate programmer with a computer science background who loves to learn about and use code to impact lives positively.