Regular Expression (Regex) in Python

Everyone has used CTRL+F to find some text in a document. Regex, short for Regular Expression, is an advanced form of that: it allows the user to search the input for a pattern that he or she defines, rather than for one fixed string. By default regex matching is greedy, meaning a quantifier such as * or + matches as much text as it can.

Let's say you want to check whether an IP address of the format xxx.xxx.xxx.x is present in a large set of documents. It can be done in two ways:-

1- The traditional way is to write the code without any form of regex, visiting each character of the text and matching it against the required form

def isip(text):
    # The form xxx.xxx.xxx.x is exactly 13 characters long.
    if len(text) != 13:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '.':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '.':
        return False
    for i in range(8, 11):
        if not text[i].isdecimal():
            return False
    if text[11] != '.':
        return False
    # The last part is a single digit.
    if not text[12].isdecimal():
        return False
    return True

Output

print('192.168.100.1 is an IP address:')
print(isip('192.168.100.1'))

192.168.100.1 is an IP address:
True

2- Using regex, this code can be compressed into very few lines

import re
ipNumRegex = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\.\d')
mo = ipNumRegex.search('My ip address is 192.168.100.1.')
print('Ip Found: ' + mo.group())

Output

Ip Found: 192.168.100.1

As we can see above, the difference between the two programs is immense: roughly twenty lines of code are reduced to just four.

Now, let's understand the second program in detail:-

import re

import re - Here we are just importing the regex module, which goes by the name re.

ipNumRegex = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\.\d')

ipNumRegex = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\.\d') - Here we are defining the pattern we want to search for. \d means any digit [0-9] and \. means a literal dot (an unescaped dot would match any character), so the pattern matches the form xxx.xxx.xxx.x. The r at the beginning marks a raw string, which turns off Python's escape sequences: as we know, \n in a normal string means a new line, and to get a literal backslash followed by n we have to write \\n. So to avoid writing \\d\\d\\d\\.\\d\\d\\d\\.\\d\\d\\d\\.\\d, we put r before the opening single quote.
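The effect of the r prefix can be checked directly. The snippet below is a small sketch; the variable names and the sample text are made up for illustration.

```python
import re

# A normal string turns backslash-n into one newline character,
# while a raw string keeps the backslash and the n as two characters.
normal = '\n'
raw = r'\n'
print(len(normal))  # 1
print(len(raw))     # 2

# Both spellings describe the same regex; the raw one is easier to read.
escaped_pattern = '\\d\\d\\d\\.\\d'
raw_pattern = r'\d\d\d\.\d'
print(escaped_pattern == raw_pattern)  # True

match = re.search(raw_pattern, 'value 123.4 here')
print(match.group())  # 123.4
```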

mo = ipNumRegex.search('My ip address is 192.168.100.1.')

mo = ipNumRegex.search('My ip address is 192.168.100.1.') - Here we are passing the text in which we want to search.
The search() method will return None if the regex pattern is not found in the string. If the pattern is found, the search() method
returns a Match object.

print('Ip Found: ' + mo.group())

print('Ip Found: ' + mo.group()) - group() returns the actual matched text as a string.
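Because search() returns None when nothing matches, calling group() on the result without a check raises an AttributeError. One way to guard against that is sketched below; the helper name find_ip is hypothetical and not part of the re module.

```python
import re

ip_regex = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\.\d')

def find_ip(text):
    """Return the first xxx.xxx.xxx.x match in text, or None if there is none."""
    mo = ip_regex.search(text)
    if mo is None:
        return None
    return mo.group()

print(find_ip('My ip address is 192.168.100.1.'))  # 192.168.100.1
# 127.0.0.1 does not have three digits in every part, so this pattern misses it.
print(find_ip('My ip address is 127.0.0.1.'))      # None
```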

Some predefined special sequences that begin with '\' :-

Character Class

  • \w - Matches any single letter, digit, or underscore.
  • \W - Matches any character not part of \w (lowercase w).
  • \d - Matches a decimal digit 0-9.
  • \D - Matches any character that is not a decimal digit.
  • \s - Matches a single whitespace character like: space, newline, tab, return.
  • \S - Matches any character not part of \s (lowercase s).
  • \t - Matches tab.
  • \n - Matches newline.
  • \r - Matches return.
  • \A - Matches only at the start of the string, even when the MULTILINE flag is set.
  • \Z - Uppercase z. Matches only at the end of the string.
  • + - Checks if the preceding character appears one or more times.
  • * - Checks if the preceding character appears zero or more times.
  • ? - Checks if the preceding character appears exactly zero or one time.
  • +?, *? - A ? placed after + or * specifies the non-greedy version, which matches as little text as possible.
  • { } - Checks for an explicit number of times.
  • ( ) - Creates a group when performing matches.
  • {n} - Matches exactly n of the preceding group.
  • {n,} - Matches n or more of the preceding group.
  • {,m} - Matches 0 to m of the preceding group.
  • {n,m} - Matches at least n and at most m of the preceding group.
  • (?P<name>...) - Creates a named group when performing matches.
  • [abc] matches any character between the brackets (such as a, b, or c).
  • [^abc] matches any character that isn’t between the brackets.
  • IGNORECASE (I) - Allows case-insensitive matches.
  • DOTALL (S) - Allows . to match any character, including newline.
  • MULTILINE (M) - Allows the start (^) and end ($) anchors to match at the start and end of each line, not only of the whole string.
  • VERBOSE (X) - Allows you to write whitespace and comments within a regular expression to make it more readable.
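Several entries from the list above can be tried out directly in the interpreter. The following is a minimal sketch rather than a complete tour; all sample strings and variable names are made up for illustration.

```python
import re

# Character classes and quantifiers
words = re.findall(r'\w+', 'user_1 sent 2 files')       # letters, digits, underscore
numbers = re.findall(r'\d{2,3}', '7 42 123 4567')        # two or three digits at a time
spellings = re.findall(r'colou?r', 'color and colour')   # the u is optional
consonants = re.findall(r'[^aeiou\s]', 'a regex')        # not a vowel, not whitespace

print(words)       # ['user_1', 'sent', '2', 'files']
print(numbers)     # ['42', '123', '456']
print(spellings)   # ['color', 'colour']
print(consonants)  # ['r', 'g', 'x']

# Named groups
date = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', 'released 2021-07')
print(date.group('year'), date.group('month'))  # 2021 07

# Flags
any_case = re.findall(r'python', 'Python PYTHON python', re.IGNORECASE)
line_starts = re.findall(r'^\w+', 'first line\nsecond line', re.MULTILINE)
string_start = re.findall(r'\A\w+', 'first line\nsecond line', re.MULTILINE)

print(any_case)      # ['Python', 'PYTHON', 'python']
print(line_starts)   # ['first', 'second']
print(string_start)  # ['first']

# VERBOSE lets the pattern carry whitespace and comments
readable = re.compile(r'''
    \d{3}   # three digits
    \.      # a literal dot
    \d      # one digit
''', re.VERBOSE)
print(readable.search('x 123.4').group())  # 123.4
```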

Examples

Dot(.) Character

The dot(.) matches any single character except newline character.

x = re.compile(r'.un')
x.findall('the thief of run is having fun')
['run', 'fun']

The dot (.) can also match the newline character, but the user then has to pass re.DOTALL.
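A short sketch of the difference, using a made-up two-line string:

```python
import re

text = 'first\nsecond'

# Without DOTALL the dot refuses to match the newline between the words.
without_flag = re.search(r'first.second', text)
# With DOTALL the dot matches the newline as well.
with_flag = re.search(r'first.second', text, re.DOTALL)

print(without_flag)               # None
print(with_flag.group() == text)  # True
```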

The findall() Method

As a counterpart to the search() method, which returns only the first match, we also have findall(), which returns a list of all the strings matched in the given document

ipNumRegex = re.compile(r'\d+\.\d+\.\d+\.\d+') # one or more digits per part; has no groups
ipNumRegex.findall('Ip: 127.0.0.1 private: 0.0.0.0')
['127.0.0.1', '0.0.0.0']
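The "has no groups" comment matters because findall() changes its return value when the pattern contains groups: with two or more groups it returns a list of tuples, one tuple per match, instead of a list of strings. A small sketch, with variable names chosen for illustration:

```python
import re

text = 'Ip: 127.0.0.1 private: 0.0.0.0'

no_groups = re.compile(r'\d+\.\d+\.\d+\.\d+')
with_groups = re.compile(r'(\d+)\.(\d+)\.(\d+)\.(\d+)')

plain = no_groups.findall(text)
grouped = with_groups.findall(text)

print(plain)    # ['127.0.0.1', '0.0.0.0']
print(grouped)  # [('127', '0', '0', '1'), ('0', '0', '0', '0')]
```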

DotStar

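DotStar (.*) matches everything up to the end of the line: the dot stands for any character except newline, and the star repeats it zero or more times.

x = re.compile(r'Names:(.*)')
f = x.search("Names:swagpandaswami")
f.group(1) # the text captured by the parentheses

Output

swagpandaswami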
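Since .* is greedy, it grabs as much text as it can. Adding a ? after the star makes it non-greedy, so it stops at the earliest point where the rest of the pattern can match. The sketch below uses a made-up piece of markup to show the difference.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: runs from the first '<' to the last '>' in the string.
greedy = re.findall(r'<.*>', text)
# Non-greedy: stops at the first '>' after each '<'.
lazy = re.findall(r'<.*?>', text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```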

Task-

The example given above can be updated to match IP addresses of the form xxx.xxx.xxx.xxx with just some minor changes. The tools are provided in this article itself (Hint: Wild Card).

Some Key Applications of Regular Expression:-

  • Extract emails from a Text Document
  • Regular Expressions for Data collection
  • Working with Date Time features
  • Using Regex for Text (NLP)
  • Finding phone numbers in a given document
  • Automation
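As an illustration of the first application, the sketch below pulls email addresses out of a piece of text. The pattern is deliberately simplified and is an assumption made for this example, not a complete validator for real-world addresses; the sample text is made up.

```python
import re

# Simplified pattern: a local part, an @ sign, then a dotted domain name.
email_regex = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

text = 'Contact alice@example.com or bob.smith@mail.example.org for details.'
emails = email_regex.findall(text)

print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
```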

Research these key points:

  1. Difference between group() and groups().
  2. What is VERBOSE.
  3. Time Complexity.
  4. Making your own character class (yes, you read that right: you can make your own class).