Regular Expressions in Java


Regex -> Reg + ex, i.e. Regular + Expression

A regular expression (regex) is a sequence of characters that describes a search pattern. It is generally used for 'find', 'find and replace', and 'input validation'.
Most languages, such as Perl and JavaScript, either have regular expressions built in or provide a library for them. This article focuses on how to write a regular expression in Java.

Writing Regex

Let us understand a regular expression with an example to check for a valid username such that:

  • It must be at least eight characters and at most twenty-five characters long.
  • It must not contain any special characters (only letters, digits, and the underscore are allowed).
  • The first character must be a letter.

The regular expression would be "^[a-zA-Z]\\w{7,24}$".
Now let us break this expression down to understand it:

  • ^ at the beginning, outside the square brackets, anchors the match to the start of the string.
  • [a-zA-Z] matches a single uppercase or lowercase letter, which is the first character.
  • \w represents a word character; it is shorthand for [a-zA-Z_0-9]. This is known as a Regex Metacharacter. We write \\w because \ is also the escape character in Java string literals, so \\ is needed to produce a single \.
  • {7,24} expresses that the preceding \w must occur from 7 to 24 times (this excludes the first character, which was matched explicitly), giving 8 to 25 characters in total.
  • $ signifies the end of the string.
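
The breakdown above can be tried out with a small program. This is a minimal sketch: the class name UsernameCheck and the helper isValidUsername are hypothetical names chosen for illustration, and the sample usernames are made up. Note that \w also accepts the underscore.

```java
import java.util.regex.Pattern;

class UsernameCheck {
    // The username pattern from this article, compiled once and reused.
    static final Pattern USERNAME = Pattern.compile("^[a-zA-Z]\\w{7,24}$");

    // Hypothetical helper: true when the whole input satisfies the rules.
    static boolean isValidUsername(String input) {
        return USERNAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidUsername("alice_2024")); // true: 10 characters, starts with a letter
        System.out.println(isValidUsername("1alice2024")); // false: starts with a digit
        System.out.println(isValidUsername("alice"));      // false: only 5 characters
        System.out.println(isValidUsername("alice-2024")); // false: '-' is not a word character
    }
}
```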

Now that you have an idea of a regular expression and what it breaks down to, we will look into individual components so that you can make your own.

Character Classes

Particular characters can be included or excluded using Character Classes, explained as follows:

Simple Class
[ope] refers to the set {'o','p','e'}.

Negation
This is done using ^ as the first character inside the brackets: [^nge] matches any character except 'n', 'g', and 'e'. It can be represented as C - {'n','g','e'}, where C is the set of all characters.

Range
This is done using -. For instance, [a-z] refers to the letters from 'a' to 'z'. We may also have multiple ranges at once, like [a-zA-Z].

Character classes can also be combined using set operations.

Union - [n-u[w-z]] means either 'n' to 'u' or 'w' to 'z'.
Intersection - [a-z&&[^iq]] means 'a' to 'z' excluding 'i' and 'q'.
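
The character classes above can be checked against single characters. The following is a minimal sketch; CharacterClassDemo and fullMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("[ope]", "p"));        // true: 'p' is in the simple class
        System.out.println(fullMatch("[^nge]", "g"));       // false: 'g' is excluded by the negation
        System.out.println(fullMatch("[a-zA-Z]", "Q"));     // true: 'Q' falls in the second range
        System.out.println(fullMatch("[n-u[w-z]]", "v"));   // false: 'v' is in neither range of the union
        System.out.println(fullMatch("[a-z&&[^iq]]", "i")); // false: 'i' is removed by the intersection
        System.out.println(fullMatch("[a-z&&[^iq]]", "j")); // true: 'j' is a lowercase letter other than 'i' or 'q'
    }
}
```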

Regex Quantifiers

Regex Quantifiers specify how many times the preceding element P may occur, as follows:

  • P? - P occurs once or not at all.
  • P+ - P occurs one or more times.
  • P* - P occurs zero or more times.
  • P{n} - P occurs exactly n times.
  • P{n,} - P occurs n or more times.
  • P{m,n} - P occurs at least m times but no more than n times.
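
The quantifiers can be compared side by side on short inputs. This is a minimal sketch; QuantifierDemo and fullMatch are hypothetical names, and the letters 'a' and 'b' simply stand in for P.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("ab?", "a"));       // true: 'b' may be absent
        System.out.println(fullMatch("ab?", "abb"));     // false: 'b' may occur at most once
        System.out.println(fullMatch("ab+", "a"));       // false: at least one 'b' is required
        System.out.println(fullMatch("ab*", "a"));       // true: zero occurrences are allowed
        System.out.println(fullMatch("a{3}", "aa"));     // false: exactly three are required
        System.out.println(fullMatch("a{2,}", "aaaa"));  // true: two or more
        System.out.println(fullMatch("a{2,4}", "aaaa")); // true: the upper bound is inclusive
        System.out.println(fullMatch("a{2,4}", "aaaaa"));// false: five exceeds the upper bound
    }
}
```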

Some character classes have shorthand forms known as Regex Metacharacters, explained as follows:

  • . - Any one character (by default, except line terminators).
  • \d - Any digit, shorthand for [0-9].
  • \D - Any non-digit, shorthand for [^0-9].
  • \s - Any whitespace character, shorthand for [ \t\n\x0B\f\r].
  • \S - Any non-whitespace character, shorthand for [^\s].
  • \w - Any word character, shorthand for [a-zA-Z_0-9].
  • \W - Any non-word character, shorthand for [^\w].
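
In Java source code each backslash of a metacharacter is doubled inside the string literal. The sketch below illustrates this; MetacharacterDemo and fullMatch are hypothetical names used only here.

```java
import java.util.regex.Pattern;

class MetacharacterDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch(".", "x"));       // true: any one character
        System.out.println(fullMatch("\\d", "7"));     // true: a single digit
        System.out.println(fullMatch("\\d", "77"));    // false: \d alone matches exactly one digit
        System.out.println(fullMatch("\\d+", "2024")); // true: a quantifier allows several digits
        System.out.println(fullMatch("\\D", "x"));     // true: not a digit
        System.out.println(fullMatch("\\s", " "));     // true: a space is whitespace
        System.out.println(fullMatch("\\S", "\t"));    // false: a tab is whitespace
        System.out.println(fullMatch("\\w", "_"));     // true: the underscore is a word character
        System.out.println(fullMatch("\\W", "-"));     // true: '-' is not a word character
    }
}
```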

Now that you know regular expressions, try solving these problems:

1. Can we build a Regex expression to check if entered value is a valid IP Address?

  • IP address is a string in the form "A.B.C.D", where the value of A, B, C, and D may range from 0 to 255.
  • Leading zeros are allowed.
  • The length of A, B, C, or D can't be greater than 3.

Solution
Each of A, B, C, and D must be a value from 0 to 255.
This can be broken down as follows:

  • If the value is below 200, it is an optional leading '0' or '1' followed by one or two digits, i.e. [01]?\\d\\d?
  • If the first digit is '2' and the value has three digits:
    • If the second digit lies in the range '0' to '4', then any digit may follow, i.e. 2[0-4]\\d.
    • If the second digit is '5', then the third digit must lie in the range '0' to '5', i.e. 25[0-5].

Each of the first three numbers is followed by a '.', which must be escaped with '\' (written as \\. in a Java string) because an unescaped '.' matches any character.
This forms "^([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\."

The group is repeated four times, with no '.' after the last one and a closing $, so the string is:

String pattern = 
            "^([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\." +
            "([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\." +
            "([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\." +
            "([01]?\\d\\d?|2[0-4]\\d|25[0-5])$";

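The same pattern can be assembled from one reusable fragment and tested on a few addresses. This is a minimal sketch: IpAddressCheck, OCTET, and isValidIp are hypothetical names, and the sample addresses are made up for illustration.

```java
import java.util.regex.Pattern;

class IpAddressCheck {
    // One number: below 200 (leading zeros allowed), 200 to 249, or 250 to 255.
    static final String OCTET = "([01]?\\d\\d?|2[0-4]\\d|25[0-5])";

    // Three numbers each followed by a literal dot, then a final number.
    static final Pattern IP =
            Pattern.compile("^" + OCTET + "\\." + OCTET + "\\." + OCTET + "\\." + OCTET + "$");

    // Hypothetical helper: true when the whole input is a valid "A.B.C.D" address.
    static boolean isValidIp(String input) {
        return IP.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidIp("192.168.1.1"));   // true
        System.out.println(isValidIp("000.12.12.034")); // true: leading zeros are allowed
        System.out.println(isValidIp("256.1.1.1"));     // false: 256 is out of range
        System.out.println(isValidIp("1.2.3"));         // false: only three numbers
        System.out.println(isValidIp("1.2.3.4.5"));     // false: five numbers
    }
}
```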
Try it yourself
2. Write a regular expression for password validation:

  • Must be at least 8 characters long
  • Must contain at least one special character
  • Must contain at least one uppercase character

Using Regex in Java

In Java, regular expressions are supported by the java.util.regex package, an API for defining patterns to search and manipulate strings. It provides one interface and three classes, namely:

  • Matcher Class
  • MatchResult Interface
  • Pattern Class
  • PatternSyntaxException Class

Let us look at these and their functionality one by one.

Matcher Class

The Matcher class implements the MatchResult interface and provides the following methods, among others.

  1. boolean matches() - This tests whether the entire input matches the pattern.
  2. boolean find() - This looks for the next subsequence of the input that matches the pattern.
  3. boolean find(int start) - This resets the matcher and looks for the next matching subsequence, starting at the given index.
  4. String group() - This returns the subsequence matched by the previous match.
  5. int start() - This returns the starting index of the matched subsequence.
  6. int end() - This returns the index just after the last character of the matched subsequence.
  7. int groupCount() - This returns the number of capturing groups in the pattern.
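
The Matcher methods listed above are typically used together in a loop. The following is a minimal sketch; MatcherDemo and describeMatches are hypothetical names, and the input strings are made up for illustration. It also uses group(int), an overload of group() that returns the text of a numbered capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherDemo {
    // Hypothetical helper: describes every match as "text@start-end".
    static List<String> describeMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) { // advances to the next matching subsequence
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches(".ut", "cut the nut")); // [cut@0-3, nut@8-11]

        // find(int) resets the matcher and searches from the given index.
        Matcher fromIndex = Pattern.compile(".ut").matcher("cut the nut");
        System.out.println(fromIndex.find(4) + " " + fromIndex.group()); // true nut

        // groupCount() counts the capturing groups in the pattern, not the matches.
        Matcher range = Pattern.compile("(\\d+)-(\\d+)").matcher("10-20");
        System.out.println(range.groupCount());                    // 2
        System.out.println(range.matches());                       // true
        System.out.println(range.group(1) + " " + range.group(2)); // 10 20
    }
}
```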

Pattern Class

  1. static Pattern compile(String regex) - This compiles the given regular expression and returns a Pattern instance.
  2. Matcher matcher(CharSequence input) - This creates a matcher that will match the given input against the pattern.
  3. static boolean matches(String regex, CharSequence input) - This combines the compile and matcher methods: it compiles the regular expression and then tests whether the entire input matches it.
  4. String[] split(CharSequence input) - This splits the given input around matches of the pattern.
  5. String pattern() - This returns the regular expression from which the pattern was compiled.
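
The Pattern methods listed above can be exercised in a few lines. This is a minimal sketch; PatternDemo, COMMA, and splitOnCommas are hypothetical names, and the comma-separated input is made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternDemo {
    // A comma with optional whitespace on either side.
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Hypothetical helper: splits the input around each comma.
    static String[] splitOnCommas(String input) {
        return COMMA.split(input);
    }

    public static void main(String[] args) {
        System.out.println(COMMA.pattern());                                    // \s*,\s*
        System.out.println(Arrays.toString(splitOnCommas("red , green,blue"))); // [red, green, blue]

        // matcher(...) creates a Matcher bound to this input.
        Matcher m = COMMA.matcher("a , b");
        System.out.println(m.find());                                           // true

        // The static matches(...) compiles and matches the whole input in one call.
        System.out.println(Pattern.matches(".ut", "hut"));                      // true
    }
}
```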

To delve deeper into using these in Java, you are highly recommended to look through the official documentation of the java.util.regex package.

Question

Try predicting the output of the following code.

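    import java.util.regex.*;

    public class Question
    {
        public static void main(String args[])
        {
            // Way 1: compile the pattern, create a matcher, then match.
            Pattern p = Pattern.compile(".ut");
            Matcher m = p.matcher("but");
            boolean b1 = m.matches();

            // Way 2: the same steps chained into one statement.
            boolean b2 = Pattern.compile(".ut").matcher("cut").matches();

            // Way 3: the static convenience method.
            boolean b3 = Pattern.matches(".ut", "out");

            System.out.println(b1 + " " + b2 + " " + b3);
        }
    }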

  • true true true
  • false false false
  • Can't Say
  • true true false

The correct answer is "true true true". These are three ways to match a string against a pattern, where ".ut" is the regular expression that holds true for "cut", "but", "hut", "0ut", "out", et cetera.

Now you know how to write your own regex and also have an understanding of the classes at your disposal in Java to implement the functionality of regular expressions. You may go and use them in your programs! Cheers!