Regular Expression

Strings are among the most commonly used data types in programming, and the need to manipulate them is nearly universal. For example, to determine whether a string is a valid Email address, you could extract the substrings before and after the @ symbol and check that they are a valid username and domain name respectively. This approach is cumbersome, however, and the resulting code is difficult to reuse.

A regular expression is a powerful tool for matching strings. Its design philosophy is to define a rule for strings using a descriptive language: any string that conforms to the rule is considered a “match,” while non-conforming strings are deemed invalid.

Thus, the steps to validate an Email address are:

Create a regular expression pattern to match valid Emails;
Use this pattern to match the user’s input and determine its validity.

Since regular expressions are also represented as strings, we first need to understand how to describe characters using special syntax.

In regular expressions:

Directly specifying a character means exact match;
\d matches a single digit;
\w matches a single letter, digit, or underscore.

Examples:

'00\d' matches '007' but not '00A';
'\d\d\d' matches '010';
'\w\w\d' matches 'py3';
. matches any single character except a newline (e.g., 'py.' matches 'pyc', 'pyo', 'py!', etc.).
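The examples above can be checked directly in Python. The following is a minimal sketch; the helper name `matches` is introduced here for illustration only, and it relies on `re.fullmatch` so that the whole string has to fit the pattern.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True only if the whole of text fits pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'00\d', '007'))    # True: \d accepts the digit 7
print(matches(r'00\d', '00A'))    # False: A is not a digit
print(matches(r'\d\d\d', '010'))  # True
print(matches(r'\w\w\d', 'py3'))  # True
print(matches(r'py.', 'py!'))     # True: . accepts any one character
print(matches(r'py.', 'py\n'))    # False: by default . does not match a newline
```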

To match variable-length characters:

* matches zero or more of the preceding character;
+ matches one or more of the preceding character;
? matches zero or one of the preceding character;
{n} matches exactly n of the preceding character;
{n,m} matches between n and m of the preceding character.
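To see each quantifier in action, here is a small sketch. The helper `full` is a name made up for this example; it simply wraps `re.fullmatch` so that partial matches do not count.

```python
import re

def full(pattern, text):
    """Hypothetical helper: True only if the whole of text fits pattern."""
    return re.fullmatch(pattern, text) is not None

# * : zero or more of the preceding character
print(full(r'ab*', 'a'), full(r'ab*', 'abbb'))       # True True
# + : one or more
print(full(r'ab+', 'a'), full(r'ab+', 'ab'))         # False True
# ? : zero or one
print(full(r'ab?', 'ab'), full(r'ab?', 'abb'))       # True False
# {n} : exactly n
print(full(r'\d{3}', '010'), full(r'\d{3}', '01'))   # True False
# {n,m} : between n and m, inclusive
print(full(r'\d{3,8}', '12345678'), full(r'\d{3,8}', '123456789'))  # True False
```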

Complex Example: `\d{3}\s+\d{3,8}`

Let’s break this down left to right:

\d{3}: Matches exactly 3 digits (e.g., '010');
\s+: Matches one or more whitespace characters (spaces, tabs, etc., e.g., ' ', '\t');
\d{3,8}: Matches 3 to 8 digits (e.g., '1234567').

Combined, this pattern matches phone numbers whose area code and local number are separated by one or more whitespace characters (e.g., '010 1234567').
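As a quick check of this pattern, the sketch below tries it against a few sample strings. The sample inputs and the name `phone` are chosen for illustration only.

```python
import re

# The pattern from the text: 3 digits, whitespace, then 3 to 8 digits.
phone = re.compile(r'\d{3}\s+\d{3,8}')

samples = ['010 1234567', '010\t12345', '0101234567', '010 12']
for text in samples:
    # fullmatch() requires the entire string to fit the pattern.
    print(repr(text), phone.fullmatch(text) is not None)
```

The first two samples fit the pattern. The third has no whitespace between the two parts, and the fourth has only two digits after the space, so both are rejected.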

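What about matching numbers like '010-12345'? Outside of [], - is an ordinary character and can be written as is; it is special only inside [], where it denotes a range. Escaping it with \ is therefore optional but harmless: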

\d{3}\-\d{3,8}

However, this still won’t match '010 - 12345' (with spaces around -), requiring a more complex pattern.
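One possible way to tolerate spaces around the hyphen is to allow optional whitespace (\s*) on each side of it. This is only a sketch of one such pattern, not the only reasonable choice; the name `phone` is an assumption made for this example.

```python
import re

# \s* allows zero or more whitespace characters on either side of the hyphen.
# A hyphen outside [] is a literal character, so no backslash is needed here.
phone = re.compile(r'\d{3}\s*-\s*\d{3,8}')

for text in ['010-12345', '010 - 12345', '010 -12345', '010 12345']:
    print(repr(text), phone.fullmatch(text) is not None)
```

The last sample is rejected because the pattern still requires a hyphen between the two groups of digits.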

Advanced Syntax

For more precise matching, use [] to define a character range:

[0-9a-zA-Z_]: Matches a single digit, letter, or underscore;
[0-9a-zA-Z_]+: Matches a string of one or more digits, letters, or underscores (e.g., 'a100', '0_Z', 'Py3000');
[a-zA-Z_][0-9a-zA-Z_]*: Matches a valid Python variable name (a letter or underscore, followed by any number of digits, letters, or underscores);
[a-zA-Z_][0-9a-zA-Z_]{0,19}: Restricts the variable name to 1–20 characters (1 initial character plus up to 19 subsequent characters).
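The length-limited variable pattern can be tried out as follows. This is a minimal sketch: `IDENT` and `is_short_identifier` are names invented for the example, and the check covers ASCII names only (it ignores reserved keywords and non-ASCII identifiers, which Python also has rules for).

```python
import re

# One leading letter or underscore, then up to 19 letters, digits, or underscores.
IDENT = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]{0,19}')

def is_short_identifier(name):
    """Hypothetical helper: True if name is an ASCII identifier of 1 to 20 characters."""
    return IDENT.fullmatch(name) is not None

print(is_short_identifier('Py3000'))   # True
print(is_short_identifier('_x'))       # True
print(is_short_identifier('3abc'))     # False: starts with a digit
print(is_short_identifier('a' * 20))   # True: exactly 20 characters
print(is_short_identifier('a' * 21))   # False: one character too long
```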

Additional syntax:

A|B: Matches either A or B (e.g., (P|p)ython matches 'Python' or 'python');
^: Matches the start of a line (e.g., ^\d requires the string to start with a digit);
$: Matches the end of a line (e.g., \d$ requires the string to end with a digit).

Note: py matches 'python', but ^py$ enforces an exact full-line match (only matches 'py').
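The effect of |, ^, and $ can be observed with `re.search`, which looks for the pattern anywhere in the string. The sample strings below are made up for illustration.

```python
import re

# Without anchors, the pattern may be found anywhere in the string.
print(bool(re.search(r'py', 'python')))                  # True
# With both anchors, the whole string must be exactly 'py'.
print(bool(re.search(r'^py$', 'python')))                # False
print(bool(re.search(r'^py$', 'py')))                    # True
# Alternation inside a group: either P or p, followed by 'ython'.
print(bool(re.search(r'(P|p)ython', 'I like Python')))   # True
# ^\d: the string must start with a digit; \d$: it must end with one.
print(bool(re.search(r'^\d', '3 apples')))               # True
print(bool(re.search(r'\d$', '3 apples')))               # False
```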

The `re` Module

With this foundation, we can use regular expressions in Python via the built-in re module. Since Python strings also use \ for escaping, extra care is needed:

s = 'ABC\\-001'  # Python string
# Corresponding regex pattern: 'ABC\-001'

To avoid escape issues, always use Python’s r prefix for raw strings:

s = r'ABC\-001'  # Python string
# Corresponding regex pattern remains: 'ABC\-001'
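A short check confirms that both spellings produce the same string, and that the string works as a pattern. The variable names `escaped` and `raw` are chosen for this sketch only.

```python
import re

escaped = 'ABC\\-001'   # ordinary string: the backslash itself must be escaped
raw = r'ABC\-001'       # raw string: backslashes are kept as written

print(escaped == raw)   # True: both hold the same 8 characters
print(len(raw))         # 8
print(bool(re.fullmatch(raw, 'ABC-001')))  # True: \- matches a literal hyphen
```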

Basic Matching

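The match() function checks whether a string matches a regex pattern, starting from the beginning of the string. The examples here end the pattern with $ so that the whole string must match: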

>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<re.Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>  # Returns None (no match)

Returns a Match object if the pattern matches;
Returns None if there is no match.

Common validation pattern:

test = 'user_input_string'
if re.match(r'regex_pattern', test):
    print('ok')
else:
    print('failed')
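When validating input, it helps to know how `re.match` differs from its siblings `re.search` and `re.fullmatch`, all of which are part of the standard `re` module. The sketch below uses a pattern without ^ or $ to make the difference visible; the name `PHONE` and the sample strings are illustrative.

```python
import re

PHONE = r'\d{3}-\d{3,8}'   # deliberately written without ^ and $

# match() only requires the pattern to fit at the start of the string.
print(bool(re.match(PHONE, '010-12345 ext. 9')))      # True
# fullmatch() requires the entire string to fit.
print(bool(re.fullmatch(PHONE, '010-12345 ext. 9')))  # False
# search() accepts a fit anywhere in the string.
print(bool(re.search(PHONE, 'call 010-12345')))       # True
print(bool(re.match(PHONE, 'call 010-12345')))        # False
```

For validation, either anchor the pattern with $ as in the examples above or use `re.fullmatch`, so that trailing garbage is rejected.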

Splitting Strings

Regex-based splitting is more flexible than fixed-character splitting:

# Basic split (fails with consecutive spaces)
>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']

# Regex split (handles any number of spaces)
>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']

# Split on spaces/commas
>>> re.split(r'[\s\,]+', 'a,b, c  d')
['a', 'b', 'c', 'd']

# Split on spaces/commas/semicolons
>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']

Use this to normalize messy user input (e.g., tag lists) into clean arrays.
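As a sketch of that idea, the function below turns a loosely formatted tag string into a clean list. The name `parse_tags` and the sample input are assumptions made for this example. Note that `re.split` yields empty strings when the input begins or ends with a separator, so they are filtered out.

```python
import re

def parse_tags(raw):
    """Hypothetical helper: split on runs of whitespace, commas, or semicolons."""
    parts = re.split(r'[\s,;]+', raw)
    # Leading or trailing separators produce empty strings; drop them.
    return [tag for tag in parts if tag]

print(parse_tags('a,b;; c  d'))                    # ['a', 'b', 'c', 'd']
print(parse_tags(' python, regex;;  tutorial '))   # ['python', 'regex', 'tutorial']
```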

Group Extraction

Beyond simple matching, regex can extract substrings using () to define groups:

>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m.group(0)  # Entire matched string
'010-12345'
>>> m.group(1)  # First group (area code)
'010'
>>> m.group(2)  # Second group (local number)
'12345'

group(0): Always the full matched string;
group(1), group(2), etc.: Corresponding captured groups.

Practical Example (Time Validation):

>>> t = '19:05:30'
>>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
>>> m.groups()
('19', '05', '30')

Regex has limits, however. For dates, a pattern alone cannot practically reject invalid values such as '2-30' or '4-31'; that check requires additional code logic.
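One way to combine the two is to let the regex extract the numbers and let `datetime.date` decide whether they form a real calendar date. This is a sketch under stated assumptions: the function name `is_valid_month_day`, the 'month-day' input format, and the default year are all chosen for illustration.

```python
import re
from datetime import date

def is_valid_month_day(text, year=2024):
    """Hypothetical helper: check a 'month-day' string against a real calendar."""
    m = re.fullmatch(r'(\d{1,2})-(\d{1,2})', text)
    if m is None:
        return False          # not even the right shape
    month, day = int(m.group(1)), int(m.group(2))
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True

print(is_valid_month_day('2-30'))             # False: February never has 30 days
print(is_valid_month_day('4-31'))             # False: April has 30 days
print(is_valid_month_day('4-30'))             # True
print(is_valid_month_day('2-29', year=2023))  # False: 2023 is not a leap year
```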

Greedy vs. Non-Greedy Matching

Regex uses greedy matching by default (matches as many characters as possible):

# Greedy match (\d+ consumes all digits, leaving 0* with nothing)
>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')

# Non-greedy match (add ? to \d+ to match minimal digits)
>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
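The same difference shows up with any repeated element, not just \d. The sketch below wraps the non-greedy pattern in a helper and adds a second comparison using angle brackets; the name `split_trailing_zeros` and the sample strings are made up for this example.

```python
import re

def split_trailing_zeros(digits):
    """Hypothetical helper: separate a number string from its trailing zeros."""
    m = re.fullmatch(r'(\d+?)(0*)', digits)
    return m.groups() if m else None

print(split_trailing_zeros('102300'))  # ('1023', '00')
print(split_trailing_zeros('1000'))    # ('1', '000')
print(split_trailing_zeros('7'))       # ('7', '')

# Greedy .+ runs to the last '>', non-greedy .+? stops at the first one.
print(re.match(r'<.+>', '<a><b>').group(0))   # <a><b>
print(re.match(r'<.+?>', '<a><b>').group(0))  # <a>
```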

Compilation

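When you call a module-level function such as match() or search(), the re module first has to obtain a compiled form of the pattern before it can do any matching (it keeps an internal cache of recently used patterns, so the pattern is not necessarily recompiled on every call). For a pattern that is reused frequently, you can compile it once yourself and call methods on the compiled object, which avoids that step on every call: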

>>> import re
# Precompile the pattern
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Reuse the compiled pattern
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')

A compiled regular expression object stores the pattern internally, so its methods are called without passing the pattern string again.
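In a script, a common arrangement is to compile the pattern once at module level and reuse it inside a function. The sketch below follows that arrangement; `RE_TELEPHONE` and `split_phone` are names chosen for illustration, and returning None for non-matching input is a design choice of this example.

```python
import re

# Compiled once when the module is loaded, then reused by every call.
RE_TELEPHONE = re.compile(r'^(\d{3})-(\d{3,8})$')

def split_phone(number):
    """Hypothetical helper: return (area_code, local_number), or None if invalid."""
    m = RE_TELEPHONE.match(number)
    # match() returns None on failure, so check before calling groups().
    return m.groups() if m else None

for number in ['010-12345', '010-8086', '010 12345']:
    print(number, split_phone(number))
```

Checking the result before calling groups() avoids an AttributeError when the input does not match.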

Summary

Regular expressions are extremely powerful—covering all their features would require an entire book. Keep a regex reference handy if you work with them frequently.

Exercises

Exercise 1: Validate Email Addresses

Write a regex to validate Emails like:

someone@gmail.com
bill.gates@microsoft.com

import re

def is_valid_email(addr):
    # Implement the regex pattern here
    return True

# Test cases
assert is_valid_email('someone@gmail.com')
assert is_valid_email('bill.gates@microsoft.com')
assert not is_valid_email('bob#example.com')
assert not is_valid_email('mr-bob@example.com')
print('ok')

Exercise 2: Extract Names from Emails

Extract names from formatted Email addresses:

<Tom Paris> tom@voyager.org → Tom Paris
bob@example.com → bob

import re

def name_of_email(addr):
    # Implement the regex pattern here
    return None

# Test cases
assert name_of_email('<Tom Paris> tom@voyager.org') == 'Tom Paris'
assert name_of_email('tom@voyager.org') == 'tom'
print('ok')

Reference Source Code

regex.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re

print("Test: 010-12345")
m = re.match(r"^(\d{3})-(\d{3,8})$", "010-12345")
print(m.group(1), m.group(2))

t = "19:05:30"
print("Test:", t)
m = re.match(r"^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$", t)
print(m.groups())
