Strings are the most commonly used data structure in programming, and the need to manipulate strings is almost ubiquitous. For example, to determine if a string is a valid Email address, you could programmatically extract substrings before and after the @ symbol and verify if they are a valid username and domain name respectively—but this approach is cumbersome and the code is difficult to reuse.
A regular expression is a powerful tool for matching strings. Its design philosophy is to define a rule for strings using a descriptive language: any string that conforms to the rule is considered a “match,” while non-conforming strings are deemed invalid.
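Thus, validating an Email address with a regular expression takes two steps: write a regular expression that describes valid Email addresses, then test whether the user's input matches it.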
Since regular expressions are also represented as strings, we first need to understand how to describe characters using special syntax.
In regular expressions:
\d matches a single digit;
\w matches a single letter, digit, or underscore.
Examples:
'00\d' matches '007' but not '00A';
'\d\d\d' matches '010';
'\w\w\d' matches 'py3';
. matches any single character other than a newline (e.g., 'py.' matches 'pyc', 'pyo', 'py!', etc.).
To match variable-length characters:
* matches zero or more of the preceding character;
+ matches one or more of the preceding character;
? matches zero or one of the preceding character;
{n} matches exactly n of the preceding character;
{n,m} matches between n and m of the preceding character.
Consider a more complex example:
\d{3}\s+\d{3,8}
Let’s break this down from left to right:
\d{3}: Matches exactly 3 digits (e.g., '010');
\s+: Matches one or more whitespace characters (spaces, tabs, etc., e.g., ' ', '\t');
\d{3,8}: Matches 3 to 8 digits (e.g., '1234567').
Combined, this pattern matches a phone number whose area code is separated from the rest of the number by one or more whitespace characters (e.g., '010 1234567').
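As a quick check, the pattern can be tried out with Python's built-in re module, which is introduced later in this article. This is only a minimal sketch; the variable name phone_pattern is chosen for illustration.

```python
import re

# Pattern from the text: 3 digits, one or more whitespace characters, 3 to 8 digits.
phone_pattern = r'\d{3}\s+\d{3,8}'

# fullmatch() requires the whole string to match the pattern.
print(bool(re.fullmatch(phone_pattern, '010 1234567')))   # True
print(bool(re.fullmatch(phone_pattern, '010\t12345')))    # True: a tab counts as whitespace
print(bool(re.fullmatch(phone_pattern, '0101234567')))    # False: no whitespace separator
print(bool(re.fullmatch(phone_pattern, '010 12')))        # False: too few digits after the space
```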
What about matching numbers like '010-12345'? Outside a character class ([]), - is an ordinary character, so escaping it is optional; the examples here write it as \-, which matches a literal hyphen either way:
\d{3}\-\d{3,8}
However, this still won’t match '010 - 12345' (with spaces around -), requiring a more complex pattern.
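One way to tolerate optional spaces around the hyphen is to put \s* on either side of it. The sketch below shows one possible pattern, not the only one; the name loose_phone is an assumption made for illustration, and the re module it relies on is introduced later in this article.

```python
import re

# \s* allows zero or more whitespace characters on each side of the hyphen.
loose_phone = r'\d{3}\s*-\s*\d{3,8}'

print(bool(re.fullmatch(loose_phone, '010-12345')))     # True
print(bool(re.fullmatch(loose_phone, '010 - 12345')))   # True
print(bool(re.fullmatch(loose_phone, '010 -12345')))    # True
print(bool(re.fullmatch(loose_phone, '010 12345')))     # False: the hyphen is still required
```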
For more precise matching, use [] to define a character range:
[0-9a-zA-Z\_]: Matches a single digit, letter, or underscore;
[0-9a-zA-Z\_]+: Matches a string of one or more digits, letters, or underscores (e.g., 'a100', '0_Z', 'Py3000');
[a-zA-Z\_][0-9a-zA-Z\_]*: Matches a valid Python variable name (starts with a letter or underscore, followed by any number of digits, letters, or underscores);
[a-zA-Z\_][0-9a-zA-Z\_]{0,19}: Restricts Python variable names to 1–20 characters (1 initial character + up to 19 subsequent characters).
Additional syntax:
A|B: Matches either A or B (e.g., (P|p)ython matches 'Python' or 'python');
^: Matches the start of a line (e.g., ^\d requires the string to start with a digit);
$: Matches the end of a line (e.g., \d$ requires the string to end with a digit).
Note: py matches 'python', but ^py$ enforces an exact full-line match (only matches 'py').
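The effect of the anchors and of alternation can be checked with a few calls to Python's re module (covered in the next part of this article). The following is a rough sketch; the names identifier and anchored are illustrative only.

```python
import re

# Variable-name pattern from the text (ASCII letters, digits and underscores).
identifier = r'[a-zA-Z\_][0-9a-zA-Z\_]*'

# Without anchors, search() finds the pattern anywhere in the string.
print(bool(re.search(r'py', 'python')))      # True
# With ^ and $, the whole line must be exactly 'py'.
print(bool(re.search(r'^py$', 'python')))    # False
print(bool(re.search(r'^py$', 'py')))        # True

# Anchoring the identifier pattern rejects names that start with a digit.
anchored = '^' + identifier + '$'
print(bool(re.search(anchored, 'Py3000')))   # True
print(bool(re.search(anchored, '3py')))      # False

# A|B inside a group: either capitalisation is accepted.
print(bool(re.search(r'^(P|p)ython$', 'python')))   # True
```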
re Module
With this foundation, we can use regular expressions in Python via the built-in re module. Since Python strings also use \ for escaping, extra care is needed:
s = 'ABC\\-001' # Python string
# Corresponding regex pattern: 'ABC\-001'
To avoid escape issues, always use Python’s r prefix for raw strings:
s = r'ABC\-001' # Python string
# Corresponding regex pattern remains: 'ABC\-001'
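To confirm that the two spellings really denote the same pattern, they can be compared directly. A small sketch, with variable names chosen for illustration:

```python
import re

# The escaped literal and the raw literal are the same 8-character string.
escaped = 'ABC\\-001'
raw = r'ABC\-001'

print(escaped == raw)   # True
print(len(raw))         # 8
print(raw)              # ABC\-001

# Used as a regex, the pattern matches the text 'ABC-001'.
print(bool(re.match(raw, 'ABC-001')))   # True
```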
The match() method checks whether a regex pattern matches at the beginning of a string:
>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<re.Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>> # Returns None (no match)
match() returns a Match object if the pattern matches, or None if there is no match. A common validation pattern:
test = 'user_input_string'
if re.match(r'regex_pattern', test):
    print('ok')
else:
    print('failed')
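The same check can be wrapped in a small helper so that it is reusable. This is a minimal sketch that assumes the phone-number format used above; is_phone_number is a hypothetical name, not part of the re module.

```python
import re

def is_phone_number(text):
    """Return True if text looks like '010-12345' (3 digits, '-', 3 to 8 digits)."""
    # re.match() returns a Match object on success and None on failure.
    return re.match(r'^\d{3}-\d{3,8}$', text) is not None

print(is_phone_number('010-12345'))   # True
print(is_phone_number('010 12345'))   # False
```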
Regex-based splitting is more flexible than fixed-character splitting:
# Basic split (fails with consecutive spaces)
>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']
# Regex split (handles any number of spaces)
>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']
# Split on spaces/commas
>>> re.split(r'[\s\,]+', 'a,b, c d')
['a', 'b', 'c', 'd']
# Split on spaces/commas/semicolons
>>> re.split(r'[\s\,\;]+', 'a,b;; c d')
['a', 'b', 'c', 'd']
Use this to normalize messy user input (e.g., a list of tags) into a clean list.
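As an illustration of that kind of clean-up, the sketch below splits a tag string and drops the empty strings that re.split() produces when the input starts or ends with a separator. The function name parse_tags and the sample input are assumptions made for this example.

```python
import re

def parse_tags(raw):
    """Split a tag string on runs of whitespace, commas or semicolons."""
    # A leading or trailing separator yields an empty string, so filter those out.
    return [tag for tag in re.split(r'[\s,;]+', raw) if tag]

print(parse_tags('  python, regex;;  re ,tips '))
# ['python', 'regex', 're', 'tips']
```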
Beyond simple matching, regex can extract substrings using () to define groups:
>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m.group(0) # Entire matched string
'010-12345'
>>> m.group(1) # First group (area code)
'010'
>>> m.group(2) # Second group (local number)
'12345'
group(0): Always the full matched string;
group(1), group(2), etc.: The corresponding captured groups.
Practical Example (Time Validation):
>>> t = '19:05:30'
>>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
>>> m.groups()
('19', '05', '30')
Regular expressions have limits, however: recognizing that dates such as '2-30' or '4-31' are invalid is impractical with a regex alone and requires additional code logic.
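One common way to combine the two is to let the regex check the shape of the input and let ordinary code check the calendar. The sketch below assumes a full 'YYYY-M-D' date format (the year is needed to decide leap years) and uses the standard datetime module; is_valid_date is a hypothetical helper name.

```python
import re
from datetime import date

def is_valid_date(text):
    """Check a 'YYYY-M-D' string: regex for the shape, datetime for the calendar."""
    m = re.match(r'^(\d{4})-(\d{1,2})-(\d{1,2})$', text)
    if m is None:
        return False
    year, month, day = (int(g) for g in m.groups())
    try:
        date(year, month, day)   # raises ValueError for impossible dates
    except ValueError:
        return False
    return True

print(is_valid_date('2024-2-29'))   # True (2024 is a leap year)
print(is_valid_date('2023-2-30'))   # False
print(is_valid_date('2023-4-31'))   # False
print(is_valid_date('2023-4-30'))   # True
```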
Regex uses greedy matching by default (matches as many characters as possible):
# Greedy match (\d+ consumes all digits, leaving 0* with nothing)
>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')
# Non-greedy match (add ? to \d+ to match minimal digits)
>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
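The non-greedy pattern can be wrapped in a function to see how it behaves on other inputs. This is only a sketch; split_trailing_zeros is a name invented for the example.

```python
import re

def split_trailing_zeros(digits):
    """Split a digit string into (leading part, trailing zeros), or None if not all digits."""
    # \d+? takes as few digits as possible, leaving the trailing zeros for 0*.
    m = re.match(r'^(\d+?)(0*)$', digits)
    return m.groups() if m else None

print(split_trailing_zeros('102300'))   # ('1023', '00')
print(split_trailing_zeros('1000'))     # ('1', '000')
print(split_trailing_zeros('123'))      # ('123', '')
print(split_trailing_zeros('0'))        # ('0', ''): \d+? must still match one digit
```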
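When you call a module-level function such as re.match() or re.search(), the re module first compiles the pattern (or looks it up in a small internal cache of recently used patterns) and then performs the match. For frequently reused patterns, precompiling avoids this repeated work: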
>>> import re
# Precompile the pattern
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Reuse the compiled pattern
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')
Compiled Regular Expression objects store the pattern internally, eliminating repeated compilation.
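A typical arrangement is to compile the pattern once at module level and reuse it inside a function that is called many times. The following sketch assumes the same phone-number pattern; RE_TELEPHONE and parse_numbers are illustrative names.

```python
import re

# Compiled once, when the module is loaded.
RE_TELEPHONE = re.compile(r'^(\d{3})-(\d{3,8})$')

def parse_numbers(lines):
    """Return (area_code, local_number) tuples for the lines that match."""
    results = []
    for line in lines:
        m = RE_TELEPHONE.match(line)   # no pattern argument needed here
        if m:
            results.append(m.groups())
    return results

print(parse_numbers(['010-12345', 'bad input', '021-8086']))
# [('010', '12345'), ('021', '8086')]
```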
Regular expressions are extremely powerful—covering all their features would require an entire book. Keep a regex reference handy if you work with them frequently.
Write a regex to validate Emails like:
someone@gmail.com
bill.gates@microsoft.com
import re
def is_valid_email(addr):
    # Implement the regex pattern here
    return True
# Test cases
assert is_valid_email('someone@gmail.com')
assert is_valid_email('bill.gates@microsoft.com')
assert not is_valid_email('bob#example.com')
assert not is_valid_email('mr-bob@example.com')
print('ok')
Extract names from formatted Email addresses:
<Tom Paris> tom@voyager.org → Tom Paris
bob@example.com → bob
import re
def name_of_email(addr):
    # Implement the regex pattern here
    return None
# Test cases
assert name_of_email('<Tom Paris> tom@voyager.org') == 'Tom Paris'
assert name_of_email('tom@voyager.org') == 'tom'
print('ok')
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
print("Test: 010-12345")
m = re.match(r"^(\d{3})-(\d{3,8})$", "010-12345")
print(m.group(1), m.group(2))
t = "19:05:30"
print("Test:", t)
m = re.match(r"^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$", t)
print(m.groups())