Python Regex: An Introduction to Using Regular Expressions

As a new developer, you may find regular expressions (or regex, as they are commonly known) daunting because of the strange, unfamiliar syntax:

 ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$


When I was a new developer and saw the expression above for the first time, I remember my first thought was “say what? I’m never going to be able to learn that gobbledygook. That’s too much for me. I can’t do it.”

But you can! It’s just a matter of getting over that initial overwhelming feeling that regular expressions are too foreign. Just like learning or speaking a foreign language — once you get the hang of it, it will come pretty naturally. I promise!

In this article, we take a look at what regular expressions are, why they are used, and how to use them in Python. By the time you finish this article, you’ll have a solid understanding of regular expressions so you can interpret what the expression above means!

What are Regular Expressions?

Before we look at what a regular expression is, we need to remind ourselves of what a string is. Remember that a string is just a collection of characters that are strung together and bounded by a set of quotation marks:

“Hello World”

“555-555-5555”

“John Doe
123 Main St.
Anywhere, USA 99999-9999”

“email@email.io”

These are all examples of strings. They can be of any length, or no length at all. A regular expression, commonly shortened to regex, is an expression whose components describe a pattern that is matched against strings to find some sort of information.

What’s great about regex is that it doesn’t necessarily care about the language you are coding in — it’s fairly language agnostic. The difference comes in the language’s methods and how to perform actions using those expressions.

What are Regular Expressions Used For?

A regex describes a pattern that we want to look for in a string. We can use regex to search for a particular phrase or pattern and replace it with something else, or to validate form input so that users enter information in a consistent format.

Search and Replace

Say for instance we have a phone number presented in this format:

 555 555 5555

This is a valid format for a phone number in the United States. But what if we wanted to replace the spaces with dashes? Or add parentheses around the area code and a dash to make it more readable?

We can use regular expressions for that! We’ll go over in the next section how to do that with Python — for now I want you to get the general feel for what you can do with regular expressions and how they can be useful.

The result we would see after a search and replace on a phone number in Python for the format we would like to use would be:

555-555-5555
     or
(555)555-5555

There is no need to hardcode or look for a specific value or index of the string in regular expressions because we can just look for the patterns in the strings and manipulate all of the records we have to match.

Validate

Have you ever filled something in on a site only for it to display an error message because of a missed symbol or a pattern you didn’t follow? More than likely regular expressions were used to make sure your input matched what their database is looking for.

This is called validation and is super useful when building forms to be certain that a phone number follows the format you’d like for it to be in or that an email address is a properly formatted email address, or a password matches the parameters you have set for it to be a valid password (length, special characters, digits, upper or small case, etc).

This helps prevent errors in your database by alerting the user to typos or mismatched patterns.
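As a rough sketch of what validation can look like, here is the email pattern from the top of this article applied in Python (the re module is covered in more detail further down). The helper name is_valid_email is made up for illustration, and real-world email validation is usually more involved than this.

```python
import re

# The email pattern shown at the start of this article
EMAIL_PATTERN = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

def is_valid_email(text):
    """Hypothetical helper: True if text fits the pattern, False otherwise."""
    return re.search(EMAIL_PATTERN, text) is not None

print(is_valid_email("email@email.io"))  # True
print(is_valid_email("email.email.io"))  # False, there is no @ symbol
```

A form handler could call a check like this before saving anything and show an error message when it returns False.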

In the next section, we will take a look at the components or pattern matchers that build regular expressions.

Types of Regex Pattern Matchers

Literal characters, metacharacters, and quantifiers make up the types of pattern matchers we see in regex. A pattern matcher is a character that is used to help find a pattern in a string. It is the primary building block of a regular expression.

Literal Characters

The most basic example of a pattern matcher in regex is a literal character. It matches a hard coded character or string.

Examples:

hello ⇒ collection of five distinct characters.

When a regex pattern is applied here, it looks for each of these characters in succession. “hello”, “helloing”, or “helloed” would pass a pattern check, but “Hello”, “helo”, or “HeLlo” would not.

A ⇒ collection of one distinct character.

Because regex looks for distinct characters, it is case sensitive too. So “A” would pass, but “a” would not. We’ll get into this more in a little bit.

A simple sentence\. ⇒ collection of several distinct characters.

Regex looks for every character in the expression in succession when it looks at a string. If “A simple sentence” is not in the searched string exactly as it is written in the regex, it would not pass.
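Here is a small sketch of the hello example in Python, assuming we just want to know which strings contain the pattern. The list of candidate words is the one from the example above.

```python
import re

# The literal pattern "hello" must appear exactly, character for character
candidates = ["hello", "helloing", "helloed", "Hello", "helo", "HeLlo"]

# Keep only the strings in which the pattern is found somewhere
passing = [word for word in candidates if re.search("hello", word)]

print(passing)  # ['hello', 'helloing', 'helloed']
```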

Escape Characters

Take a look at the last example. Notice that there is a \ in front of the period. A dot has a special meaning in regular expression syntax (it matches any character), much like a reserved keyword has a special meaning in languages like JavaScript or Python.

You can’t use a period/dot on its own if you want it to be part of a pattern in regular expressions. You have to escape the character in order for the regular expression engine to interpret that as a literal representation of a period instead of the regex meaning.

Here are some other special sequences of characters that need to be escaped if you want the literal character instead of the translated meaning that the regex engine compiles it to.

Asterisk *
Backslash \
Plus +
Caret ^
Dollar Sign $
Dot/Period .
Pipe |
Question Mark ?
Parentheses – both types ()
Curly Braces – both types {}
Square Brackets – both types []

Literal characters in regex match exactly with the character you include as part of the pattern. If you want to include a character that’s listed above, be sure to escape it so that it can also be a part of your regex.
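To see the difference an escape makes, here is a minimal sketch in Python. The sample text is invented for the demonstration. The last lines use re.escape(), a function in the re module that adds the backslashes for you.

```python
import re

text = "A simple sentence! A simple sentence."

# Unescaped, the dot matches any character, so the "!" version is found first
unescaped = re.search(r"A simple sentence.", text)

# Escaped, the dot only matches a literal period
escaped = re.search(r"A simple sentence\.", text)

print(unescaped.group())  # A simple sentence!
print(escaped.group())    # A simple sentence.

# re.escape() escapes the special characters in a string for us
escaped_pattern = re.escape("1+1=2?")
print(re.search(escaped_pattern, "We know that 1+1=2?").group())  # 1+1=2?
```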

Common Matchers

A matcher can stand in for more than one possible character in a pattern. This collection of pattern matching symbols is fairly consistent among the programming languages that use regex.

Matcher	Description	Example
.	Matches any single character (except a newline by default)	n.w would match now, naw, or new. Any character in the middle passes the test
^regex	Looks for the pattern at the beginning of the line	^hello would match hello in a line that starts with that pattern
regex$	Looks for the pattern at the end of the line	world$ would match world in a line that ends with that pattern
[abc]	Matches a, b, or c	[misp] is a set and matches any one of those characters. It could match every letter in mississippi and miss, but only some of the letters in marsh and missouri
[abc][xyz]	Matches a, b, or c followed by x, y, or z	[Mm][sip] would match an M or m followed by one of s, i, or p, as in Ms, mi, or mp
[^abc]	Matches any character that is not a, b, or c	[^rstlne] would match any character that is not r, s, t, l, n, or e
[a-zA-Z0-9]	Matches any character within the ranges	[a-n] would match any character from a through n. Every letter in and, end, blind, and can matches here
A|B	Matches A or B	M|m. would match either a capital M on its own or a lowercase m followed by any one character
CAT	Matches C, followed by A, followed by T	hello world would match hello world exactly
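A few of these matchers can be tried out in Python with a short sketch like the one below. It uses re.findall(), a function in the re module that returns every match as a list. The sample strings are only there for illustration.

```python
import re

# . matches any single character
any_char = re.findall(r"n.w", "now naw new")
print(any_char)  # ['now', 'naw', 'new']

# ^ and $ anchor the pattern to the start or end of the line
starts_with_hello = bool(re.search(r"^hello", "hello world"))
ends_with_hello = bool(re.search(r"hello$", "hello world"))
print(starts_with_hello, ends_with_hello)  # True False

# A set matches one character at a time
from_set = re.findall(r"[misp]", "marsh")
print(from_set)  # ['m', 's']

# A negated set matches one character that is not listed
not_in_set = re.findall(r"[^rstlne]", "python")
print(not_in_set)  # ['p', 'y', 'h', 'o']

# Alternation: a capital M, or a lowercase m plus any one character
either = re.findall(r"M|m.", "Mama")
print(either)  # ['M', 'ma']
```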

Metacharacters

Regular expressions also use metacharacters to describe a pattern. Metacharacters have some sort of meaning behind them and will describe the shape of the pattern.

Metacharacter	Description	Example
\d	Matches any digit	\d would match 1, 2, or 3, etc. Shorthand for [0-9]
\D	Matches any non-digit character	\D would match A, B, g, etc. Shorthand for [^0-9]
\s	Matches any whitespace character	\s would match new lines, tabs, spaces, etc.
\S	Matches any non-whitespace character	\S would match any character except a whitespace character
\w	Matches any word character	\w would match a, Z, 7, or _. Shorthand for [a-zA-Z0-9_]
\W	Matches any non-word character	\W would match special characters such as !, @, or a space. Shorthand for [^\w]

Note: Capital letter metacharacters (\W, \D, etc) usually correspond to the opposite of what the lowercase letter metacharacters do (\w, \d, etc). In Python 3, \d and \w also match digits and letters from other writing systems unless the re.ASCII flag is used.
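The metacharacters can be checked the same way. This is a small sketch using re.findall(), which returns every match in a list. The sample strings are made up for the demonstration.

```python
import re

text = "Room 42, Floor 7"

digits = re.findall(r"\d", text)
print(digits)  # ['4', '2', '7']

whitespace = re.findall(r"\s", text)
print(len(whitespace))  # 3

word_chars = re.findall(r"\w", "a_1!")
print(word_chars)  # ['a', '_', '1']

non_word_chars = re.findall(r"\W", "a_1!")
print(non_word_chars)  # ['!']

non_digits = re.findall(r"\D", "a1b2")
print(non_digits)  # ['a', 'b']
```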

Quantifiers

Quantifier	Description	Example
+	One or more of the preceding character	\d+ would match one or more digits
*	Zero or more of the preceding character	.* would match any character 0 or more times. Note: Technically an empty string would fulfill this regex!
?	Zero or one of the preceding character	a?.* would match a, any, hello, world
{number}	Matches the preceding character exactly number times	\d{3} matches exactly three digits [0-9]
{num1,num2}	Matches the preceding character between num1 and num2 times	\d{3,5} matches 3 to 5 digits that are [0-9]
{num,}	Matches the preceding character num or more times	[A-Za-z]{2,} matches two or more letters

Use the quantifiers, metacharacters, and other matchers as the building blocks for your regular expressions. The syntax mentioned here is similar across multiple languages that use regular expressions.
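To get a feel for the quantifiers, here is a minimal sketch that tests a handful of digit strings against a few patterns. The helper name matching and the sample list are hypothetical. The ^ and $ anchors make sure the whole string has to fit the pattern.

```python
import re

samples = ["7", "42", "555", "12345", "1234567"]

def matching(pattern):
    """Hypothetical helper: return the samples in which the pattern is found."""
    return [s for s in samples if re.search(pattern, s)]

print(matching(r"^\d+$"))      # one or more digits: every sample
print(matching(r"^\d?$"))      # zero or one digit: ['7']
print(matching(r"^\d{3}$"))    # exactly three digits: ['555']
print(matching(r"^\d{3,5}$"))  # three to five digits: ['555', '12345']

# * allows zero occurrences, so even an empty string passes
empty_ok = bool(re.search(r"^\d*$", ""))
print(empty_ok)  # True
```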

However, there are some things that are used in Ruby or JavaScript, for instance, that would not be transferable to Python.

Let’s learn a little more about how regular expressions work in Python in the next section.

How Do Regular Expressions Work in Python?

To use regular expressions in Python, import the re module at the top of your file.

import re 

It’s that simple! There is nothing special you have to add or packages you need to download. It’s already built-in as long as you import it.

The fun comes from how to use the different methods that are available to us in the re module.

re.search(regex, str, flags=0)

Use the search function when you want to apply a regex pattern to a string to see if the pattern is contained in the string. This method will take a regular expression pattern and match it anywhere in the string.

If the regex pattern is NOT contained in the string, re.search() will return None. If there is a match, a Match object will be returned that will contain some information about it.

The Match object has a property and two methods that can be used to retrieve information about the match:

match_obj.span() a method that returns a tuple containing the start and end positions of the match (the end position is exclusive, just like a slice).

match_obj.string a property that returns the string passed into the function.

match_obj.group() a method that returns the part of the string where there was a match.

match_obj here will be replaced with the variable you assign to the result of your re.search() method. Here is an example using each method and property:

import re
 
string = "The quick brown fox jumped over the lazy dog"
 
# re.search() returns a Match object if the pattern is found, or None otherwise.
# The r prefix makes this a raw string, so Python leaves the backslash in \s alone.
result = re.search(r"q.+k\s", string)
print(result.span(), "<== This is the tuple containing the span of indexes the result is in")
print(result.string, "<== This is the original string ")
print(result.group(), "<== This is the group of characters that match our regex pattern")

The main thing to remember here is that the Match object in Python has methods and properties.

We use the property string to access the original string we tested the regex against, the span method to access the indexes, and the group method to get the actual matched result.
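Because re.search() can return None, it is usually worth checking the result before calling any methods on it. Here is one way that could look. The helper name describe is made up for this sketch.

```python
import re

string = "The quick brown fox jumped over the lazy dog"

def describe(pattern, text):
    """Hypothetical helper: summarize the first match, or report that there is none."""
    result = re.search(pattern, text)
    if result is None:
        return "no match"
    start, end = result.span()
    # text[start:end] gives the same characters as result.group()
    return f"{text[start:end]!r} at {start}-{end}"

print(describe(r"q.+k\s", string))  # 'quick ' at 4-10
print(describe(r"cat", string))     # no match
```

Calling result.span() directly on a failed search would raise an AttributeError, which is why the None check comes first.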

There are other methods and properties that can be referenced in the Python docs, but these are good to get you started.

Optional arguments: flags

There is an optional argument you can use with the re module’s search method as well. To use it, set the flags parameter to the flag you would like to use, or to several flags combined with the | operator. The default is 0, which means no flags. Some of the more popular options include:

re.DEBUG

This flag will display debug information about the compiled regex if needed.

re.I
re.IGNORECASE

Case insensitive matching. This will ignore the case of the characters passed into sets or literal characters so that both capital- and lower- case letters will match.

re.M
re.MULTILINE

Multiline mode. Allows beginning of line and end of line regex metacharacters to be used in multiline strings. Otherwise it would just look at the beginning and the end of the string. Without the multiline flag, the regex engine considers the string one line.

re.S
re.DOTALL

This flag tells the regex engine that the dot character will match any character at all, including a newline. The default behavior is for the dot character to match every character except a newline character.

re.X
re.VERBOSE

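The verbose flag allows you to spread your regular expression over several lines and add comments that explain each part. Most whitespace in the pattern is ignored in this mode, so a literal space has to be escaped or written as \s. This may help immensely as you are learning regex in Python.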

This code snippet takes the example from above and incorporates some of the flags listed above. When using more than one flag, use the bitwise | operator in between each as a separator.

import re

string = """The
quick brown
fox jumped over the lazy dog
"""

# The r prefix keeps Python from treating \s as a string escape
result = re.search(r"""
    ^   # beginning of line
    Q   # literal character 'Q'
    .   # dot ==> any character
    +   # quantifier ==> one or more of the preceding character
    k   # literal character 'k'
    \s  # metacharacter ==> any whitespace character
    """, string, flags=re.IGNORECASE | re.M | re.VERBOSE)

if result:
    print(result.span(), "<== This is the tuple containing the span of indexes the result is in")
    print(result.string, "<== This is the original string ")
    print(result.group(), "<== This is the group of characters that match our regex pattern")
else:
    print(result)

Take the time to notice how the flags help us in testing our string. Try taking the flags parameter out. Does the search still return a match? If not, what does it return?

Remember that the search method on the re module takes in the regex, the string to be searched, and an optional flags parameter. A successful search returns a Match object, which is truthy and has properties and methods associated with it. An unsuccessful search returns None, which is falsy.
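The verbose flag is also a handy way to read the email expression from the start of this article. The sketch below splits it up piece by piece. The comments are my reading of each part, and the variable names are made up for illustration.

```python
import re

# The email pattern from the top of the article, one piece per line
EMAIL_VERBOSE = r"""
    ^                  # start of the string
    [A-Za-z0-9._%+-]+  # one or more of the characters allowed before the @
    @                  # a literal @
    [A-Za-z0-9.-]+     # one or more of the characters allowed in the domain name
    \.                 # a literal dot (escaped)
    [A-Za-z]{2,}       # two or more letters, such as io or com
    $                  # end of the string
"""

good = re.search(EMAIL_VERBOSE, "email@email.io", flags=re.VERBOSE)
bad = re.search(EMAIL_VERBOSE, "not an email", flags=re.VERBOSE)

print(good.group())  # email@email.io
print(bad)           # None
```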

re.match(regex, str, flags=0)

Use the match method when you are looking to match a regex pattern to the beginning of the string. If you want to match the regex pattern anywhere in the string, use the search method above instead.

import re
 
string = """The
quick brown
fox jumped over the lazy dog
"""
 
print(re.match("quick", string, flags=re.IGNORECASE | re.MULTILINE))
print(re.search("quick", string, flags=re.IGNORECASE | re.MULTILINE))

When you run this code in a Python interpreter, you will see that the first call, match(), does not return a Match object, but instead None. The second call looks inside the string and returns a Match object. This is because the match() method only looks at the beginning of a string, even if the re.MULTILINE flag is set.
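To make the difference a little more concrete, here is a short sketch, reusing the same three-line string, that compares match() at the very start of the string with search() using the ^ anchor and re.MULTILINE. The variable names are only for illustration.

```python
import re

string = """The
quick brown
fox jumped over the lazy dog
"""

# match() succeeds only when the pattern fits at the very start of the string
at_start = re.match("the", string, flags=re.IGNORECASE)
print(at_start.span())  # (0, 3)

# "quick" starts the second line, not the string, so match() finds nothing
not_at_start = re.match("quick", string, flags=re.MULTILINE)
print(not_at_start)  # None

# search() with ^ and re.MULTILINE can find the start of any line
line_start = re.search("^quick", string, flags=re.MULTILINE)
print(line_start.span())  # (4, 9)
```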

re.sub(findRegex, replaceWith, str, count=0, flags=0)

The sub() method in the re module takes a regex, finds matches in a string from left to right, and replaces each one with something else. The count parameter limits how many replacements are made. If there is no count parameter or it is set to zero, all occurrences will be replaced.

At the beginning of this article we talked about reformatting a phone number. Let’s take a look at how to do that here:

import re
 
phone = "555 555 5555"
correctFormat = re.sub(r"\s", "-", phone)  # raw string keeps the backslash in \s intact
print(correctFormat) # 555-555-5555

This takes our phone string, finds all occurrences of whitespace, and replaces each one with a dash. In this case, we end up with 555-555-5555.

Let’s try a more difficult reformat:

import re
 
phone = "555 555 5555"
 
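left_parens = re.sub(r"^", "(", phone)  # insert "(" at the start of the string
right_parens = re.sub(r"\s", ")", left_parens, count=1)  # replace only the first whitespace
secondCorrectFormat = re.sub(r"\s", "-", right_parens)  # replace the remaining whitespace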
print(secondCorrectFormat) # (555)555-5555

In this example, we go to the beginning of the line to add an opening parenthesis and assign that new string to a variable.

We then take that newly assigned variable (left_parens) and perform a similar operation on it, finding the first whitespace character and replacing it with a closing parenthesis. This is assigned to right_parens.

Finally, we take the right_parens variable and use it to perform the same operation on the final whitespace character to replace it with a dash.

This will give us (555)555-5555.

To recap, the sub() method takes in a regex pattern, a replacement string or function, the actual string we want to perform the sub() on, and a count. If we don’t provide a count, it will perform the replacements on all occurrences. It returns the new string with the substitutions performed.
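The replacement can also be a function, as mentioned in the recap. Here is a minimal sketch, assuming we want to hide part of a phone number. The function name mask is hypothetical. sub() calls it once per match, passes in the Match object, and uses whatever string it returns as the replacement.

```python
import re

phone = "555 555 5555"

def mask(match_obj):
    """Hypothetical replacement function: one X for every matched digit."""
    return "X" * len(match_obj.group())

# Replace every run of digits
fully_masked = re.sub(r"\d+", mask, phone)
print(fully_masked)  # XXX XXX XXXX

# Replace only the first two runs of digits
partly_masked = re.sub(r"\d+", mask, phone, count=2)
print(partly_masked)  # XXX XXX 5555
```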

Conclusion

Regular expressions are a way to validate data or to search and replace characters in our strings. Regex consists of metacharacters, quantifiers, and literal characters that can be used to test whether a string passes a validation check or to search for matches and replace them.

Regex can be a little overwhelming at first, but once you get it, it’s a little bit like riding a bike. It’ll be in the back of your memory and super easy to pick up again.

When you feel you have a handle on what comes up in this article, take a look at the Python docs to see what else can be done with regular expressions. Definitely take a look at the compile and split methods.
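As a small taste of those two, here is a sketch that reuses the phone number from earlier. re.compile() builds a reusable pattern object, and re.split() breaks a string apart wherever the pattern matches. The variable names are made up for illustration.

```python
import re

phone = "555 555 5555"

# compile() turns the pattern into an object we can reuse
whitespace = re.compile(r"\s")
dashed = whitespace.sub("-", phone)
print(dashed)  # 555-555-5555

# split() cuts the string at every match of the pattern
parts = re.split(r"\s", phone)
print(parts)  # ['555', '555', '5555']
```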

Happy regexing!


About the Author

Christina Kopecky

Technical Writer at Career Karma

Christina is an experienced technical writer, covering topics as diverse as Java, SQL, Python, and web development. She earned her Master of Music in flute performance from the University of Kansas and a bachelor's degree in music with minors in French an... read more about the author

Jan 26, 2021