As a new developer, you may find regular expressions (or regex, as they are commonly known) daunting because of the strange, unfamiliar syntax:
^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$
When I was a new developer and saw the expression above for the first time, I remember my first thought was “say what? I’m never going to be able to learn that gobbledygook. That’s too much for me. I can’t do it.”
But you can! It’s just a matter of getting over that initial overwhelming feeling that regular expressions are too foreign. Just like learning or speaking a foreign language — once you get the hang of it, it will come pretty naturally. I promise!
In this article, we take a look at what regular expressions are, why they are used, and how to use them in Python. By the time you finish this article, you’ll have a solid understanding of regular expressions so you can interpret what the expression above means!
What are Regular Expressions?
When we take a look at what a regular expression is, we need to remind ourselves of what a string is. Remember that a string is just a collection of characters that are strung together and bounded by a set of quotation marks:
“Hello World”
“555-555-5555”
“John Doe 123 Main St. Anywhere, USA 99999-9999”
“email@email.io”
These are all examples of strings. They can be of any length, or no length at all. A regular expression, commonly shortened to regex, is a pattern that describes the text you are looking for. A regex engine compares that pattern against a string to find, extract, or validate information.
What’s great about regex is that it doesn’t necessarily care about the language you are coding in — it’s fairly language agnostic. The difference comes in the language’s methods and how to perform actions using those expressions.
What are Regular Expressions Used For?
A regex describes a pattern that we want to look for in a string. We can use it to search for a particular phrase or pattern and replace it with something else, or to validate form input so that every user enters information in a consistent format.
Search and Replace
Say for instance we have a phone number presented in this format:
555 555 5555
This is a valid format for a phone number in the United States. But what if we wanted to replace the spaces with dashes? Or add parentheses around the area code and a dash to make it more readable?
We can use regular expressions for that! We’ll go over in the next section how to do that with Python — for now I want you to get the general feel for what you can do with regular expressions and how they can be useful.
After a search and replace in Python, the phone number would look like one of these, depending on the format we choose:
555-555-5555 or (555)555-5555
With regular expressions there is no need to hardcode a specific value or string index. We describe the pattern once and apply it to every record that matches.
Validate
Have you ever filled something in on a site only for it to display an error message because of a missed symbol or a pattern you didn’t follow? More than likely regular expressions were used to make sure your input matched what their database is looking for.
This is called validation. It is super useful when building forms: you can be certain that a phone number follows the format you want, that an email address is properly formatted, or that a password meets the rules you have set for it (length, special characters, digits, upper or lower case, etc).
This helps prevent errors in your database by alerting the user to typos or mismatched patterns.
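As a rough sketch of what validation can look like, the snippet below checks a few strings against the email pattern from the top of this article. The helper name is_valid_email is an assumption made for illustration, and the pattern only checks the general shape of an address, not whether the address exists.

```python
import re

# The email pattern shown at the top of this article, written as a raw string.
EMAIL_PATTERN = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"


def is_valid_email(value):
    """Return True when value has the general shape of an email address.

    is_valid_email is a hypothetical helper name used only for illustration.
    """
    return re.search(EMAIL_PATTERN, value) is not None


for candidate in ["email@email.io", "email@email", "email email.io"]:
    if is_valid_email(candidate):
        print(candidate, "looks like an email address")
    else:
        print(candidate, "is not in the expected format")
```

A form handler could call a helper like this before saving the input and show an error message when it returns False.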
In the next section, we will take a look at the components or pattern matchers that build regular expressions.
Types of Regex Pattern Matchers
Literal characters, metacharacters, and quantifiers make up the types of pattern matchers we see in regex. A pattern matcher is a character that is used to help find a pattern in a string. It is the primary building block of a regular expression.
Literal Characters
The most basic example of a pattern matcher in regex is a literal character. It matches a hard coded character or string.
Examples:
hello
⇒ collection of five distinct characters.
When a regex pattern is applied here, it looks for each of these characters in succession. “hello”, “helloing”, or “helloed” would pass a pattern check, but “Hello”, “helo”, or “HeLlo” would not.
A
⇒ collection of one distinct character.
Because regex looks for distinct characters, it is case sensitive too. So “A” would pass, but “a” would not. We’ll get into this more in a little bit.
A simple sentence\.
⇒ collection of several distinct characters.
Regex looks for every character in the expression in succession when it looks at a string. If “A simple sentence.” is not in the searched string exactly as it is written in the regex, it would not pass.
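Here is a minimal sketch of literal matching in Python, using the re.search function that is covered later in this article. The candidate words are the ones from the hello example; the variable names are only illustrative.

```python
import re

pattern = "hello"  # five literal characters, matched in order

# True means the pattern was found somewhere in the candidate string.
results = {}
for candidate in ["hello", "helloing", "Hello", "helo"]:
    results[candidate] = re.search(pattern, candidate) is not None
    print(candidate, "->", results[candidate])
```

The capitalized and misspelled candidates fail because every literal character has to appear exactly as written, in the same order.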
Escape Characters
Take a look at the last example. Notice that there is a \ in front of the period. In regular expression syntax, a dot is a reserved character with its own meaning (it matches any character), much like keywords are reserved in languages like JavaScript or Python.
You can’t use a period/dot on its own if you want it to be part of a pattern in regular expressions. You have to escape the character in order for the regular expression engine to interpret that as a literal representation of a period instead of the regex meaning.
Here are some other special sequences of characters that need to be escaped if you want the literal character instead of the translated meaning that the regex engine compiles it to.
- Asterisk *
- Backslash \
- Plus +
- Caret ^
- Dollar Sign $
- Dot/Period .
- Pipe |
- Question Mark ?
- Parentheses – both types ()
- Curly Braces – both types {}
Literal characters in regex match exactly with the character you include as part of the pattern. If you want to include a character that’s listed above, be sure to escape it so that it can also be a part of your regex.
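The difference between an escaped and an unescaped dot can be sketched as follows. The sample strings are made up for illustration, and re.escape is a function from Python’s re module that adds the backslashes for you.

```python
import re

# Unescaped dot: it matches any character, so the "!" is accepted too.
unescaped = re.search(r"A simple sentence.", "A simple sentence!")

# Escaped dot: only a literal period is accepted.
escaped_miss = re.search(r"A simple sentence\.", "A simple sentence!")
escaped_hit = re.search(r"A simple sentence\.", "A simple sentence.")

print(unescaped is not None)     # True
print(escaped_miss is not None)  # False
print(escaped_hit is not None)   # True

# re.escape() escapes the special characters in a string so that the
# whole string is treated as literal text.
literal_pattern = re.escape("1+1=2?")
print(re.search(literal_pattern, "Is 1+1=2? Yes.") is not None)  # True
```

Raw strings (the r prefix) keep Python from interpreting the backslash before the regex engine sees it.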
Common Matchers
A matcher lets one piece of a pattern match more than one possible character. This collection of pattern matching symbols is fairly consistent among the programming languages that use regex.
Matcher | Description | Example |
. | Matches any character (except a newline by default) | n.w would match now, naw, or new, etc. Any single character in the middle passes the test |
^regex | Looks for pattern at beginning of the line | ^hello would match hello in a line that starts with that pattern |
regex$ | Looks for pattern at end of the line | world$ would match world in a line that ends with that pattern |
[abc] | Matches a, b, or c | [misp] is considered a set and matches any one of those characters. For example, it matches every letter in mississippi and miss, but only some of the letters in marsh and missouri |
[abc][xyz] | Matches a, b, or c followed by x, y, or z | [Mm][sip] would match M or m immediately followed by s, i, or p, anywhere in the string |
[^abc] | Matches any character that is not a, b, or c | [^rstlne] would match any character that is not r, s, t, l, n, or e |
[a-zA-Z0-9] | Matches any character within the ranges | [a-n] would match any character between a and n. Every letter in and, end, blind, and can falls inside that range |
A|B | Matches A or B | M|m. would match either M on its own or m followed by any one character, because | splits the whole expression into two alternatives |
CAT | Matches C, followed by A, followed by T | hello world would match hello world exactly |
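A few rows of the table can be tried out with a short sketch like the one below. It uses re.findall, a function from Python’s re module that returns every match as a list, and re.search, which is covered later in this article. The sample strings are made up for illustration.

```python
import re

# The dot needs exactly one character between n and w, so "nw" is skipped.
dot_matches = re.findall(r"n.w", "now naw new nw")
print(dot_matches)  # ['now', 'naw', 'new']

# ^ anchors the pattern to the start, $ anchors it to the end.
starts_with_hello = re.search(r"^hello", "hello world") is not None
ends_with_hello = re.search(r"hello$", "hello world") is not None
print(starts_with_hello, ends_with_hello)  # True False

# A set matches one character at a time; only m and s are in [misp].
set_matches = re.findall(r"[misp]", "marsh")
print(set_matches)  # ['m', 's']

# Two sets in a row: M or m immediately followed by s, i, or p.
pair_matches = re.findall(r"[Mm][sip]", "Ms. Smith")
print(pair_matches)  # ['Ms', 'mi']

# A negated set matches everything that is NOT listed.
negated_matches = re.findall(r"[^rstlne]", "marsh")
print(negated_matches)  # ['m', 'a', 'h']

# Alternation: either "M" alone, or "m" plus any one character.
either_matches = re.findall(r"M|m.", "Mama")
print(either_matches)  # ['M', 'ma']
```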
Metacharacters
Regular expressions also use metacharacters to describe a pattern. Each metacharacter stands for a whole class of characters, such as digits or whitespace, so it describes the shape of the pattern rather than one specific character.
Metacharacter | Description | Example |
\d | Matches any digit | \d would match 1, 2, or 3, etc. Shorthand for [0-9] |
\D | Matches any non-digit character | \D would match A, B, g, etc. Shorthand for [^0-9] |
\s | Matches any whitespace character | \s would match new lines, tabs, spaces, etc. |
\S | Matches any non-whitespace character | \S would match any character except a whitespace character. |
\w | Matches any word character | \w would match a, Z, 7, or _. Shorthand for [a-zA-Z0-9_] when matching ASCII text |
\W | Matches any non-word character | \W would match special characters such as !, @, or a space. Shorthand for [^\w] |
Note: Capital letter metacharacters (\W, \D, etc) usually correspond to the opposite of what the lowercase letter metacharacters do (\w, \d, etc).
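To see each metacharacter at work, here is a small sketch that runs all six against the same made-up sample string. It uses re.findall, a function from Python’s re module that returns every match as a list.

```python
import re

sample = "A1 b2_!"

digits = re.findall(r"\d", sample)
non_digits = re.findall(r"\D", sample)
whitespace = re.findall(r"\s", sample)
non_whitespace = re.findall(r"\S", sample)
word_chars = re.findall(r"\w", sample)
non_word_chars = re.findall(r"\W", sample)

print(digits)          # ['1', '2']
print(non_digits)      # ['A', ' ', 'b', '_', '!']
print(whitespace)      # [' ']
print(non_whitespace)  # ['A', '1', 'b', '2', '_', '!']
print(word_chars)      # ['A', '1', 'b', '2', '_']
print(non_word_chars)  # [' ', '!']
```

Notice that each capital-letter result contains exactly the characters that the lowercase version left out.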
Quantifiers
Quantifier | Description | Example |
+ | One or more of preceding character | \d+ would match one or more digits |
* | Zero or more of preceding character | .* would match any character 0 or more times Note: Technically an empty string would fulfill this regex! |
? | Zero or one of the preceding character | a?.* would match a, any, hello, world |
{number} | Matches preceding character exactly number of times | \d{3} matches exactly three digits [0-9] |
{num1,num2} | Matches preceding character in a range of nums | \d{3,5} matches 3 to 5 digits that are [0-9] |
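The quantifiers in the table can be sketched in Python like this. The sample strings, including the colou?r pattern, are assumptions chosen for illustration, and re.findall is a function from Python’s re module that returns every match as a list.

```python
import re

# + means one or more, so single digits count too.
one_or_more = re.findall(r"\d+", "7 apples, 42 pears, 365 days")
print(one_or_more)  # ['7', '42', '365']

# * means zero or more, so even an empty string is a match.
empty_matches = re.search(r".*", "") is not None
print(empty_matches)  # True

# ? means zero or one: the u is optional here.
optional = re.findall(r"colou?r", "color colour")
print(optional)  # ['color', 'colour']

# {3} means exactly three; the leftover fourth 5 and the pair 55 do not qualify.
exactly_three = re.findall(r"\d{3}", "5555 55 555")
print(exactly_three)  # ['555', '555']

# {3,5} means three to five; the last number only gives up its first five digits.
three_to_five = re.findall(r"\d{3,5}", "12 123 12345 1234567")
print(three_to_five)  # ['123', '12345', '12345']
```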
Use the quantifiers, metacharacters, and other matchers as the building blocks for your regular expressions. The syntax mentioned here is similar across multiple languages that use regular expressions.
However, there are some things that are used in Ruby or JavaScript, for instance, that would not be transferable to Python.
Let’s learn a little more about how regular expressions work in Python in the next section.
How Do Regular Expressions Work in Python?
To use regular expressions in Python, import the re module at the top of your file:
import re
It’s that simple! There is nothing special you have to add or packages you need to download. It’s already built-in as long as you import it.
The fun comes from how to use the different methods that are available to us in the re module.
re.search(regex, str, flags=0)
Use the search function when you want to apply a regex pattern to a string to see if the pattern is contained in the string. This method will take a regular expression pattern and match it anywhere in the string.
If the regex pattern is NOT contained in the string, the function will return None. If there is a match, a Match object will be returned that will contain some information about it.
The Match object has a property and two methods that can be used to retrieve information about the match:
- match_obj.span(): a method that returns a tuple containing the start and end positions of the match (the end position is exclusive, just like a Python slice).
- match_obj.string: a property that returns the string passed into the function.
- match_obj.group(): a method that returns the part of the string where there was a match.
match_obj here will be replaced with the variable you assign to the result of your re.search() call. Here is an example using each method and property:
import re

string = "The quick brown fox jumped over the lazy dog"

# result is a Match object if the pattern is found; otherwise it is None
result = re.search(r"q.+k\s", string)

print(result.span(), "<== This is the tuple containing the span of indexes the result is in")
print(result.string, "<== This is the original string ")
print(result.group(), "<== This is the group of characters that match our regex pattern")
The main thing to remember here is that the Match object in Python has methods and properties.
We use the property string to access the original string we tested the regex against, the span method to access the indexes, and the group method to get the actual matched result.
There are other methods and properties that can be referenced in the Python docs, but these are good to get you started.
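As a small sketch of how the span relates to the matched text, the numbers returned by span() can be used directly as a slice of the original string. The start() and end() methods used here also belong to the Match object; the variable names are only illustrative.

```python
import re

string = "The quick brown fox jumped over the lazy dog"
result = re.search(r"q.+k\s", string)

# Always check for None before using the Match object.
if result:
    start, end = result.span()
    print(start, end)                    # 4 10
    print(repr(string[start:end]))       # 'quick '
    print(result.start(), result.end())  # 4 10
```

Because the end position works like the end of a slice, string[start:end] gives back exactly what group() returns.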
Optional arguments: flags
There is an optional argument you can use with the re module’s search function as well. To use it, set the flags parameter to the flag you would like to use, or to several flags combined with the bitwise | operator. The default is 0, which means no flags. Some of the more popular options include:
- re.DEBUG: This flag will display debug information about the compiled regex if needed.
- re.I, re.IGNORECASE: Case insensitive matching. This will ignore the case of the characters passed into sets or literal characters so that both capital and lowercase letters will match.
- re.M, re.MULTILINE: Multiline mode. Allows beginning of line and end of line regex metacharacters to be used in multiline strings. Otherwise it would just look at the beginning and the end of the string. Without the multiline flag, the regex engine considers the string one line.
- re.S, re.DOTALL: This flag tells the regex engine that the dot character will match any character, including a newline. The default behavior is for the dot character to match every character except a newline character.
- re.X, re.VERBOSE: The verbose flag allows you to add whitespace and comments to your regular expression to break down your expression and comment on its purpose. This may help immensely as you are learning regex in Python.
This code snippet takes the example from above and incorporates some of the flags listed above. When using more than one flag, use the bitwise | operator in between each as a separator.
import re

# "quick" starts the second line of this multiline string
string = """The
quick brown fox jumped over the
lazy dog
"""

result = re.search(r"""
    ^   # beginning of line
    Q   # literal character 'Q'
    .   # dot ==> any character
    +   # quantifier ==> one or more
    k   # literal character 'k'
    \s  # special character '\s'
    """, string, flags=re.IGNORECASE | re.M | re.VERBOSE)

if result:
    print(result.span(), "<== This is the tuple containing the span of indexes the result is in")
    print(result.string, "<== This is the original string ")
    print(result.group(), "<== This is the group of characters that match our regex pattern")
else:
    print(result)
Take the time to notice how the flags help us in testing our string. Try taking the flags parameter out. Does the string return a result? If not, what does it return as a result?
Remember that the search function in the re module takes in the regex, the string to be searched, and an optional flags parameter. A successful search returns a Match object, which is truthy and has properties and methods associated with it. An unsuccessful search returns None, which is falsy.
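To see what each flag changes on its own, here is a minimal sketch that runs the same searches with and without a flag. The two-line sample text is an assumption made for illustration.

```python
import re

text = "first line\nSecond line"

# Without IGNORECASE, "second" does not match "Second".
no_flag = re.search(r"second", text)
ignore_case = re.search(r"second", text, flags=re.I)
print(no_flag)              # None
print(ignore_case.group())  # Second

# Without MULTILINE, ^ only matches at the very start of the string.
single_line = re.search(r"^second", text, flags=re.I)
multi_line = re.search(r"^second", text, flags=re.I | re.M)
print(single_line)         # None
print(multi_line.group())  # Second

# Without DOTALL, the dot stops at the newline.
dot_default = re.search(r"first.+line", text)
dot_all = re.search(r"first.+line", text, flags=re.S)
print(repr(dot_default.group()))  # 'first line'
print(repr(dot_all.group()))      # 'first line\nSecond line'
```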
re.match(regex, str, flags=0)
Use the match method when you are looking to match a regex pattern to the beginning of the string. If you want to match the regex pattern anywhere in the string, use the search method above instead.
import re

string = """The
quick brown fox jumped over the
lazy dog
"""

# match() only looks at the very beginning of the string
print(re.match("quick", string, flags=re.IGNORECASE | re.MULTILINE))
# search() scans the whole string
print(re.search("quick", string, flags=re.IGNORECASE | re.MULTILINE))
When you run this code in a Python interpreter, you will see that the first call, match(), does not return a Match object, but instead None. The second, search(), looks inside the string and returns a Match object. This is because match() only looks at the beginning of a string, even if the re.MULTILINE flag is set.
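The difference can also be sketched on a plain one-line string. The sample text and variable names below are assumptions made for illustration.

```python
import re

text = "quick brown fox"

# match() succeeds only when the pattern sits at the very start of the string.
match_at_start = re.match(r"quick", text)
match_in_middle = re.match(r"brown", text)

# search() finds the pattern wherever it appears.
search_in_middle = re.search(r"brown", text)

# Anchoring search() with ^ makes it fail here, just like match().
anchored_search = re.search(r"^brown", text)

print(match_at_start is not None)    # True
print(match_in_middle is not None)   # False
print(search_in_middle is not None)  # True
print(anchored_search is not None)   # False
```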
re.sub(findRegex, replaceWith, str, count=0, flags=0)
The sub() function in the re module takes a regex, finds the leftmost match in a string, and replaces it with something else. It then repeats the same operation on the rest of the string, up to the number of times indicated in the count parameter. If there is no count parameter or it is set to zero, all occurrences will be replaced.
At the beginning of this article we talked about reformatting a phone number. Let’s take a look at how to do that here:
import re

phone = "555 555 5555"
# replace every whitespace character with a dash
correctFormat = re.sub(r"\s", "-", phone)
print(correctFormat)  # 555-555-5555
This takes our phone string, finds all occurrences of whitespace, and replaces each one with a dash. In this case, we end up with 555-555-5555.
Let’s try a more difficult reformat:
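import re

phone = "555 555 5555"
# add an opening parenthesis at the beginning of the line
left_parens = re.sub("^", "(", phone)
# replace only the first whitespace character with a closing parenthesis
right_parens = re.sub(r"\s", ")", left_parens, count=1)
# replace the remaining whitespace character with a dash
secondCorrectFormat = re.sub(r"\s", "-", right_parens)
print(secondCorrectFormat)  # (555)555-5555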
In this example, we go to the beginning of the line to add an opening parenthesis and assign that new string to a variable.
We then take that newly assigned variable (left_parens) and perform the same operation to find the next available whitespace character and replace it with a closing parenthesis. This is assigned to right_parens.
Finally, we take the right_parens variable and perform the same operation on the final whitespace character to replace it with a dash.
This will give us (555)555-5555.
To recap, the sub() function takes in a regex pattern, a replacement string or function, the actual string we want to perform the sub() on, and a count. If we don’t provide a count, it will perform the replacements on all occurrences. It returns the new string with the substitutions performed.
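Here is a short sketch of the count parameter and of a replacement function. The helper name double_number and the sample sentence are assumptions made for illustration. The last call goes slightly beyond what this article covers: it uses capture groups (the parentheses) and the \1, \2, \3 references to them, which are part of Python’s re syntax, to do the whole phone reformat in one pass.

```python
import re

phone = "555 555 5555"

# count=1 replaces only the first (leftmost) match.
first_only = re.sub(r"\s", "-", phone, count=1)
print(first_only)  # 555-555 5555


def double_number(match):
    """Hypothetical replacement function: receives a Match, returns a string."""
    return str(int(match.group()) * 2)


# The function is called once for every match of \d+.
doubled = re.sub(r"\d+", double_number, "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears

# Capture groups let the replacement reuse parts of the match.
one_pass = re.sub(r"(\d{3})\s(\d{3})\s(\d{4})", r"(\1)\2-\3", phone)
print(one_pass)  # (555)555-5555
```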
Conclusion
Regular expressions are a way to validate data or to search and replace characters in our strings. Regex consists of metacharacters, quantifiers, and literal characters that can be used to test our strings to see if they pass a validation test, or to search for matches and replace them.
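With those building blocks in hand, the expression from the top of this article can be read piece by piece. The sketch below rewrites it with the re.VERBOSE flag so each part carries a comment. The comments are one reading of the pattern, and the variable names are only illustrative.

```python
import re

email_regex = r"""
    ^                    # start of the string
    [A-Za-z0-9._%+-]+    # one or more letters, digits, or the symbols . _ % + -
    @                    # a literal @ symbol
    [A-Za-z0-9.-]+       # one or more letters, digits, dots, or dashes
    \.                   # a literal dot (escaped)
    [A-Za-z]{2,}         # two or more letters, such as io or com
    $                    # end of the string
"""

good = re.search(email_regex, "email@email.io", flags=re.VERBOSE)
bad = re.search(email_regex, "email@email", flags=re.VERBOSE)

print(good is not None)  # True
print(bad is not None)   # False
```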
Regex can be a little overwhelming at first, but once you get it, it’s a little bit like riding a bike. It’ll be in the back of your memory and super easy to pick up again.
When you feel you have a handle on what comes up in this article, take a look at the Python docs to see what else can be done with regular expressions. Definitely take a look at the compile and split methods.
Happy regexing!