Python regular expressions 101

(?:...) non-capturing group

>>> import re
>>> line = "this; that;so  and"
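>>> re.split(r'(?:;|,|\s)\s*', line)
['this', 'that', 'so', 'and']

(?:...) is a non-capturing group: it groups the alternatives without capturing the separator, so re.split leaves the separators out of the result. Since every alternative here is a single character, a character class gives the same result

>>> re.split(r'[;,\s]\s*', line)
['this', 'that', 'so', 'and']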
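
To see what the non-capturing group changes, here is a small comparison sketch (the variable names without_seps and with_seps are only for illustration). With a plain capturing group, re.split also returns each matched separator.

```python
import re

line = "this; that;so  and"

# Non-capturing group: the separators are dropped from the result.
without_seps = re.split(r'(?:;|,|\s)\s*', line)

# Capturing group: each captured separator is kept between the fields.
with_seps = re.split(r'(;|,|\s)\s*', line)

print(without_seps)  # ['this', 'that', 'so', 'and']
print(with_seps)     # ['this', ';', 'that', ';', 'so', ' ', 'and']
```

Only the text captured by the group is returned, which is why the trailing whitespace consumed by \s* does not appear.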

endswith and startswith

>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True

Both methods also accept a tuple of suffixes or prefixes, so we can test several at once

>>> filename.endswith(('.fits', '.tiff', '.tif'))
False
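
Note that the argument has to be a tuple: passing a list raises TypeError. A minimal sketch, using a hypothetical suffixes list, that shows the failure and the conversion with tuple():

```python
filename = 'spam.txt'
suffixes = ['.txt', '.csv']  # hypothetical list, e.g. read from a config

try:
    filename.endswith(suffixes)       # a list is rejected
    list_accepted = True
except TypeError:
    list_accepted = False

# Converting the list to a tuple makes the call valid.
tuple_result = filename.endswith(tuple(suffixes))

print(list_accepted, tuple_result)  # False True
```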

Use of any

>>> filenames = ['file1.py', 'file2.txt', 'file3.csv']
>>> any(name.endswith('.py') for name in filenames)
True

fnmatch and fnmatchcase

These functions match names against Unix shell-style wildcard patterns (*, ?, [0-9], ...)

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> names = ['file1.csv', 'file2.csv', 'file3.py', 'file4.pyc']
>>> [name for name in names if fnmatch(name, 'file*.py')]
['file3.py']
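
fnmatch follows the case rules of the operating system (it is case-insensitive on Windows, for instance), while fnmatchcase always compares case-sensitively. A small sketch with made-up file names:

```python
from fnmatch import fnmatchcase

names = ['file1.csv', 'file2.CSV', 'file3.py']  # made-up sample names

# fnmatchcase never normalizes case, whatever the platform.
upper_matches = fnmatchcase('file2.CSV', '*.csv')
csv_names = [name for name in names if fnmatchcase(name, '*.csv')]

print(upper_matches)  # False
print(csv_names)      # ['file1.csv']
```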

Matching and searching text patterns

For simple searches, str.find(), str.endswith() or str.startswith() are enough

re.match

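For more complicated patterns, use re.match, which tests whether the pattern matches at the beginning of the string

>>> import re
>>> re.match(r'\d+/\d+/\d+', '10/20/2010')
<re.Match object; span=(0, 10), match='10/20/2010'>

If we use the same pattern many times, it is better to precompile it once with re.compile and reuse the compiled object

>>> text = '10/20/2010'
>>> dataexp = re.compile(r'\d+/\d+/\d+')
>>> if dataexp.match(text):
        ...  # handle the match here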

But match only checks the beginning of the string. To get every occurrence anywhere in the text, we need to use findall

>>> text = 'Today is 12/10/2010. PyCon is 12/10/2011'
>>> datapad = re.compile(r'\d+/\d+/\d+')
>>> datapad.findall(text)
['12/10/2010', '12/10/2011']
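
The difference between match, search and finditer can be sketched as follows. The name datepat is only an illustrative choice; search and finditer are standard methods of compiled patterns.

```python
import re

text = 'Today is 12/10/2010. PyCon is 12/10/2011'
datepat = re.compile(r'\d+/\d+/\d+')

# match only looks at the start of the string, so it finds nothing here.
at_start = datepat.match(text)

# search returns the first occurrence anywhere in the string.
first = datepat.search(text)

# finditer yields one match object per occurrence, with positions.
spans = [(m.group(), m.start()) for m in datepat.finditer(text)]

print(at_start)       # None
print(first.group())  # 12/10/2010
print(spans)          # [('12/10/2010', 9), ('12/10/2011', 30)]
```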

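We can also use capture groups to extract parts of the match. group(0) is the whole match, and the groups are numbered from 1

>>> datapad = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> m = datapad.match('10/20/2011')
>>> m.group(0)
'10/20/2011'
>>> m.group(1)
'10'
>>> m.group(2)
'20'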
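
Groups can also be given names, which is often easier to read than numbers. In this sketch the names month, day and year are my own assumption about what each field means:

```python
import re

# Same pattern as before, with (?P<name>...) named groups.
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')
m = datepat.match('10/20/2011')

all_groups = m.groups()   # every captured group as a tuple
year = m.group('year')    # access a group by name
fields = m.groupdict()    # mapping of name -> captured text

print(all_groups)  # ('10', '20', '2011')
print(year)        # 2011
print(fields)      # {'month': '10', 'day': '20', 'year': '2011'}
```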

Replace string

To replace text, use sub

>>> import re
>>> text = "Today is 10/20/2010. Pycon is 10/20/2020"
>>> datapad = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> datapad.sub(r'\3-\1-\2', text)
'Today is 2010-10-20. Pycon is 2020-10-20'

\3, \1 and \2 refer to the captured groups of the pattern

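It’s also possible to pass our own callback function instead of a replacement string. It receives the match object and returns the replacement text

>>> from calendar import month_abbr
>>> def change_date(m):
        mon_date = month_abbr[int(m.group(1))]
        return f'{m.group(2)}-{mon_date}-{m.group(3)}'
>>> datapad.sub(change_date, text)
'Today is 20-Oct-2010. Pycon is 20-Oct-2020'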
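
If we also want to know how many substitutions were made, subn returns both the new text and the count. A short sketch (the variable names are illustrative):

```python
import re

text = "Today is 10/20/2010. Pycon is 10/20/2020"
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# subn works like sub but returns a (new_text, count) tuple.
new_text, count = datepat.subn(r'\3-\1-\2', text)

print(new_text)  # Today is 2010-10-20. Pycon is 2020-10-20
print(count)     # 2
```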

Here is another cool example, where our own function makes the replacement follow the case of the text it replaces

>>> import re
>>> text = "UPPER PYTHON, lower python, MiXeD PyThOn"
>>> def matchcase(word):
        def replace(m):
            text = m.group()
            if text.isupper():
                return word.upper()
            elif text.islower():
                return word.lower()
            elif text[0].isupper():
                return word.capitalize()
            else:
                return word
        return replace
>>> re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
'UPPER SNAKE, lower snake, MiXeD Snake'

Non-greedy search

By default, the *, + and ? quantifiers are greedy, whether we use match, findall or sub: they grab as much text as possible while still letting the pattern match. Here is the problem

>>> str_pattern = re.compile(r'"(.*)"')
>>> text1 = 'computer says "no"'
>>> str_pattern.findall(text1)
['no']
>>> text2 = 'computer says "no", phone says "yes"'
>>> str_pattern.findall(text2)
['no", phone says "yes']
# Problem: the match runs from the first quote to the last one

To make the search non-greedy, so that each match stops at the first place where the pattern can end, we need to add ? after * or +

>>> str_pattern = re.compile(r'"(.*?)"')
>>> text1 = 'computer says "no"'
>>> str_pattern.findall(text1)
['no']
>>> text2 = 'computer says "no", phone says "yes"'
>>> str_pattern.findall(text2)
['no', 'yes']
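
As an alternative to the non-greedy quantifier, a negated character class can do the same job for quoted text. This is a swapped-in technique, not something from the example above. One practical difference is that . does not match a newline by default, while [^"] does. The sample strings are made up.

```python
import re

text2 = 'computer says "no", phone says "yes"'
text3 = 'computer says "multi\nline"'  # quoted text spanning two lines

lazy = re.compile(r'"(.*?)"')        # non-greedy quantifier
negated = re.compile(r'"([^"]*)"')   # anything except a double quote

print(lazy.findall(text2))     # ['no', 'yes']
print(negated.findall(text2))  # ['no', 'yes']

# '.' stops at the newline, so the lazy pattern finds nothing here.
print(lazy.findall(text3))     # []
print(negated.findall(text3))  # ['multi\nline']
```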

Combining/concatenating strings

str.join accepts any iterable of strings, so it also works with a generator function

>>> def sample():
        yield 'Is'
        yield 'Chicago'
        yield 'not'
        yield 'not chicago'
>>> ", ".join(sample())
'Is, Chicago, not, not chicago'
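
join only accepts strings, so other values have to be converted first. A minimal sketch with made-up sample data, using a generator expression so that no temporary list is built:

```python
data = ['ACME', 50, 91.1]  # made-up mixed-type sample data

# str() converts each item on the fly while join consumes the generator.
joined = ','.join(str(item) for item in data)

print(joined)  # ACME,50,91.1
```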

Interpolating variables and strings

Use of format_map

>>> s = '{name} has {n} messages'
>>> s.format(name='jean', n=10)
'jean has 10 messages'
>>> name = 'jean'
>>> n = 33
>>> s.format_map(vars())
'jean has 33 messages'
>>> vars()
{'__name__': '__main__', 'name': 'jean', 'n': 33, ....}

vars() also works with instances, where it returns the instance attributes

>>> class Info:
        def __init__(self, name, n):
            self.name = name
            self.n = n
>>> a = Info('jean', 10)
>>> s.format_map(vars(a))
'jean has 10 messages'

To read the local variables of a calling function, we can use sys._getframe(n).f_locals, where n is the number of frames to go up the call stack (0 is the current function, 1 is its caller).

You can find an example in action in my python_101 repository.
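
As a minimal sketch of the idea (not the code from the repository), the hypothetical helper show below formats a template with the local variables of whoever called it. Keep in mind that sys._getframe is a CPython implementation detail.

```python
import sys

def show(template):
    # Frame 1 is the caller of show(); f_locals holds its local variables.
    caller_locals = sys._getframe(1).f_locals
    return template.format_map(caller_locals)

def report():
    # These locals are visible to show() through the frame object.
    name = 'jean'
    n = 5
    return show('{name} has {n} messages')

print(report())  # jean has 5 messages
```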

If a key is missing, format_map raises KeyError. To leave unknown placeholders untouched instead, we can subclass dict and define the __missing__ method

>>> class safesub(dict):
        def __missing__(self, key):
            return '{' + key + '}'
>>> import sys
>>> def sub(text):
        return text.format_map(safesub(sys._getframe(1).f_locals))
>>> name = 'jean'
>>> print(sub('Your {name} is with {color}'))
Your jean is with {color}