simplere¶
A simplified interface to Python’s regular expression (re
)
string search. Eliminates steps and provides
simpler access to results. As a bonus, also provides compatible way to
access Unix glob searches.
Usage¶
Python regular expressions are powerful, but the language’s lack of an en passant (in passing) assignment requires a preparatory motion and then a test:
import re
match = re.search(pattern, some_string)
if match:
print match.group(1)
With simplere
, you can do it in fewer steps:
from simplere import *
if match / re.search(pattern, some_string):
print match[1]
In addition to its own classes, from simplere import *
imports both
the standard re
module and the match
object so you don’t have to.
Motivation¶
In the simple examples above, “fewer steps” seems like a small savings (3 lines to 2). While a 33% savings is a pretty good optimization, is it really worth using another module and a quirky en passant operator to get it?
In code this simple, maybe not. But real regex-based searching tends
to have multiple, cascading searches, and to be tightly interwoven
with complex pre-conditions, error-checking, and post-match formatting
or actions. It gets complicated fast. When multiple re
matches
must be done, it consumes a lot of “vertical space” and often
threatens to push the number of lines a programmer is viewing at
any given moment beyond the number that can be easily held in working
memory. In that case, it proves valuable to condense what is logically
a single operation (“regular expression test”) into a single line
with its conditional if
.
This is even more true for the “exploratory” phases of development, before a program’s appropriate structure and best logical boundaries have been established. One can always “back out” the condensing en passant operation in later production code, if desired.
Re Objects¶
Re
objects are memoized
for efficiency, so they compile their pattern just once, regardless of how
many times they’re mentioned in a program.
Note that the in
test turns the sense of the matching around (compared
to the standard re
module). It asks “is the given string in the set of
items this pattern describes?” To be fancy, the Re
pattern is an
intensionally defined set (namely “all strings matching the pattern”). This
order often makes excellent sense whey you have a clear intent for the test.
For example, “is the given string within the set of all legitimate
commands?”
Second, the in
test had the side effect of setting the underscore name
_
to the result. Python doesn’t support en passant
assignment–apparently, no matter how hard you try, or how much
introspection you use. This makes it harder to both test and collect results
in the same motion, even though that’s often exactly appropriate. Collecting
them in a class variable is a fallback strategy (see the En Passant
section below for a slicker one).
If you prefer the more traditional re
calls:
if Re(pattern).search(some_string):
print Re._[1]
Re
works even better with named pattern components, which are exposed
as attributes of the returned object:
person = 'John Smith 48'
if person in Re(r'(?P<name>[\w\s]*)\s+(?P<age>\d+)'):
print Re._.name, "is", Re._.age, "years old"
else:
print "don't understand '{}'".format(person)
One trick being used here is that the returned object is not a pure
_sre.SRE_Match
that Python’s re
module returns. Nor is it a subclass.
(That class appears to be unsubclassable.)
Thus, regular expression matches return a proxy object that
exposes the match object’s numeric (positional) and
named groups through indices and attributes. If a named group has the same
name as a match object method or property, it takes precedence. Either
change the name of the match group or access the underlying property thus:
x._match.property
It’s possible also to loop over the results:
for found in Re('pattern (\w+)').finditer('pattern is as pattern does'):
print found[1]
Or collect them all in one fell swoop:
found = Re('pattern (\w+)').findall('pattern is as pattern does')
Pretty much all of the methods and properties one can access from the standard
re
module are available.
Under the Covers¶
ReMatch
objects
wrap Python’s native``_sre.SRE_Match`` objects (the things that re
method calls return).:
match = re.match(r'(?P<word>th.s)', 'this is a string')
match = ReMatch(match)
if match:
print match.group(1) # still works
print match[1] # same thing
print match.word # same thing, with logical name
But that’s a lot of boilerplate for a simple test, right? So simplere
en passant operator redefining the division operation and proxies the re
result
on the fly to the pre-defined match
object:
if match / re.search(r'(?P<word>th.s)', 'this is a string'):
assert match[1] == 'this'
assert match.word == 'this'
assert match.group(1) == 'this'
If the re
operation fails, the resulting object is guaranteed to have
a False
-like Boolean value, so that it will fall through conditional tests.
Options and Alternatives¶
If you prefer the look of the less-than (<
) or less-than-or-equal
(<=
), as indicators that match
takes the value of the
following function call, they are experimentally supported as aliases
of the division operation (/
). You may define your own match
objects, and can use them on memoized Re
objects too. Putting
a few of these optional things together:
answer = Match() # need to do this just once
if answer < Re(r'(?P<word>th..)').search('and that goes there'):
assert answer.word == 'that'
Globs¶
Regular expressions are wonderfully powerful, but sometimes the simpler Unix glob is works just fine. As a bonus,
simplere
also provides simple glob access.:
if 'globtastic' in Glob('glob*'):
print "Yes! It is!"
else:
raise ValueError('YES IT IS')
If you want to search or test
against multiple patterns at once, Glob
objects take
a variable number of patterns. A match is defined as any of the
patterns matching.:
img_formats = Glob("*.png", "*.jpeg", "*.jpg", "*.gif")
if filename.lower() in img_formats:
... further processing ...
Alternatively, you can splat an existing list into the Glob
constructor with Python’s unary star syntax:
img_formats = "*.png *.jpeg *.jpg *.gif".split()
if filename.lower() in Glob(*img_formats):
... further processing ...
Case-insensitive glob searches are also available:
bg = InsensitiveGlob('b*')
if 'bubba' in bg:
assert 'Bubba' in bg
Globs have their own syntax for case insensitive characters,
but it can be a pain to use. It may be easier to use the
InsensitiveGlob
subclass. Or even alias the case-insensitive
version as the main one:
from simplere import InsensitiveGlob as Glob
Note
Case folding / case-insensitive searches work well in the ASCII range, but Unicode characters and case folding is more intricate. Basic folding is provided out of the box. It’s quite adequate for mapping against common filename patterns, for example. Those needing more extensive Unicode case folding should consider normalizing strings, as described here. As the tests show, basic Unicode folding works fine everywhere. Using Unicode in glob patterns (not just strings to be matched) works only on Python 3.3 or above.
Notes¶
- Automated multi-version testing managed with pytest and tox. Continuous integration testing with Travis-CI. Packaging linting with pyroma.
- Version 1.2.9 updates testing for early 2017 Python versions. Successfully packaged for, and tested against, all late-model versions of Python: 2.6, 2.7, 3.3, 3.4, 3.5, and 3.6, as well as PyPy 5.6.0 (based on 2.7.12) and PyPy3 5.5.0 (based on 3.3.5).
- The author, Jonathan Eunice or @jeunice on Twitter welcomes your comments and suggestions.
Installation¶
To install or upgrade to the latest version:
pip install -U simplere
To install under a specific Python version (3.3 in this example):
python3.3 -m pip -U simplere
You may need to prefix these with sudo
to authorize installation. In
environments without super-user privileges, you may want to use pip
‘s
--user
option, to install only for a single user, rather than
system-wide. If you use the standalone pip
programs, you may also
need to use pip2
or pip3
version-dependent variants, depending
on your system configuration.
Testing¶
To run the module tests, use one of these commands:
tox # normal run - speed optimized
tox -e py27 # run for a specific version only (e.g. py27, py34)
tox -c toxcov.ini # run full coverage tests
Change Log¶
1.2.12 (May 31, 2017)
Updated Python 2 / 3 compatibility strategy to be Python 3 centric. Should be more future-proofed.
Updated testing.
1.2.11 (February 17, 2017)
Updates testing. Again. Python 3.2 support again dropped, for the last time, given failure on Travis CI. It’s old, anyway. Time to upgrade!
1.2.10 (February 17, 2017)
Updates testing. Python 3.2 support re-established, given still supported on Travis CI. Docs tweaked.
1.2.9 (January 23, 2017)
Updates testing. Newly qualified under 2.7.13 and 3.6, as well as most recent builds of pypy and pypy3. Python 3.2 support withdrawn given obsolescence.
1.2.8 (August 26, 2015)
Substantial documentation reorg.
1.2.7 (August 23, 2015)
Starts automated measurement of test branch coverage. Inital runs show 100% branch coverage. Hooah!
1.2.6 (August 20, 2015)
Bumped to 100% test coverage.
1.2.5 (August 19, 2015)
Added automated measurement of test coverage. Line coverage started at 92%. Bumped to 97%.
1.2.0 (August 14, 2015)
Realized imports were overly restrictive, requiring clients of module to needlessly (and contra docs) manually importre
and construct thematch
object. Fixed. Bumped minor version number to reflect de facto API change.
1.1.1 (August 14, 2015)
Simplifiedsetup.py
and packaging. Tweaked docs.
1.1.0
Adds multi-pattern and case insensitive Glob subclass. Added wheel packaging. Rearranged and extended testing structure. Updated setup and docs.
1.0.10
Addedbdist_wheel
package support. Extended testing matrix to 3.5 pre-release builds. Switched to Apache License.
1.0.5
In several dot-releases, have added support for Travis-CI cloud- based continuous integration testing, Sphinx-based documentation, and readthedocs.org hosted documentation. The Travis bit has required a separate Github repository be created. It is managed out of the same development directory, overlaying the existing Mercurial / Bitbucket repo. So far, that has caused no problems.
Documentation somewhat improved.
1.0.0
Cleaned up source for better PEP8 conformance
Bumped version number to 1.0 as part of move to semantic versioning, or at least enough of it so as to not screw up Python installation procedures (which don’t seem to understand 0.401 is a lesser version that 0.5, because 401 > 5)