Engarde!¶
Engarde is a package for defensive data analysis. Engarde supports python 2.7+ and python 3.4+.
Why?¶
The raison d’être for engarde is the fact of life that data are messy. To do our analysis, we often rely on certain assumptions about our data that should be invariant across updates to the dataset. Engarde is a lightweight way to explicitly state your assumptions and check that they’re actually true.
@is_shape((-1, 10))
@is_monotonic(strict=True)
@none_missing()
def compute(df):
    # complex operations to determine result
    ...
    return result
We state our assumptions as decorators, and each one is verified against the function’s return value.
engarde is similar in spirit to the R library assertr.
Usage¶
There are two main ways to use engarde, depending on whether you’re working interactively or not.
For interactive use, I’d suggest using DataFrame.pipe to run the check.
For non-interactive use, each of the checks is wrapped into a decorator. You can decorate the functions that make up your ETL pipeline with the checks that should hold true at that stage in the pipeline.
Check out Example to see engarde in action.
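For the interactive case, the pattern looks roughly like the sketch below. It does not import engarde: none_missing here is a hand-written stand-in that follows the documented contract of a check (take a DataFrame, assert, return the same DataFrame), and the column names are made up for illustration.

```python
import pandas as pd


def none_missing(df, columns=None):
    """Stand-in for a check: assert, then hand back the same DataFrame."""
    cols = df.columns if columns is None else columns
    assert not df[cols].isnull().any().any(), "missing values found"
    return df


df = pd.DataFrame({"price": [2400, 3200, 4000], "time": [150, 130, 115]})

# Because the check returns its input, it slots into a method chain.
result = (
    df.pipe(none_missing)
      .assign(price_per_minute=lambda d: d.price / d.time)
      .pipe(none_missing, columns=["price_per_minute"])
)
```

Since each step returns a DataFrame, a check can be dropped in or removed without touching the surrounding transformations.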
Installation and Dependencies¶
Engarde itself is pure python, so I’d just use pip:
pip install engarde
It does depend on pandas, which may be more difficult to pip install. You might consider conda if you have trouble installing pandas and its dependencies. Once you have the dependencies sorted out, a pip install engarde should work.
If you’re using conda, engarde is available in the conda-forge channel:
conda install -c conda-forge engarde
Example¶
Engarde really shines when you have a dataset that regularly receives updates. We’ll work with a data set of customer preferences on trains. This is a static dataset and isn’t being updated, but you could imagine that each month the Dutch authorities upload a new month’s worth of data.
We can start by making some very basic assertions: that the dataset is the correct shape, and that a few columns have the correct dtypes. Assertions are made as decorators on functions that return a DataFrame.
In [1]: import pandas as pd
In [2]: import engarde.decorators as ed
In [3]: pd.set_option('display.max_rows', 10)
In [4]: dtypes = dict(
   ...:     price1=int,
   ...:     price2=int,
   ...:     time1=int,
   ...:     time2=int,
   ...:     change1=int,
   ...:     change2=int,
   ...:     comfort1=int,
   ...:     comfort2=int
   ...: )
In [5]: @ed.is_shape((None, 11))
   ...: @ed.has_dtypes(items=dtypes)
   ...: def unload():
   ...:     trains = pd.read_csv("data/trains.csv", index_col=0)
   ...:     return trains
In [6]: unload()
Out[6]:
id choiceid choice price1 time1 change1 comfort1 price2 time2 \
1 1 1 choice1 2400 150 0 1 4000 150
2 1 2 choice1 2400 150 0 1 3200 130
3 1 3 choice1 2400 115 0 1 4000 115
4 1 4 choice2 4000 130 0 1 3200 150
5 1 5 choice2 2400 150 0 1 3200 150
.. .. ... ... ... ... ... ... ... ...
347 30 7 choice1 2100 135 1 1 2800 135
348 30 8 choice1 2100 125 1 1 3500 125
349 30 9 choice1 2100 150 0 0 2800 125
350 30 10 choice1 2800 125 0 1 2800 135
351 30 11 choice2 3500 125 1 0 2800 135
change2 comfort2
1 0 1
2 0 1
3 0 0
4 0 0
5 0 0
.. ... ...
347 1 0
348 1 0
349 0 1
350 1 0
351 1 0
[351 rows x 11 columns]
One very important part of the design of Engarde is that your code, the code actually doing the work, shouldn’t have to change. I don’t want a bunch of asserts cluttering up the logic of what’s happening. This is a perfect case for decorators.
The order of execution here is: unload returns the DataFrame, trains. Next, ed.has_dtypes asserts that trains has the correct dtypes, as specified with dtypes. Once that assert passes, has_dtypes passes trains along to the next check, and so on, until the original caller gets back trains.
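The ordering can be seen in a small pure-Python sketch. The check factory below is hypothetical and only records names instead of asserting anything; it mirrors the wrapper shape visible in the traceback later in this section (call the wrapped function, then run the check, then return the result).

```python
from functools import wraps

calls = []  # records the order in which things run


def check(name):
    """Hypothetical decorator factory: run the function, log, pass the result on."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)  # the inner function (or inner check) runs first
            calls.append(name)              # then this check would make its assertion
            return result                   # and the result is handed outward
        return wrapper
    return decorate


@check("is_shape")
@check("has_dtypes")
def unload():
    calls.append("unload")
    return "trains"


value = unload()
print(calls)  # ['unload', 'has_dtypes', 'is_shape']
```

The decorator closest to the function runs its check first, so the checks fire bottom to top.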
Each row of this dataset contains a passenger’s preference over two routes. Each route has an associated cost, travel time, comfort level, and number of changes. Like any good economist, we’ll assume people are rational: their first choice is surely going to be better in at least one way than their second choice (faster, more comfortable, ...). This is fundamental to our analysis later on, so we’ll explicitly state it in our code, and check it in our data.
In [7]: def rational(df):
   ...:     """
   ...:     Check that at least one criterion is better.
   ...:     """
   ...:     r = ((df.price1 < df.price2) | (df.time1 < df.time2) |
   ...:          (df.change1 < df.change2) | (df.comfort1 > df.comfort2))
   ...:     return r
   ...:
In [8]: @ed.is_shape((None, 11))
   ...: @ed.has_dtypes(items=dtypes)
   ...: @ed.verify_all(rational)
   ...: def unload():
   ...:     trains = pd.read_csv("data/trains.csv", index_col=0)
   ...:     return trains
   ...:
In [9]: df = unload()
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-9-b108f050ce4e> in <module>()
----> 1 df = unload()
/Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*args, **kwargs)
22 @wraps(func)
23 def wrapper(*args, **kwargs):
---> 24 result = func(*args, **kwargs)
25 ck.is_shape(result, shape)
26 return result
/Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*args, **kwargs)
115 @wraps(func)
116 def wrapper(*args, **kwargs):
--> 117 result = func(*args, **kwargs)
118 ck.has_dtypes(result, items)
119 return result
/Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*operation_args, **operation_kwargs)
147 def wrapper(*operation_args, **operation_kwargs):
148 result = operation_func(*operation_args, **operation_kwargs)
--> 149 vfunc(result, func, *args, **kwargs)
150 return result
151 return wrapper
/Users/tom.augspurger/sandbox/engarde/engarde/generic.py in verify_all(df, check, *args, **kwargs)
40 result = check(df, *args, **kwargs)
41 try:
---> 42 assert np.all(result)
43 except AssertionError as e:
44 msg = "{} not true for all".format(check.__name__)
AssertionError: ('rational not true for all', id choiceid choice price1 time1 change1 comfort1 price2 time2 \
13 2 3 choice2 2450 121 0 0 2450 93
18 2 8 choice2 2975 108 0 0 2450 108
27 3 6 choice2 1920 106 0 0 1440 96
28 3 7 choice1 1920 106 0 0 1920 96
33 4 1 choice2 545 105 1 1 545 85
.. .. ... ... ... ... ... ... ... ...
306 27 9 choice1 3920 140 1 1 3920 125
319 28 8 choice2 2450 133 1 1 2450 108
325 28 14 choice2 2450 123 0 1 2450 108
328 28 17 choice2 2815 108 0 1 2450 108
330 29 2 choice2 2800 140 2 0 2800 120
change2 comfort2
13 0 1
18 0 1
27 0 1
28 0 1
33 1 1
.. ... ...
306 0 2
319 0 2
325 0 2
328 0 2
330 0 1
[42 rows x 11 columns])
So our check failed; apparently people aren’t rational...
Engarde has printed the name of the failed assertion and the rows for which it is False.
We’re simply reusing pandas’ printing machinery, so set pd.options.display.max_rows to display more or fewer rows.
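To inspect failing rows yourself, a boolean mask does the job. The sketch below uses a toy three-row frame and a simplified two-criterion version of rational (price and time only); both are assumptions made for illustration. pd.option_context limits the printed rows temporarily instead of changing the global option.

```python
import pandas as pd

# Toy data; not the real trains dataset.
df = pd.DataFrame({
    "price1": [2400, 2450, 1920],
    "price2": [4000, 2450, 1440],
    "time1": [150, 121, 106],
    "time2": [150, 93, 96],
})

# Simplified check: the first choice is cheaper or faster.
ok = (df.price1 < df.price2) | (df.time1 < df.time2)
failing = df[~ok]  # rows where the check is False

# Cap the printed rows only inside this block; the option is restored on exit.
with pd.option_context("display.max_rows", 4):
    print(failing)
```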
We’ll fix this problem by ignoring those people (why change your mind when you can change the data?).
In [16]: @ed.verify_all(rational)
   ....: def drop_silly_people(df):
   ....:     r = df.query("price1 < price2 | time1 < time2 |"
   ....:                  "change1 < change2 | comfort1 > comfort2")
   ....:     return r
   ....:
In [17]: @ed.is_shape((None, 11))
   ....: @ed.has_dtypes(items=dtypes)
   ....: def unload():
   ....:     trains = pd.read_csv("data/trains.csv", index_col=0)
   ....:     return trains
In [18]: df = unload().pipe(drop_silly_people)
In [19]: df.head()
Out[19]:
id choiceid choice price1 time1 change1 comfort1 price2 time2 \
1 1 1 choice1 2400 150 0 1 4000 150
2 1 2 choice1 2400 150 0 1 3200 130
3 1 3 choice1 2400 115 0 1 4000 115
4 1 4 choice2 4000 130 0 1 3200 150
5 1 5 choice2 2400 150 0 1 3200 150
change2 comfort2
1 0 1
2 0 1
3 0 0
4 0 0
5 0 0
All of our assertions have “passed” now, so we’re happy and our analysis can proceed.
Design¶
It’s important that engarde not get in your way. Your task is hard enough without a bunch of assertions cluttering up the logic of the code. And yet, it does help to explicitly state the assumptions fundamental to your analysis. Decorators provide a nice compromise.
Checks¶
Each check takes a DataFrame and any arguments necessary for the check, asserts the truth of the check, and returns the original DataFrame. If the assertion fails, an AssertionError is raised and engarde tries to print out some useful information about where the failure occurred.
The exceptions to the above rule are the generic assertions verify, verify_all, and verify_any. These take an additional argument, assertion_func (called check in the API signatures below), a function taking a DataFrame and returning some kind of booleans. You can think of any of the built-in checks, like none_missing, as special cases of the generic verify functions where assertion_func has been fixed.
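The "special case" relationship can be sketched as follows. This is not engarde's source: verify_all is a simplified re-implementation, and no_nulls is a hypothetical assertion function introduced for the example.

```python
import numpy as np
import pandas as pd


def verify_all(df, check, *args, **kwargs):
    """Sketch of a generic check: every entry of check(df, ...) must be true."""
    result = check(df, *args, **kwargs)
    assert np.all(np.asarray(result)), "{} not true for all".format(check.__name__)
    return df


def no_nulls(df):
    """Hypothetical assertion function: True wherever a value is present."""
    return df.notnull()


def none_missing(df):
    """A 'built-in' check written as verify_all with the assertion function fixed."""
    return verify_all(df, no_nulls)


df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
checked = none_missing(df)  # passes, and returns the same DataFrame
```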
Decorators¶
Each check has an associated decorator. The decorator simply marshals arguments, allowing you to make your assertions outside the actual logic of your code. Personally, this is the most compelling use case for engarde. You have a data source that pushes updates to a dataset. The updates are (or should be) similarly shaped. Perhaps you have some automated reporting derived from the dataset, and you wish to fail early if a crucial assumption is violated.
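A decorator of this kind could be written roughly as below. The names within_range_check and load, and the comfort column, are assumptions for the example; the wrapper follows the same call, check, return shape seen in the traceback in the Example section, but it is a sketch and not engarde's own code.

```python
from functools import wraps

import pandas as pd


def within_range_check(df, items):
    """Stand-in check: each column in items must lie inside its (low, high) pair."""
    for col, (low, high) in items.items():
        assert df[col].between(low, high).all(), "{} out of range".format(col)
    return df


def within_range(items):
    """Decorator form: capture the arguments now, run the check on the result later."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            within_range_check(result, items)
            return result
        return wrapper
    return decorate


@within_range({"comfort": (0, 2)})
def load():
    # The body stays free of assertions.
    return pd.DataFrame({"comfort": [0, 1, 2, 1]})


df = load()  # raises AssertionError early if a value falls outside 0..2
```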
API¶
checks¶
This file contains the functions doing the actual asserts. You can potentially use this file during interactive sessions, probably via the pipe method.
checks.py
Each function in here should
- Take a DataFrame as its first argument, maybe optional arguments
- Make its assert on the result
- Return the original DataFrame
engarde.checks.is_monotonic(df, items=None, increasing=None, strict=False)
Asserts that the DataFrame is monotonic.
Parameters:
    df : Series or DataFrame
    items : dict
        mapping columns to conditions (increasing, strict)
    increasing : None or bool
        None is either increasing or decreasing.
    strict : whether the comparison should be strict
Returns:
    df : DataFrame
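One reading of the increasing and strict parameters, for a single Series, is sketched below. The function name is hypothetical and the logic is an interpretation of the parameter descriptions above, not engarde's implementation.

```python
import pandas as pd


def is_monotonic_sketch(s, increasing=None, strict=False):
    """Return True if the Series is monotonic under the given options."""
    diffs = s.diff().dropna()  # differences between consecutive values
    up = (diffs > 0).all() if strict else (diffs >= 0).all()
    down = (diffs < 0).all() if strict else (diffs <= 0).all()
    if increasing is None:
        return bool(up or down)  # either direction is acceptable
    return bool(up if increasing else down)


flat_step = pd.Series([1, 2, 2, 3])
loose = is_monotonic_sketch(flat_step)                # True: never decreases
tight = is_monotonic_sketch(flat_step, strict=True)   # False: the repeated 2 breaks strictness
```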
engarde.checks.is_same_as(df, df_to_compare, **kwargs)
Assert that two pandas DataFrames are equal.
Parameters:
    df : pandas DataFrame
    df_to_compare : pandas DataFrame
    **kwargs : dict
        keyword arguments passed through to pandas’ assert_frame_equal
Returns:
    df : DataFrame
engarde.checks.is_shape(df, shape)
Asserts that the DataFrame is of a known shape.
Parameters:
    df : DataFrame
    shape : tuple
        (n_rows, n_columns). Use None or -1 if you don’t care about a dimension.
Returns:
    df : DataFrame
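The wildcard rule for shape can be illustrated with plain tuples. shape_matches is a hypothetical helper written for this example; it only demonstrates the "None or -1 means any size" comparison and is not taken from engarde.

```python
def shape_matches(actual, expected):
    """Compare two shape tuples, treating None or -1 in `expected` as 'any size'."""
    if len(actual) != len(expected):
        return False
    return all(e is None or e == -1 or a == e for a, e in zip(actual, expected))


# (351, 11) is the shape of the trains data shown in the Example section.
any_rows = shape_matches((351, 11), (None, 11))    # True: row count is ignored
wrong_cols = shape_matches((351, 11), (-1, 10))    # False: 11 columns, not 10
```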
engarde.checks.none_missing(df, columns=None)
Asserts that there are no missing values (NaNs) in the DataFrame.
Parameters:
    df : DataFrame
    columns : list
        list of columns to restrict the check to
Returns:
    df : DataFrame
        same as the original
engarde.checks.unique_index(df)
Assert that the index is unique.
Parameters:
    df : DataFrame
Returns:
    df : DataFrame
engarde.checks.within_n_std(df, n=3)
Assert that every value is within n standard deviations of its column’s mean.
Parameters:
    df : DataFrame
    n : int
        number of standard deviations from the mean
Returns:
    df : DataFrame
engarde.checks.within_range(df, items=None)
Assert that a DataFrame is within a range.
Parameters:
    df : DataFrame
    items : dict
        mapping of columns (k) to a (low, high) tuple (v) that df[k] is expected to be between.
Returns:
    df : DataFrame
engarde.checks.within_set(df, items=None)
Assert that the values in df are a subset of items.
Parameters:
    df : DataFrame
    items : dict
        mapping of columns (k) to array-like of values (v) that df[k] is expected to be a subset of
Returns:
    df : DataFrame
engarde.checks.has_dtypes(df, items)
Assert that a DataFrame has the dtypes given in items.
Parameters:
    df : DataFrame
    items : dict
        mapping of columns to dtype.
Returns:
    df : DataFrame
engarde.checks.verify(df, check, *args, **kwargs)
Generic verify. Assert that check(df, *args, **kwargs) is true.
Parameters:
    df : DataFrame
    check : function
        Should take DataFrame and **kwargs. Returns bool
Returns:
    df : DataFrame
        same as the input.
engarde.checks.verify_all(df, check, *args, **kwargs)
Verify that all the entries in check(df, *args, **kwargs) are true.
engarde.checks.verify_any(df, check, *args, **kwargs)
Verify that any of the entries in check(df, *args, **kwargs) is true.
decorators¶
engarde.decorators.none_missing(columns=None)
Asserts that no missing values (NaN) are found.
engarde.decorators.within_range(items)
Check that a DataFrame’s values are within a range.
Parameters:
    items : dict or array-like
        A dict maps columns to (lower, upper); an array-like checks the same (lower, upper) for each column.
engarde.decorators.within_set(items)
Check that DataFrame values are within set.
>>> @within_set({'A': {1, 3}})
... def f(df):
...     return df
engarde.decorators.has_dtypes(items)
Tests that the dtypes are as specified in items.
engarde.decorators.verify(func, *args, **kwargs)
Assert that func(df, *args, **kwargs) is true.
engarde.decorators.verify_all(func, *args, **kwargs)
Assert that all of func(df, *args, **kwargs) are true.
engarde.decorators.verify_any(func, *args, **kwargs)
Assert that any of func(df, *args, **kwargs) is true.
engarde.decorators.within_n_std(n=3)
Tests that all values are within n standard deviations of their mean (3 by default).
This file provides a nice API for each of the checks, designed to fit seamlessly into an ETL pipeline. Each of the functions defined here can be applied to a function that returns a DataFrame.