Engarde!

Engarde is a package for defensive data analysis. It supports Python 2.7+ and Python 3.4+.

Why?

The raison d’être for engarde is the fact of life that data are messy. To do our analysis, we often rely on certain assumptions about our data that should be invariant across updates to the dataset. Engarde is a lightweight way to explicitly state those assumptions and check that they’re actually true.

@is_shape((-1, 10))
@is_monotonic(strict=True)
@none_missing()
def compute(df):
    # complex operations to determine result
    ...
    return result

We state our assumptions as decorators, and verify that they hold for the result of the function.

engarde is similar in spirit to the R library assertr.

Usage

There are two main ways to use engarde, depending on whether you’re working interactively or not. For interactive use, I’d suggest using DataFrame.pipe to run the check. For non-interactive use, each of the checks is wrapped into a decorator. You can decorate the functions that make up your ETL pipeline with the checks that should hold true at that stage in the pipeline. Check out Example to see engarde in action.
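
As a rough sketch of the two styles, the snippet below uses a hypothetical stand-in check, no_missing, written in plain pandas rather than imported from engarde. It only assumes what the text says: a check takes a DataFrame, asserts something, and hands the same DataFrame back.

```python
import functools

import pandas as pd


def no_missing(df):
    """Hypothetical stand-in for a check: assert, then return df unchanged."""
    assert not df.isnull().any().any(), "found missing values"
    return df


def no_missing_decorator(func):
    """Hypothetical decorator form: run the check on whatever func returns."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return no_missing(func(*args, **kwargs))
    return wrapper


# Interactive style: chain the check with DataFrame.pipe.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
checked = df.pipe(no_missing)


# Pipeline style: declare the check on the function that produces the data.
@no_missing_decorator
def load():
    return pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})


result = load()
```

Because the check returns its input, the pipe version can sit in the middle of a longer method chain without changing what flows through it.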

Installation and Dependencies

Engarde itself is pure Python, so I’d just use pip:

pip install engarde

It does depend on pandas, which may be more difficult to pip install. You might consider conda if you have trouble installing pandas and its dependencies. Once you have the dependencies sorted out, pip install engarde should work.

If you’re using conda, engarde is available in the conda-forge channel:

conda install -c conda-forge engarde

Example

Engarde really shines when you have a dataset that regularly receives updates. We’ll work with a dataset of customer preferences on trains. This is a static dataset and isn’t being updated, but you could imagine that each month the Dutch authorities upload a new month’s worth of data.

We can start by making some very basic assertions: that the dataset is the correct shape, and that a few columns have the correct dtypes. Assertions are made as decorators to functions that return a DataFrame.

In [1]: import pandas as pd

In [2]: import engarde.decorators as ed

In [3]: pd.set_option('display.max_rows', 10)

In [4]: dtypes = dict(
   ...:     price1=int,
   ...:     price2=int,
   ...:     time1=int,
   ...:     time2=int,
   ...:     change1=int,
   ...:     change2=int,
   ...:     comfort1=int,
   ...:     comfort2=int
   ...: )

In [5]: @ed.is_shape((None, 11))
   ...: @ed.has_dtypes(items=dtypes)
   ...: def unload():
   ...:     trains = pd.read_csv("data/trains.csv", index_col=0)
   ...:     return trains

In [6]: unload()
Out[6]:
     id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  \
1     1         1  choice1    2400    150        0         1    4000    150
2     1         2  choice1    2400    150        0         1    3200    130
3     1         3  choice1    2400    115        0         1    4000    115
4     1         4  choice2    4000    130        0         1    3200    150
5     1         5  choice2    2400    150        0         1    3200    150
..   ..       ...      ...     ...    ...      ...       ...     ...    ...
347  30         7  choice1    2100    135        1         1    2800    135
348  30         8  choice1    2100    125        1         1    3500    125
349  30         9  choice1    2100    150        0         0    2800    125
350  30        10  choice1    2800    125        0         1    2800    135
351  30        11  choice2    3500    125        1         0    2800    135

     change2  comfort2
1          0         1
2          0         1
3          0         0
4          0         0
5          0         0
..       ...       ...
347        1         0
348        1         0
349        0         1
350        1         0
351        1         0

[351 rows x 11 columns]

One very important part of the design of Engarde is that your code, the code actually doing the work, shouldn’t have to change. I don’t want a bunch of asserts cluttering up the logic of what’s happening. This is a perfect case for decorators.

The order of execution is as follows. First, unload returns the DataFrame, trains. Next, ed.has_dtypes asserts that trains has the correct dtypes, as specified by dtypes. Once that assert passes, has_dtypes passes trains along to the next check, ed.is_shape, and so on, until the original caller gets back trains.
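
To make that ordering concrete, here is a small pure-Python sketch. The check factory is hypothetical and only logs its name instead of asserting anything; the point is just to show that the decorator closest to the function runs first.

```python
import functools

calls = []  # records the order in which things run


def check(name):
    """Hypothetical decorator factory that only logs when its check runs."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)  # run everything below this check first
            calls.append(name)              # then "run" this check on the result
            return result
        return wrapper
    return decorate


@check("is_shape")
@check("has_dtypes")
def unload():
    calls.append("unload")
    return "trains"


value = unload()
print(calls)  # ['unload', 'has_dtypes', 'is_shape']
```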

Each row of this dataset contains a passenger’s preference over two routes. Each route has an associated cost, travel time, comfort level, and number of changes. Like any good economist, we’ll assume people are rational: their first choice is surely going to be better in at least one way than their second choice (faster, more comfortable, ...). This is fundamental to our analysis later on, so we’ll explicitly state it in our code, and check it in our data.

 In [7]: def rational(df):
    ...:     """
    ...:     Check that at least one criterion is better.
    ...:     """
    ...:     r = ((df.price1 < df.price2) | (df.time1 < df.time2) |
    ...:          (df.change1 < df.change2) | (df.comfort1 > df.comfort2))
    ...:     return r
    ...:

 In [8]: @ed.is_shape((None, 11))
    ...: @ed.has_dtypes(items=dtypes)
    ...: @ed.verify_all(rational)
    ...: def unload():
    ...:     trains = pd.read_csv("data/trains.csv", index_col=0)
    ...:     return trains
    ...:

 In [9]: df = unload()
 ---------------------------------------------------------------------------
 AssertionError                            Traceback (most recent call last)
 <ipython-input-9-b108f050ce4e> in <module>()
 ----> 1 df = unload()

 /Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*args, **kwargs)
      22         @wraps(func)
      23         def wrapper(*args, **kwargs):
 ---> 24             result = func(*args, **kwargs)
      25             ck.is_shape(result, shape)
      26             return result

 /Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*args, **kwargs)
     115         @wraps(func)
     116         def wrapper(*args, **kwargs):
 --> 117             result = func(*args, **kwargs)
     118             ck.has_dtypes(result, items)
     119             return result

 /Users/tom.augspurger/sandbox/engarde/engarde/decorators.py in wrapper(*operation_args, **operation_kwargs)
     147         def wrapper(*operation_args, **operation_kwargs):
     148             result = operation_func(*operation_args, **operation_kwargs)
 --> 149             vfunc(result, func, *args, **kwargs)
     150             return result
     151         return wrapper

 /Users/tom.augspurger/sandbox/engarde/engarde/generic.py in verify_all(df, check, *args, **kwargs)
      40     result = check(df, *args, **kwargs)
      41     try:
 ---> 42         assert np.all(result)
      43     except AssertionError as e:
      44         msg = "{} not true for all".format(check.__name__)

 AssertionError: ('rational not true for all',      id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  \
 13    2         3  choice2    2450    121        0         0    2450     93
 18    2         8  choice2    2975    108        0         0    2450    108
 27    3         6  choice2    1920    106        0         0    1440     96
 28    3         7  choice1    1920    106        0         0    1920     96
 33    4         1  choice2     545    105        1         1     545     85
 ..   ..       ...      ...     ...    ...      ...       ...     ...    ...
 306  27         9  choice1    3920    140        1         1    3920    125
 319  28         8  choice2    2450    133        1         1    2450    108
 325  28        14  choice2    2450    123        0         1    2450    108
328  28        17  choice2    2815    108        0         1    2450    108
330  29         2  choice2    2800    140        2         0    2800    120

     change2  comfort2
13         0         1
18         0         1
27         0         1
28         0         1
33         1         1
..       ...       ...
306        0         2
319        0         2
325        0         2
328        0         2
330        0         1

[42 rows x 11 columns])

So our check failed; apparently people aren’t rational... Engarde has printed the name of the failed assertion and the rows that are False. We’re simply reusing pandas’ printing machinery, so set pd.options.display.max_rows to display more or fewer rows.
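
The following is a minimal sketch of that behaviour, loosely modelled on the traceback above. It is not engarde’s own code: verify_all_sketch, positive_price, and the tiny price column are all made up for illustration.

```python
import pandas as pd


def verify_all_sketch(df, check):
    """Hypothetical re-creation of the behaviour above, not engarde's own code."""
    result = check(df)
    if not result.all():
        msg = "{} not true for all".format(check.__name__)
        # Attach the offending rows so they show up next to the message.
        raise AssertionError(msg, df[~result])
    return df


def positive_price(df):
    return df["price"] > 0


df = pd.DataFrame({"price": [10, -5, 20, 0]})

try:
    verify_all_sketch(df, positive_price)
except AssertionError as e:
    message, bad_rows = e.args
    # Temporarily limit how many rows pandas prints for the failing frame.
    with pd.option_context("display.max_rows", 2):
        print(message)
        print(bad_rows)
```

Using pd.option_context keeps the display change local to the block, which can be handier than setting the option globally while you are debugging.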

We’ll fix this problem by ignoring those people (why change your mind when you can change the data?).

In [16]: @ed.verify_all(rational)
   ....: def drop_silly_people(df):
   ....:     r = df.query("price1 < price2 | time1 < time2 |"
   ....:                  "change1 < change2 | comfort1 > comfort2")
   ....:     return r
   ....:

In [17]: @ed.is_shape((None, 11))
   ....: @ed.has_dtypes(items=dtypes)
   ....: def unload():
   ....:     trains = pd.read_csv("data/trains.csv", index_col=0)
   ....:     return trains

In [18]: df = unload().pipe(drop_silly_people)

In [19]: df.head()
Out[19]:
   id  choiceid   choice  price1  time1  change1  comfort1  price2  time2  \
1   1         1  choice1    2400    150        0         1    4000    150
2   1         2  choice1    2400    150        0         1    3200    130
3   1         3  choice1    2400    115        0         1    4000    115
4   1         4  choice2    4000    130        0         1    3200    150
5   1         5  choice2    2400    150        0         1    3200    150

   change2  comfort2
1        0         1
2        0         1
3        0         0
4        0         0
5        0         0

All of our assertions have “passed” now, so we’re happy and our analysis can proceed.

Design

It’s important that engarde not get in your way. Your task is hard enough without a bunch of assertions cluttering up the logic of the code. And yet, it does help to explicitly state the assumptions fundamental to your analysis. Decorators provide a nice compromise.

Checks

Each check takes a DataFrame and any arguments necessary for the check, asserts the truth of the check, and returns the original DataFrame. If the assertion fails, an AssertionError is raised and engarde tries to print out some useful information about where the failure occurred.

The exceptions to the above rule are the generic assertions verify, verify_all, and verify_any. These take an additional argument, assertion_func, a function taking a DataFrame and returning a boolean or an array of booleans. You can think of any of the built-in checks, like none_missing, as special cases of the generic verify functions where assertion_func has been fixed.
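
One way to picture the "special case" idea is with functools.partial. The names generic_verify_all and not_null below are hypothetical, and this is only a sketch of the relationship, not how engarde defines none_missing internally.

```python
import functools

import numpy as np
import pandas as pd


def generic_verify_all(df, assertion_func):
    """Hypothetical generic check: assertion_func(df) must be true everywhere."""
    result = assertion_func(df)
    assert np.asarray(result).all(), "{} not true for all".format(
        getattr(assertion_func, "__name__", "check"))
    return df


def not_null(df):
    return df.notnull()


# A "built-in" check is the generic one with assertion_func fixed in advance.
none_missing = functools.partial(generic_verify_all, assertion_func=not_null)

clean = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})
dirty = pd.DataFrame({"a": [1.0, None], "b": ["x", "y"]})

none_missing(clean)  # passes and returns the frame
```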

Decorators

Each check has an associated decorator. The decorator simply marshals arguments, allowing you to make your assertions outside the actual logic of your code. Personally, this is the most compelling use-case for engarde. You have a data source that pushes updates to a dataset. The updates are (or should be) similarly shaped. Perhaps you have some automated reporting derived from the dataset, and you wish to fail early if a crucial assumption is violated.
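
Here is a sketch of what "marshalling arguments" could look like. Both has_min_rows and as_decorator are hypothetical helpers invented for this example; the assumption is only that a decorator stores the check’s arguments and applies the check to the wrapped function’s result.

```python
import functools

import pandas as pd


def has_min_rows(df, n):
    """Hypothetical check: the frame must have at least n rows."""
    assert len(df) >= n, "expected at least {} rows, got {}".format(n, len(df))
    return df


def as_decorator(check):
    """Hypothetical helper: turn check(df, *args, **kwargs) into a decorator factory."""
    def factory(*args, **kwargs):
        def decorate(func):
            @functools.wraps(func)
            def wrapper(*func_args, **func_kwargs):
                result = func(*func_args, **func_kwargs)
                check(result, *args, **kwargs)  # forward the stored arguments
                return result
            return wrapper
        return decorate
    return factory


min_rows = as_decorator(has_min_rows)


@min_rows(n=2)
def monthly_update(rows):
    # The function body contains no assertions of its own.
    return pd.DataFrame({"value": list(range(rows))})


report = monthly_update(3)
```

If a later update arrives with too few rows, the call fails at this stage instead of somewhere downstream in the report.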

API

checks

This file contains the functions doing the actual asserts. You can potentially use this file during interactive sessions, probably via the pipe method.

checks.py

Each function in here should

  • Take a DataFrame as its first argument, followed by any optional arguments
  • Make its assert on the result
  • Return the original DataFrame

engarde.checks.is_monotonic(df, items=None, increasing=None, strict=False)

Asserts that the DataFrame is monotonic.

Parameters:

df : Series or DataFrame

items : dict

mapping columns to conditions (increasing, strict)

increasing : None or bool

None means either increasing or decreasing is accepted.

strict : bool

whether the comparison should be strict

Returns:

df : DataFrame
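
To illustrate how increasing and strict interact, here is a sketch for a single Series built on pandas’ is_monotonic_increasing and is_monotonic_decreasing properties. The helper name is hypothetical, it returns a bool rather than asserting, and it does not cover the items argument.

```python
import pandas as pd


def is_monotonic_sketch(s, increasing=None, strict=False):
    """Hypothetical helper for one Series; returns a bool instead of asserting."""
    up = s.is_monotonic_increasing
    down = s.is_monotonic_decreasing
    if increasing is True:
        ok = up
    elif increasing is False:
        ok = down
    else:
        ok = up or down          # None: either direction is acceptable
    if strict:
        ok = ok and s.is_unique  # strict: monotonic with no repeated values
    return bool(ok)


print(is_monotonic_sketch(pd.Series([1, 2, 2, 3])))               # True
print(is_monotonic_sketch(pd.Series([1, 2, 2, 3]), strict=True))  # False
print(is_monotonic_sketch(pd.Series([3, 2, 1]), increasing=True)) # False
```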

engarde.checks.is_same_as(df, df_to_compare, **kwargs)

Assert that two pandas DataFrames are equal

Parameters:

df : pandas DataFrame

df_to_compare : pandas DataFrame

**kwargs : dict

keyword arguments passed through to pandas’ assert_frame_equal

Returns:

df : DataFrame
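
As an example of the kind of keyword you might pass through, the sketch below calls pandas.testing.assert_frame_equal directly. check_dtype is a real keyword of that pandas function; that engarde forwards it unchanged is my assumption based on the **kwargs description above.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"a": [1, 2, 3]})
right = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

# Same values but int64 vs float64: the default comparison rejects this.
try:
    assert_frame_equal(left, right)
    strict_equal = True
except AssertionError:
    strict_equal = False

# Relaxing the dtype check is the kind of option that **kwargs would forward.
assert_frame_equal(left, right, check_dtype=False)
```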

engarde.checks.is_shape(df, shape)

Asserts that the DataFrame is of a known shape.

Parameters:

df : DataFrame

shape : tuple

(n_rows, n_columns). Use None or -1 if you don’t care about a dimension.

Returns:

df : DataFrame
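
The wildcard rule can be sketched in a few lines of plain Python. shape_matches is a hypothetical helper that returns a bool; it assumes that None and -1 both mean "any size" for that dimension, as described above.

```python
def shape_matches(actual, expected):
    """Hypothetical helper: compare shapes, skipping None or -1 entries."""
    if len(actual) != len(expected):
        return False
    return all(want in (None, -1) or have == want
               for have, want in zip(actual, expected))


print(shape_matches((351, 11), (None, 11)))  # True
print(shape_matches((351, 11), (-1, 10)))    # False
```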

engarde.checks.none_missing(df, columns=None)

Asserts that there are no missing values (NaNs) in the DataFrame.

Parameters:

df : DataFrame

columns : list

list of columns to restrict the check to

Returns:

df : DataFrame

same as the original

engarde.checks.unique_index(df)

Assert that the index is unique

Parameters:

df : DataFrame

Returns:

df : DataFrame

engarde.checks.within_n_std(df, n=3)

Assert that every value is within n standard deviations of its column’s mean.

Parameters:

df : DataFrame

n : int

number of standard deviations from the mean

Returns:

df : DataFrame
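
A sketch of the underlying test, using a hypothetical helper that returns a boolean mask rather than asserting. It assumes the sample standard deviation that pandas computes by default, which may or may not match what engarde uses.

```python
import pandas as pd


def within_n_std_mask(df, n=3):
    """Hypothetical helper: True where a value is within n std devs of its column mean."""
    deviation = (df - df.mean()).abs()
    return deviation.le(n * df.std(), axis="columns")


# Nineteen ordinary values and one obvious outlier.
df = pd.DataFrame({"x": [10.0] * 19 + [1000.0]})
mask = within_n_std_mask(df, n=3)
print(df[~mask["x"]])  # only the row holding 1000.0
```

Note that with very few rows a single outlier inflates the standard deviation so much that it can never land outside 3 standard deviations, so this kind of check is more useful on larger datasets.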

engarde.checks.within_range(df, items=None)

Assert that a DataFrame is within a range.

Parameters:

df : DataFrame

items : dict

mapping of columns (k) to a (low, high) tuple (v) that df[k] is expected to be between.

Returns:

df : DataFrame
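
For illustration, here is a sketch of the items mapping in action. out_of_range is a hypothetical helper that returns the violating rows, and it treats both bounds as inclusive, which is an assumption on my part.

```python
import pandas as pd


def out_of_range(df, items):
    """Hypothetical helper: return the rows that violate any (low, high) bound."""
    bad = pd.Series(False, index=df.index)
    for col, (low, high) in items.items():
        bad |= ~df[col].between(low, high)  # inclusive on both ends
    return df[bad]


trains = pd.DataFrame({"price1": [2400, 4000, -1], "time1": [150, 130, 500]})
violations = out_of_range(trains, {"price1": (0, 5000), "time1": (0, 300)})
print(violations)  # the last row fails both bounds
```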

engarde.checks.within_set(df, items=None)

Assert that df is a subset of items

Parameters:

df : DataFrame

items : dict

mapping of columns (k) to array-like of values (v) that df[k] is expected to be a subset of

Returns:

df : DataFrame
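
The same idea for set membership, sketched with Series.isin. not_in_set is a hypothetical helper that reports the unexpected values per column instead of raising.

```python
import pandas as pd


def not_in_set(df, items):
    """Hypothetical helper: map each column to the values outside its allowed set."""
    unexpected = {}
    for col, allowed in items.items():
        extra = df.loc[~df[col].isin(list(allowed)), col].unique().tolist()
        if extra:
            unexpected[col] = extra
    return unexpected


df = pd.DataFrame({"choice": ["choice1", "choice2", "choice3"]})
unexpected = not_in_set(df, {"choice": {"choice1", "choice2"}})
print(unexpected)  # {'choice': ['choice3']}
```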

engarde.checks.has_dtypes(df, items)

Assert that a DataFrame has dtypes

Parameters:

df : DataFrame

items : dict

mapping of columns to dtype.

Returns:

df : DataFrame
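
A sketch of a dtype comparison against such a mapping. dtype_mismatches is a hypothetical helper; it uses pandas.api.types.pandas_dtype to normalise the expected values, and it spells the dtypes as "int64" so the result does not depend on the platform’s default integer size.

```python
import pandas as pd
from pandas.api.types import pandas_dtype


def dtype_mismatches(df, items):
    """Hypothetical helper: list columns whose dtype differs from the expected one."""
    return [col for col, expected in items.items()
            if df[col].dtype != pandas_dtype(expected)]


df = pd.DataFrame({"price1": [2400, 4000], "time1": [150.5, 130.0]})
mismatches = dtype_mismatches(df, {"price1": "int64", "time1": "int64"})
print(mismatches)  # ['time1']
```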

engarde.checks.verify(df, check, *args, **kwargs)

Generic verify. Assert that check(df, *args, **kwargs) is true.

Parameters:

df : DataFrame

check : function

Should take a DataFrame plus any *args and **kwargs, and return a bool

Returns:

df : DataFrame

same as the input.

engarde.checks.verify_all(df, check, *args, **kwargs)

Verify that all the entries in check(df, *args, **kwargs) are true.

engarde.checks.verify_any(df, check, *args, **kwargs)

Verify that any of the entries in check(df, *args, **kwargs) is true

decorators

engarde.decorators.none_missing(columns=None)

Asserts that no missing values (NaN) are found

engarde.decorators.within_range(items)

Check that a DataFrame’s values are within a range.

Parameters:

items : dict or array-like

dict maps columns to (lower, upper); array-like checks the same (lower, upper) for each column

engarde.decorators.within_set(items)

Check that DataFrame values are within set.

>>> @within_set({'A': {1, 3}})
... def f(df):
...     return df

engarde.decorators.has_dtypes(items)

Tests that the dtypes are as specified in items.

engarde.decorators.verify(func, *args, **kwargs)

Assert that func(df, *args, **kwargs) is true.

engarde.decorators.verify_all(func, *args, **kwargs)

Assert that all of func(df, *args, **kwargs) are true.

engarde.decorators.verify_any(func, *args, **kwargs)

Assert that any of func(df, *args, **kwargs) is true.

engarde.decorators.within_n_std(n=3)

Tests that all values are within n standard deviations (3 by default) of their column’s mean.

This file provides a nice API for each of the checks, designed to fit seamlessly into an ETL pipeline. Each of the functions defined here can be applied to a function that returns a DataFrame.