Welcome to pybroom’s documentation!

Latest Version: 0.2

Pybroom is a small python 3+ library for converting collections of fit results (curve fitting or other optimizations) to Pandas DataFrames in tidy format (also known as long-form) (Wickham 2014). Once fit results are in tidy DataFrames, it is possible to leverage common patterns for tidy data analysis. Furthermore, powerful visual exploration using multi-facet plots becomes easy thanks to libraries like seaborn that natively support tidy DataFrames.

Installation

You can install pybroom from PyPI using the following command:

pip install pybroom

or from conda-forge using:

conda install -c conda-forge pybroom

Dependencies are python 3.4+, pandas and lmfit (0.9.5+, which in turn requires scipy). In addition, matplotlib and seaborn are strongly recommended (and necessary to run the example notebooks).

Why Pybroom?

The Problem

DataFrames in tidy format (or long-form) follow a simple rule: one “observation” per row and one “variable” per column. This simple structure makes it easy to process the data with clear and well-understood idioms (filtering, aggregation, etc.) and allows plot libraries to automatically generate complex plots in which many variables are compared. Plotting libraries supporting tidy DataFrames include seaborn, recent versions of matplotlib, bokeh and altair.

But, while data is oftentimes represented in tidy DataFrames, fit results are usually stored in a variety of custom objects and are harder to manipulate, compare and plot.

Pybroom to the rescue!

Pybroom converts several types of fit results to tidy DataFrames, and is particularly useful for handling collections of such fit results. Pybroom's development was inspired by the R library broom. You can watch this video for details of the philosophy behind broom (and, by extension, pybroom).

Like the R library broom, pybroom provides 3 functions: glance, tidy and augment. The glance function returns fit statistics, one row for each fit result (e.g. fit method, number of iterations, chi-square, etc.). The tidy function returns data for each fitted parameter (e.g. fitted value, gradient, bounds, etc.). The augment function returns data with the same size as the fitted data points (evaluated best-fit model, residuals, etc.). Additionally, pybroom has two functions, tidy_to_dict and dict_to_tidy, for conversion between dictionaries and two-column tidy DataFrames.

Collections of fit results can be in a list, a dict, or any nested dict/list combination. When a collection of fit results is used as input, pybroom functions return a DataFrame with additional "categorical" column(s) containing the dict keys or the list index, as sketched below.
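For example, a minimal sketch (assuming fit_res1 … fit_res4 are supported fit result objects, e.g. lmfit.model.ModelResult):

import pybroom as br

# dict of lists: the dict key selects the model, the list position the dataset
results = {'model_A': [fit_res1, fit_res2],
           'model_B': [fit_res3, fit_res4]}
# the returned DataFrame gains one categorical column per nesting level
dg = br.glance(results, var_names=['model', 'dataset'])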

Currently, the supported fit result objects are:

  • scipy.optimize.OptimizeResult returned by several functions in scipy.optimize;
  • lmfit.model.ModelResult (returned by lmfit.Model.fit());
  • lmfit.minimizer.MinimizerResult (returned by lmfit.minimize()).

Note that the 3 functions (glance, tidy and augment) are implemented only for the fit-result objects that are relevant. For example, augment cannot process lmfit’s MinimizerResult or scipy’s OptimizeResult because there is little or no data relevant to each data point.

Support for result objects from other libraries such as sklearn can be added based on user request (PR welcome!).

Pybroom’s Release Notes

Version 0.2

  • Improved support for scipy.optimize fit results.

  • In addition to list of fit results, pybroom now supports:

    • dict of fit results,
    • dict of lists of fit results
    • any other nested combination.
  • When the input contains a dict, pybroom adds a “key” column of type pandas.Categorical. When the input contains a list, pybroom adds a “key” column (i.e. the list index) of type int64.

  • Updated and expanded documentation and notebooks.

Version 0.1

  • Support lmfit and scipy.optimize fit results.
  • Support lists of fit results.

Pybroom API Documentation

This module contains pybroom's 3 main functions:

These functions take one or multiple fit results as input and return a “tidy” (or long-form) DataFrame. The glance function returns fit statistics, one row for each fit result (e.g. fit method, number of iterations, chi-square, etc.). The tidy function returns data for each fitted parameter (e.g. fitted value, gradient, bounds, etc.). The augment function returns data with the same size as the fitted data points (evaluated best-fit model, residuals, etc.).

In the case of multiple fit results, pybroom functions accept a list, a dict or a nested structure of dict and lists (for example a dict of lists of fit results). The example below shows some use cases.

Note

pybroom functions are particularly convenient when tidying a collection of fit results. The following examples are valid for all the 3 pybroom functions. If results is a list of datasets (e.g. data replicates), the returned dataframe will have an additional “index” column containing the index of the dataset in the list. If results is a dict of fit results (e.g. results from different fit methods or models on the same dataset), then the “index” column contains the keys of the dict (each key identifies a fit result). In the previous two examples, var_names should contain the name of the “index” column (a string). Nested structures are also possible. For example, when fitting a list of datasets with different methods, we can build a dict of lists of fit results where the dict keys are the method names and the items in the lists are fit results for the different datasets. In this case the returned dataframe has two additional “index” columns: one with the dict keys and one with the list index. The tuple (key, list index) identifies each single fit result. In this case var_names should be a list of two column names, one for the keys and one for the list index (a list of strings).

Example

The following examples show pybroom output when multiple fit results are used. The glance function is used as an example, but the same logic (and input arguments) applies to tidy and augment as well.

Input is a list of fit results:

>>> results = [fit_res1, fit_res2, fit_res3]
>>> br.glance(results, var_names='dataset')

  num_params num_data_points      redchi      AIC  dataset
0          6             101  0.00911793 -468.634        0
1          6             101  0.00996431 -459.669        1
2          6             101   0.0109456 -450.183        2

Input is a dict of fit results:

>>> results = {'A': fit_res1, 'B': fit_res2, 'C': fit_res3}
>>> br.glance(results, var_names='function')

  num_params num_data_points      redchi      AIC function
0          6             101  0.00911793 -468.634        A
1          6             101  0.00996431 -459.669        B
2          6             101   0.0109456 -450.183        C

Input is a dict of lists of fit results:

>>> results = {'A': [fit_res1, fit_res2], 'B': [fit_res3, fit_res4]}
>>> br.glance(results, var_names=['function', 'dataset'])

  num_params num_data_points      redchi      AIC  dataset function
0          6             101  0.00911793 -468.634        0        A
1          6             101  0.00996431 -459.669        1        A
2          6             101   0.0109456 -450.183        0        B
3          6             101   0.0176529 -401.908        1        B

Main Functions

The 3 high-level functions glance(), tidy() and augment() allow tidying one or more fit results. These are pybroom's most generic functions, accepting all the supported fit result objects, as well as a list/dict of such objects. See also the examples at the beginning of this page and the example notebooks.

pybroom.glance(results, var_names='key', **kwargs)

Tidy DataFrame containing fit summaries from results.

A function to tidy any of the supported fit results (or a list of fit results). This function identifies the input type and calls the relative “specialized” tidying function. When the input is a list, the returned DataFrame contains data from all the fit results. Supported fit result objects are lmfit.ModelResult, lmfit.MinimizerResult and scipy.optimize.OptimizeResult.

Parameters:
  • results (fit result object or list) – one of the supported fit result objects or a list of supported fit result objects. When a list, all the elements need to be of the same type.
  • var_names (string or list) – name(s) of the column(s) containing an “index” that is different for each element in the set of fit results.
  • **kwargs – additional arguments passed to the underlying specialized tidying function.
Returns:

A DataFrame with one row for each passed fit result. Columns include fit summaries such as reduced chi-square, number of function evaluations, successful convergence, AIC, BIC, etc. When a list of fit-result objects is passed, the column named var_names (‘key’ by default) contains the index of the object in the list.

See also

For more details on the returned DataFrame and on additional arguments refer to the specialized tidying functions: glance_lmfit_result() and glance_scipy_result().

pybroom.tidy(result, var_names='key', **kwargs)

Tidy DataFrame containing fitted parameter data from result.

A function to tidy any of the supported fit results (or a list of fit results). This function identifies the input type and calls the relative “specialized” tidying function. When the input is a list, the returned DataFrame contains data from all the fit results. Supported fit result objects are lmfit.ModelResult, lmfit.MinimizerResult and scipy.optimize.OptimizeResult.

Parameters:
  • result (fit result object or list) – one of the supported fit result objects or a list of supported fit result objects. When a list, all the elements need to be of the same type.
  • var_names (string or list) – name(s) of the column(s) containing an “index” that is different for each element in the set of fit results.
  • param_names (string or list of string) – names of the fitted parameters for fit results that don’t include parameter names (such as scipy’s OptimizeResult). It can either be a list of strings or a single string with space-separated names.
  • **kwargs – additional arguments passed to the underlying specialized tidying function.
Returns:

A DataFrame with one row for each fitted parameter. Columns include parameter properties such as best-fit value, standard error, bounds/constraints (if any), etc. When a list of fit-result objects is passed, the column named var_names (‘key’ by default) contains the index of the object in the list.

See also

For more details on the returned DataFrame and on additional arguments refer to the specialized tidying functions: tidy_lmfit_result() and tidy_scipy_result().

pybroom.augment(results, var_names='key', **kwargs)

Tidy DataFrame containing fit data from result.

A function to tidy any of the supported fit results (or a list of fit results). This function identifies the input type and calls the relative “specialized” tidying function. When the input is a list or a dict of fit results, the returned DataFrame contains data from all the fit results. In this case, data from different fit results is identified by the values in the additional “index” (or categorical) column(s) whose name(s) are specified in var_names.

Parameters:
  • results (fit result object or list) – one of the supported fit result objects or a list of supported fit result objects. When a list, all the elements need to be of the same type.
  • var_names (string or list) – name(s) of the column(s) containing an “index” that is different for each element in the set of fit results. See the example section below.
  • **kwargs – additional arguments passed to the underlying specialized tidying function.
Returns:

A DataFrame with one row for each data point used in the fit. It contains the input data, the model evaluated at the data points with the best-fit parameters, error ranges, etc. When a list of fit-result objects is passed, the column named var_names (‘key’ by default) contains the index of the object in the list.

Dictionary conversions

The two functions tidy_to_dict() and dict_to_tidy() provide the ability to convert a tidy DataFrame to and from a python dictionary.

pybroom.tidy_to_dict(df, key='name', value='value', keys_exclude=None, cast_value=<class 'float'>)

Convert a tidy DataFrame into a dictionary.

This function converts two columns from an input tidy (or long-form) DataFrame into a dictionary. A typical use-case is passing parameters stored in a tidy DataFrame to a python function. The arguments key and value contain the names of the DataFrame columns containing the keys and the values of the dictionary.

Parameters:
  • df (pandas.DataFrame) – the “tidy” DataFrame containing the data. Two columns of this DataFrame should contain the keys and the values to construct the dictionary.
  • key (string or scalar) – name of the DataFrame column containing the keys of the dictionary.
  • value (string or scalar) – name of the DataFrame column containing the values of the dictionary.
  • keys_exclude (iterable or None) – list of keys excluded when building the returned dictionary.
  • cast_value (callable or None) – callable used to cast the value of each item in the dictionary. If None, no casting is performed and the resulting values are 1-element pandas.Series. Default is the python built-in float. Other typical values may be int or str.
Returns:

A dictionary with keys and values extracted from the input (tidy) DataFrame.

See also: dict_to_tidy().
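For example, a minimal sketch (the DataFrame here is built by hand for illustration):

import pandas as pd
import pybroom as br

df = pd.DataFrame({'name': ['mu', 'sigma'], 'value': [0.5, 1.2]})
br.tidy_to_dict(df)                  # -> {'mu': 0.5, 'sigma': 1.2}
br.tidy_to_dict(df, cast_value=str)  # -> {'mu': '0.5', 'sigma': '1.2'}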

pybroom.dict_to_tidy(dc, key='name', value='value', keys_exclude=None)

Convert a dictionary into a tidy DataFrame.

This function converts a dictionary into a “tidy” (or long-form) DataFrame with two columns: one containing the keys and the other containing the values from the dictionary. The names of the columns can be specified with the key and value arguments.

Parameters:
  • dc (dict) – the input dictionary used to build the DataFrame.
  • key (string or scalar) – name of the DataFrame column containing the keys of the dictionary.
  • value (string or scalar) – name of the DataFrame column containing the values of the dictionary.
  • keys_exclude (iterable or None) – list of keys excluded when building the returned DataFrame.
Returns:

A two-column tidy DataFrame containing the data from the dictionary.

See also: tidy_to_dict().
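For example, a minimal sketch (the parameter values are hypothetical):

import pybroom as br

params = {'mu': 0.5, 'sigma': 1.2}
df = br.dict_to_tidy(params)
# df is a two-column DataFrame (row order may vary):
#     name  value
# 0     mu    0.5
# 1  sigma    1.2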

Specialized functions

These are the specialized (i.e. low-level) functions, each converting one specific object to a tidy DataFrame.

pybroom.glance_scipy_result(result)

Tidy summary statistics from scipy’s OptimizeResult.

Normally this function is not called directly but invoked by the general purpose function glance().

Parameters:result (OptimizeResult) – the fit result object.
Returns:A DataFrame in tidy format with one row and several summary statistics as columns.

Note

Possible columns of the returned DataFrame include:

  • success (bool): whether the fit succeeded
  • cost (float): value of the cost function at the solution
  • optimality (float): optimality parameter as returned by scipy.optimize.least_squares.
  • nfev (int): number of objective function evaluations
  • njev (int): number of Jacobian evaluations
  • nit (int): number of iterations
  • status (int): status returned by the fit routine
  • message (string): message returned by the fit routine
pybroom.tidy_scipy_result(result, param_names, **kwargs)

Tidy parameters data from scipy’s OptimizeResult.

Normally this function is not called directly but invoked by the general purpose function tidy(). Since OptimizeResult stores the fitted parameters as a raw array without names, the parameter names must be passed via param_names.

Parameters:
  • result (OptimizeResult) – the fit result object.
  • param_names (string or list of string) – names of the fitted parameters. It can either be a list of strings or a single string with space-separated names.
Returns:

A DataFrame in tidy format with one row for each parameter.

Note

These two columns are always present in the returned DataFrame:

  • name (string): name of the parameter.
  • value (number): value of the parameter after the optimization.

Optional columns (depending on the type of result) are:

  • grad (float): gradient for each parameter
  • active_mask (int): whether each parameter sits at a lower (-1) or upper (1) bound, or at neither (0), as returned by scipy.optimize.least_squares
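For example, a minimal sketch (a hypothetical exponential-decay fit; the parameter names 'a' and 'k' are arbitrary labels supplied by the caller):

import numpy as np
import scipy.optimize as so
import pybroom as br

x = np.linspace(0, 10, 100)
y = 3 * np.exp(-0.5 * x) + np.random.randn(x.size) * 0.05

def residuals(p):
    # residuals of the model a * exp(-k * x)
    return y - p[0] * np.exp(-p[1] * x)

res = so.least_squares(residuals, x0=(1, 1))
dt = br.tidy(res, param_names='a k')  # one row per parameter, via tidy_scipy_result()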
pybroom.glance_lmfit_result(result)

Tidy summary statistics from lmfit’s ModelResult or MinimizerResult.

Normally this function is not called directly but invoked by the general purpose function glance().

Parameters:result (ModelResult or MinimizerResult) – the fit result object.
Returns:A DataFrame in tidy format with one row and several summary statistics as columns.

Note

The columns of the returned DataFrame are:

  • model (string): model name (only for ModelResult)
  • method (string): method used for the optimization (e.g. leastsq).
  • num_params (int): number of varied parameters
  • ndata (int): number of data points.
  • chisqr (float): chi-square statistics.
  • redchi (float): reduced chi-square statistics.
  • AIC (float): Akaike Information Criterion statistics.
  • BIC (float): Bayesian Information Criterion statistics.
  • num_func_eval (int): number of evaluations of the objective function during the fit.
  • num_data_points (int): number of data points (e.g. samples) used for the fit.
pybroom.tidy_lmfit_result(result)

Tidy parameters from lmfit’s ModelResult or MinimizerResult.

Normally this function is not called directly but invoked by the general purpose function tidy().

Parameters:result (ModelResult or MinimizerResult) – the fit result object.
Returns:A DataFrame in tidy format with one row for each parameter.

Note

The (possible) columns of the returned DataFrame are:

  • name (string): name of the parameter.
  • value (number): value of the parameter after the optimization.
  • init_value (number): initial value of the parameter before the optimization.
  • min, max (numbers): bounds of the parameter
  • vary (bool): whether the parameter has been varied during the optimization.
  • expr (string): constraint expression for the parameter.
  • stderr (float): standard error for the parameter.
pybroom._augment_lmfit_modelresult(result)

Tidy data values and fitted model from lmfit.model.ModelResult.

PyBroom Example - Simple

This notebook is part of pybroom.

This notebook shows the simplest usage of pybroom when performing a curve fit of a single dataset. Possible applications are only hinted at. For a more complex (and interesting!) example using multiple datasets see pybroom-example-multi-datasets.
In [1]:
import numpy as np
from numpy import sqrt, pi, exp, linspace
from lmfit import Model
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'  # for hi-dpi displays
In [2]:
import lmfit
print('lmfit: %s' % lmfit.__version__)
lmfit: 0.9.5
In [3]:
import pybroom as br

Create Noisy Data

In [4]:
x = np.linspace(-10, 10, 101)
In [5]:
peak1 = lmfit.models.GaussianModel(prefix='p1_')
peak2 = lmfit.models.GaussianModel(prefix='p2_')
model = peak1 + peak2
In [6]:
params = model.make_params(p1_amplitude=1, p2_amplitude=1,
                           p1_sigma=1, p2_sigma=1)
In [7]:
y_data = model.eval(x=x, p1_center=-1, p2_center=2, p1_sigma=0.5, p2_sigma=1, p1_amplitude=1, p2_amplitude=2)
y_data.shape
Out[7]:
(101,)
In [8]:
y_data += np.random.randn(*y_data.shape)/10
In [9]:
plt.plot(x, y_data)
Out[9]:
[<matplotlib.lines.Line2D at 0x7fd5ee324d30>]
_images/notebooks_pybroom-example_11_1.png

Model Fitting

In [10]:
params = model.make_params(p1_center=0, p2_center=3,
                           p1_sigma=0.5, p2_sigma=1,
                           p1_amplitude=1, p2_amplitude=2)
result = model.fit(y_data, x=x, params=params)

Fit results from an lmfit Model can be inspected with fit_report or params.pretty_print():

In [11]:
print(result.fit_report())
[[Model]]
    (Model(gaussian, prefix='p1_') + Model(gaussian, prefix='p2_'))
[[Fit Statistics]]
    # function evals   = 74
    # data points      = 101
    # variables        = 6
    chi-square         = 0.948
    reduced chi-square = 0.010
    Akaike info crit   = -459.550
    Bayesian info crit = -443.859
[[Variables]]
    p1_sigma:       0.50381801 +/- 0.041780 (8.29%) (init= 0.5)
    p1_center:     -0.94157635 +/- 0.041206 (4.38%) (init= 0)
    p1_amplitude:   1.04915437 +/- 0.075032 (7.15%) (init= 1)
    p2_sigma:       0.85357363 +/- 0.051774 (6.07%) (init= 1)
    p2_center:      2.03505014 +/- 0.050235 (2.47%) (init= 3)
    p2_amplitude:   1.88623359 +/- 0.097151 (5.15%) (init= 2)
    p1_fwhm:        1.18640072 +/- 0.098384 (8.29%)  == '2.3548200*p1_sigma'
    p1_height:      0.83076041 +/- 0.057986 (6.98%)  == '0.3989423*p1_amplitude/max(1.e-15, p1_sigma)'
    p2_fwhm:        2.01001225 +/- 0.121920 (6.07%)  == '2.3548200*p2_sigma'
    p2_height:      0.88158577 +/- 0.044848 (5.09%)  == '0.3989423*p2_amplitude/max(1.e-15, p2_sigma)'
[[Correlations]] (unreported correlations are <  0.100)
    C(p1_sigma, p1_amplitude)    =  0.600
    C(p2_sigma, p2_amplitude)    =  0.599
    C(p1_sigma, p2_sigma)        = -0.214
    C(p1_amplitude, p2_sigma)    = -0.203
    C(p1_center, p2_sigma)       = -0.163
    C(p1_sigma, p2_amplitude)    = -0.158
    C(p1_amplitude, p2_amplitude)  = -0.146
    C(p1_sigma, p2_center)       =  0.116
    C(p1_center, p2_amplitude)   = -0.116
    C(p1_amplitude, p2_center)   =  0.101

In [12]:
result.params.pretty_print()
Name             Value      Min      Max   Stderr     Vary     Expr
p1_amplitude     1.049     -inf      inf  0.07503     True     None
p1_center      -0.9416     -inf      inf  0.04121     True     None
p1_fwhm          1.186     -inf      inf  0.09838    False 2.3548200*p1_sigma
p1_height       0.8308     -inf      inf  0.05799    False 0.3989423*p1_amplitude/max(1.e-15, p1_sigma)
p1_sigma        0.5038        0      inf  0.04178     True     None
p2_amplitude     1.886     -inf      inf  0.09715     True     None
p2_center        2.035     -inf      inf  0.05024     True     None
p2_fwhm           2.01     -inf      inf   0.1219    False 2.3548200*p2_sigma
p2_height       0.8816     -inf      inf  0.04485    False 0.3989423*p2_amplitude/max(1.e-15, p2_sigma)
p2_sigma        0.8536        0      inf  0.05177     True     None

These methods are convenient, but extracting the data from the lmfit object requires some work and knowledge of the lmfit object structure.

pybroom comes to the rescue, extracting data from fit results and returning a pandas DataFrame in tidy format that can be much more easily manipulated, filtered and plotted.

Glance

Glancing at the fit results (dropping some verbose columns):

In [13]:
dg = br.glance(result)
dg.drop('model', 1).drop('message', 1)
Out[13]:
method num_params num_data_points chisqr redchi AIC BIC num_func_eval success
0 leastsq 6 101 0.947728 0.009976 -459.549628 -443.858905 74 True

The glance function returns a DataFrame with one row per fit-result object.

Application Idea

If you fit N models to the same dataset you can compare statistics such as the reduced-\(\chi^2\).

Or, fitting with several methods (and datasets), you can study the convergence properties using the reduced-\(\chi^2\), the number of function evaluations and the success rate.

Tidy

Tidy fit results for all the parameters:

In [14]:
dt = br.tidy(result)
dt
Out[14]:
name value min max vary expr stderr init_value
0 p1_amplitude 1.049154 -inf inf True None 0.075033 1.0
1 p1_center -0.941576 -inf inf True None 0.041206 0.0
2 p1_fwhm 1.186401 -inf inf False 2.3548200*p1_sigma 0.098385 NaN
3 p1_height 0.830760 -inf inf False 0.3989423*p1_amplitude/max(1.e-15, p1_sigma) 0.057987 NaN
4 p1_sigma 0.503818 0.000000 inf True None 0.041780 0.5
5 p2_amplitude 1.886234 -inf inf True None 0.097151 2.0
6 p2_center 2.035050 -inf inf True None 0.050235 3.0
7 p2_fwhm 2.010012 -inf inf False 2.3548200*p2_sigma 0.121920 NaN
8 p2_height 0.881586 -inf inf False 0.3989423*p2_amplitude/max(1.e-15, p2_sigma) 0.044849 NaN
9 p2_sigma 0.853574 0.000000 inf True None 0.051775 1.0

The tidy function returns one row for each parameter.

In [15]:
dt.loc[dt.name == 'p1_center']
Out[15]:
name value min max vary expr stderr init_value
1 p1_center -0.941576 -inf inf True None 0.041206 0.0
Augment

A tidy DataFrame with data as a function of the independent variable (‘x’). Columns include the data being fitted, the best fit, the best-fit components, residuals, etc.

In [16]:
da = br.augment(result)
da.head()
Out[16]:
x data best_fit residual Model(gaussian, prefix='p1_') Model(gaussian, prefix='p2_')
0 -10.0 -0.056887 5.979251e-44 0.056887 5.290582e-71 5.979251e-44
1 -9.8 0.012225 1.583027e-42 -0.012225 6.151539e-68 1.583027e-42
2 -9.6 -0.028639 3.967227e-41 0.028639 6.109787e-65 3.967227e-41
3 -9.4 -0.037673 9.411146e-40 0.037673 5.183588e-62 9.411146e-40
4 -9.2 0.014174 2.113270e-38 -0.014174 3.756617e-59 2.113270e-38

The augment function returns one row for each data point.

In [17]:
d = br.augment(result)
In [18]:
fig, ax = plt.subplots(2, 1, figsize=(7, 8))
ax[1].plot('x', 'data', data=d, marker='o', ls='None')
ax[1].plot('x', "Model(gaussian, prefix='p1_')", data=d, lw=2, ls='--')
ax[1].plot('x', "Model(gaussian, prefix='p2_')", data=d, lw=2, ls='--')
ax[1].plot('x', 'best_fit', data=d, lw=2)
ax[0].plot('x', 'residual', data=d);
_images/notebooks_pybroom-example_30_0.png
Application Idea

Fitting N datasets with the same model (or N models with the same dataset) you can automatically build a panel plot with seaborn using the dataset (or the model) as categorical variable. This example is illustrated in pybroom-example-multi-datasets.

PyBroom Example - Multiple Datasets

This notebook is part of pybroom.

This notebook demonstrates using pybroom when fitting a set of curves (curve fitting) using lmfit.Model. We will show that pybroom greatly simplifies comparing, filtering and plotting fit results from multiple datasets.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'  # for hi-dpi displays
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from lmfit import Model
import lmfit
print('lmfit: %s' % lmfit.__version__)
lmfit: 0.9.5
In [2]:
sns.set_style('whitegrid')
In [3]:
import pybroom as br

Create Noisy Data

We start simulating N datasets which are identical except for the additive noise.

In [4]:
N = 20
In [5]:
x = np.linspace(-10, 10, 101)
In [6]:
peak1 = lmfit.models.GaussianModel(prefix='p1_')
peak2 = lmfit.models.GaussianModel(prefix='p2_')
model = peak1 + peak2
In [7]:
#params = model.make_params(p1_amplitude=1.5, p2_amplitude=1,
#                           p1_sigma=1, p2_sigma=1)
In [8]:
Y_data = np.zeros((N, x.size))
Y_data.shape
Out[8]:
(20, 101)
In [9]:
for i in range(Y_data.shape[0]):
    Y_data[i] = model.eval(x=x, p1_center=-1, p2_center=2,
                           p1_sigma=0.5, p2_sigma=1.5,
                           p1_height=1, p2_height=0.5)
Y_data += np.random.randn(*Y_data.shape)/10
In [10]:
plt.plot(x, Y_data.T, '-k', alpha=0.1);
_images/notebooks_pybroom-example-multi-datasets_11_0.png

Model Fitting

Single-peak model

Define and fit a single Gaussian model to the \(N\) datasets:

In [11]:
model1 = lmfit.models.GaussianModel()
In [12]:
Results1 = [model1.fit(y, x=x) for y in Y_data]
Two-peaks model

Here, instead, we use a more appropriate Gaussian mixture model.

To fit the noisy data, the residuals (the difference between model and data) are minimized in the least-squares sense.

In [13]:
params = model.make_params(p1_center=0, p2_center=3,
                           p1_sigma=0.5, p2_sigma=1,
                           p1_amplitude=1, p2_amplitude=2)
In [14]:
Results = [model.fit(y, x=x, params=params) for y in Y_data]

Fit results from an lmfit Model can be inspected with fit_report or params.pretty_print():

In [15]:
#print(Results[0].fit_report())
#Results[0].params.pretty_print()

This is good for peeking at the results. However, extracting these data from lmfit objects is quite a chore and requires good knowledge of the lmfit object structure.

pybroom helps in this task: it extracts data from fit results and returns familiar pandas DataFrames (in tidy format). Thanks to the tidy format, these data can be much more easily manipulated, filtered and plotted.

Glance

A summary of the two-peaks model fit:

In [16]:
dg = br.glance(Results, var_names='dataset')

dg.drop('model', 1).drop('message', 1).head()
Out[16]:
method num_params num_data_points chisqr redchi AIC BIC num_func_eval success dataset
0 leastsq 6 101 1.029009 0.010832 -451.238919 -435.548196 108 True 0
1 leastsq 6 101 0.772442 0.008131 -480.205143 -464.514420 67 True 1
2 leastsq 6 101 0.679224 0.007150 -493.194374 -477.503651 108 True 2
3 leastsq 6 101 1.173722 0.012355 -437.949027 -422.258304 67 True 3
4 leastsq 6 101 0.899617 0.009470 -464.811547 -449.120824 108 True 4

A summary of the one-peak model fit:

In [17]:
dg1 = br.glance(Results1, var_names='dataset')

dg1.drop('model', 1).drop('message', 1).head()
Out[17]:
method num_params num_data_points chisqr redchi AIC BIC num_func_eval success dataset
0 leastsq 3 101 1.839862 0.018774 -398.548436 -390.703075 95 True 0
1 leastsq 3 101 1.734223 0.017696 -404.520681 -396.675319 51 True 1
2 leastsq 3 101 2.017742 0.020589 -389.227306 -381.381945 79 True 2
3 leastsq 3 101 2.475818 0.025263 -368.563507 -360.718145 71 True 3
4 leastsq 3 101 1.997141 0.020379 -390.263766 -382.418405 87 True 4
Tidy

Tidy fit results for all the parameters:

In [18]:
dt = br.tidy(Results, var_names='dataset')

Let’s see the results for a single dataset:

In [19]:
dt.query('dataset == 0')
Out[19]:
name value min max vary expr stderr init_value dataset
0 p1_amplitude 1.058758 -inf inf True None 0.101663 1.0 0
1 p1_center -0.919476 -inf inf True None 0.055832 0.0 0
2 p1_fwhm 1.332050 -inf inf False 2.3548200*p1_sigma 0.132812 NaN 0
3 p1_height 0.746697 -inf inf False 0.3989423*p1_amplitude/max(1.e-15, p1_sigma) 0.057760 NaN 0
4 p1_sigma 0.565670 0.000000 inf True None 0.056400 0.5 0
5 p2_amplitude 0.843828 -inf inf True None 0.137254 2.0 0
6 p2_center 2.098743 -inf inf True None 0.216380 3.0 0
7 p2_fwhm 2.822411 -inf inf False 2.3548200*p2_sigma 0.565785 NaN 0
8 p2_height 0.280867 -inf inf False 0.3989423*p2_amplitude/max(1.e-15, p2_sigma) 0.040433 NaN 0
9 p2_sigma 1.198568 0.000000 inf True None 0.240267 1.0 0

or for a single parameter across datasets:

In [20]:
dt.query('name == "p1_center"').head()
Out[20]:
name value min max vary expr stderr init_value dataset
1 p1_center -0.919476 -inf inf True None 0.055832 0.0 0
11 p1_center -0.920288 -inf inf True None 0.045847 0.0 1
21 p1_center -1.073648 -inf inf True None 0.034499 0.0 2
31 p1_center -1.004293 -inf inf True None 0.050047 0.0 3
41 p1_center -0.834925 -inf inf True None 0.054239 0.0 4
In [21]:
dt.query('name == "p1_center"')['value'].std()
Out[21]:
0.061450737267381442
In [22]:
dt.query('name == "p2_center"')['value'].std()
Out[22]:
0.33795899893243669

Note that there is a much larger error in fitting p2_center than p1_center.

In [23]:
dt.query('name == "p1_center"')['value'].hist()
dt.query('name == "p2_center"')['value'].hist(ax=plt.gca());
_images/notebooks_pybroom-example-multi-datasets_34_0.png
Augment

A tidy DataFrame with data as a function of the independent variable (‘x’). Columns include the data being fitted, the best fit, the best-fit components, residuals, etc.

In [24]:
da = br.augment(Results, var_names='dataset')
In [25]:
da1 = br.augment(Results1, var_names='dataset')
In [26]:
r = Results[0]

Let’s see the results for a single dataset:

In [27]:
da.query('dataset == 0').head()
Out[27]:
x data best_fit residual Model(gaussian, prefix='p1_') Model(gaussian, prefix='p2_') dataset
0 -10.0 0.065218 2.099674e-23 -0.065218 8.253463e-57 2.099674e-23 0
1 -9.8 0.200365 1.115915e-22 -0.200365 2.261483e-54 1.115915e-22 0
2 -9.6 -0.167772 5.767901e-22 0.167772 5.468405e-52 5.767901e-22 0
3 -9.4 -0.044383 2.899425e-21 0.044383 1.166912e-49 2.899425e-21 0
4 -9.2 -0.126514 1.417468e-20 0.126514 2.197484e-47 1.417468e-20 0

Plotting a single dataset is simplified compared to a manual plot:

In [28]:
da0 = da.query('dataset == 0')
In [29]:
plt.plot('x', 'data', data=da0, marker='o', ls='None')
plt.plot('x', "Model(gaussian, prefix='p1_')", data=da0, lw=2, ls='--')
plt.plot('x', "Model(gaussian, prefix='p2_')", data=da0, lw=2, ls='--')
plt.plot('x', 'best_fit', data=da0, lw=2);
plt.legend()
Out[29]:
<matplotlib.legend.Legend at 0x7f76ef370eb8>
_images/notebooks_pybroom-example-multi-datasets_43_1.png

But keep in mind that, for a single dataset, we could use the lmfit method as well (which is even simpler):

In [30]:
Results[0].plot_fit();
_images/notebooks_pybroom-example-multi-datasets_45_0.png

However, things become much more interesting when we want to plot multiple datasets or models as in the next section.

Comparison of different datasets
In [31]:
grid = sns.FacetGrid(da.query('dataset < 6'), col="dataset", hue="dataset", col_wrap=3)
grid.map(plt.plot, 'x', 'data', marker='o', ls='None', ms=3, color='k')
grid.map(plt.plot, 'x', "Model(gaussian, prefix='p1_')", ls='--')
grid.map(plt.plot, 'x', "Model(gaussian, prefix='p2_')", ls='--')
grid.map(plt.plot, "x", "best_fit");
_images/notebooks_pybroom-example-multi-datasets_48_0.png
Comparison of one- or two-peaks models

Here we plot a comparison of the two fitted models (one or two peaks) for different datasets.

First we create a single tidy DataFrame with data from the two models:

In [32]:
da['model'] = 'twopeaks'
da1['model'] = 'onepeak'
da_tot = pd.concat([da, da1], ignore_index=True)

Then we perform a facet plot with seaborn:

In [33]:
grid = sns.FacetGrid(da_tot.query('dataset < 6'), col="dataset", hue="model", col_wrap=3)
grid.map(plt.plot, 'x', 'data', marker='o', ls='None', ms=3, color='k')
grid.map(plt.plot, "x", "best_fit")
grid.add_legend();
_images/notebooks_pybroom-example-multi-datasets_53_0.png

Note that the “tidy” organization of data allows plot libraries such as seaborn to automatically infer most information to create complex plots with simple commands. Without tidy data, instead, a manual creation of such plots becomes a daunting task.

PyBroom Example - Multiple Datasets - Minimize

This notebook is part of pybroom.

This notebook demonstrates using pybroom when performing Maximum-Likelihood fitting (scalar minimization, as opposed to curve fitting) of a set of datasets with lmfit.minimize. We will show that pybroom greatly simplifies comparing, filtering and plotting fit results from multiple datasets. For an example using curve fitting see pybroom-example-multi-datasets.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'  # for hi-dpi displays
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pylab import normpdf
import seaborn as sns
from lmfit import Model
import lmfit
print('lmfit: %s' % lmfit.__version__)
lmfit: 0.9.5
In [2]:
sns.set_style('whitegrid')
In [3]:
import pybroom as br

Create Noisy Data

Simulate N datasets which are identical except for the additive noise.

In [4]:
N = 20  # number of datasets
n = 1000  # number of sample in each dataset

np.random.seed(1)
d1 = np.random.randn(20, int(0.6*n))*0.5 - 2
d2 = np.random.randn(20, int(0.4*n))*1.5 + 2
d = np.hstack((d1, d2))
In [5]:
ds = pd.DataFrame(data=d, columns=range(d.shape[1])).stack().reset_index()
ds.columns = ['dataset', 'sample', 'data']
ds.head()
Out[5]:
dataset sample data
0 0 0 -1.187827
1 0 1 -2.305878
2 0 2 -2.264086
3 0 3 -2.536484
4 0 4 -1.567296
In [6]:
kws = dict(bins = np.arange(-5, 5.1, 0.1), histtype='step',
           lw=2, color='k', alpha=0.1)
for i in range(N):
    ds.loc[ds.dataset == i, :].data.plot.hist(**kws)
_images/notebooks_pybroom-example-multi-datasets-minimize_7_0.png

Model Fitting

Two-peaks model

Here, we use a Gaussian mixture distribution for fitting the data.

We fit the data using the Maximum-Likelihood method, i.e. we minimize the (negative) log-likelihood function:

In [7]:
# Model PDF to be maximized
def model_pdf(x, a2, mu1, mu2, sig1, sig2):
    a1 = 1 - a2
    return (a1 * normpdf(x, mu1, sig1) +
            a2 * normpdf(x, mu2, sig2))

# Function to be minimized by lmfit
def log_likelihood_lmfit(params, x):
    pnames = ('a2', 'mu1', 'mu2', 'sig1', 'sig2')
    kws = {n: params[n] for n in pnames}
    return -np.log(model_pdf(x, **kws)).sum()

We define the parameters and “fit” the \(N\) datasets by minimizing the (scalar) function log_likelihood_lmfit:

In [8]:
params = lmfit.Parameters()
params.add('a2', 0.5, min=0, max=1)
params.add('mu1', -1, min=-5, max=5)
params.add('mu2', 1, min=-5, max=5)
params.add('sig1', 1, min=1e-6)
params.add('sig2', 1, min=1e-6)
params.add('ax', expr='a2')   # just a test for a derived parameter

Results = [lmfit.minimize(log_likelihood_lmfit, params, args=(di,),
                          nan_policy='omit', method='least_squares')
           for di in d]

Fit results can be inspected with lmfit.fit_report() or params.pretty_print():

In [9]:
print(lmfit.fit_report(Results[0]))
print()
Results[0].params.pretty_print()
[[Fit Statistics]]
    # function evals   = 137
    # data points      = 1
    # variables        = 6
    chi-square         = 3136682.002
    reduced chi-square = -627336.400
    Akaike info crit   = 26.959
    Bayesian info crit = 14.959
[[Variables]]
    a2:     0.39327059 (init= 0.5)
    mu1:   -1.96272254 (init=-1)
    mu2:    2.13070136 (init= 1)
    sig1:   0.49895469 (init= 1)
    sig2:   1.45925458 (init= 1)
    ax:     0.39327059  == 'a2'
[[Correlations]] (unreported correlations are <  0.100)

Name     Value      Min      Max   Stderr     Vary     Expr
a2      0.3933        0        1     None     True     None
ax      0.3933     -inf      inf     None    False       a2
mu1     -1.963       -5        5     None     True     None
mu2      2.131       -5        5     None     True     None
sig1     0.499    1e-06      inf     None     True     None
sig2     1.459    1e-06      inf     None     True     None

This is good for peeking at the results. However, extracting these data from lmfit objects is quite a chore and requires good knowledge of the lmfit object structure.

pybroom helps in this task: it extracts data from fit results and returns familiar pandas DataFrames (in tidy format). Thanks to the tidy format, these data can be much more easily manipulated, filtered and plotted.

Let’s use the glance and tidy functions:

In [10]:
dg = br.glance(Results)
dg.drop('message', 1).head()
Out[10]:
method num_params num_data_points chisqr redchi AIC BIC num_func_eval success key
0 least_squares 6 1 3.136682e+06 -627336.400468 26.958676 14.958676 137 True 0
1 least_squares 6 1 3.060677e+06 -612135.352681 26.934147 14.934147 112 True 1
2 least_squares 6 1 3.260012e+06 -652002.425980 26.997241 14.997241 141 True 2
3 least_squares 6 1 3.096436e+06 -619287.111223 26.945762 14.945762 105 True 3
4 least_squares 6 1 3.097056e+06 -619411.157122 26.945962 14.945962 145 True 4
In [11]:
dt = br.tidy(Results, var_names='dataset')
dt.query('dataset == 0')
Out[11]:
name value min max vary expr stderr init_value dataset
0 a2 0.393271 0.000000 1.000000 True None NaN 0.000000 0
1 ax 0.393271 -inf inf False a2 NaN NaN 0
2 mu1 -1.962723 -5.000000 5.000000 True None NaN -0.201358 0
3 mu2 2.130701 -5.000000 5.000000 True None NaN 0.201358 0
4 sig1 0.498955 0.000001 inf True None NaN 1.732050 0
5 sig2 1.459255 0.000001 inf True None NaN 1.732050 0

Note that while glance returns one row per fit result, the tidy function returns one row per fitted parameter.

We can query the value of one parameter (peak position) across the multiple datasets:

In [12]:
dt.query('name == "mu1"').head()
Out[12]:
name value min max vary expr stderr init_value dataset
2 mu1 -1.962723 -5.0 5.0 True None NaN -0.201358 0
8 mu1 -2.003590 -5.0 5.0 True None NaN -0.201358 1
14 mu1 -1.950642 -5.0 5.0 True None NaN -0.201358 2
20 mu1 -2.015723 -5.0 5.0 True None NaN -0.201358 3
26 mu1 -2.009203 -5.0 5.0 True None NaN -0.201358 4

By computing the standard deviation of the peak positions:

In [13]:
dt.query('name == "mu1"')['value'].std()
Out[13]:
0.024538412206652392
In [14]:
dt.query('name == "mu2"')['value'].std()
Out[14]:
0.12309821388429468

we see that the estimation of mu1 has less error than the estimation of mu2.

This difference can also be observed in the histogram of the fitted values:

In [15]:
dt.query('name == "mu1"')['value'].hist()
dt.query('name == "mu2"')['value'].hist(ax=plt.gca());
_images/notebooks_pybroom-example-multi-datasets-minimize_25_0.png

We can also use pybroom’s tidy_to_dict and dict_to_tidy functions to convert a set of fitted parameters to a dict (and vice-versa):

In [16]:
kwd_params = br.tidy_to_dict(dt.loc[dt['dataset'] == 0])
kwd_params
Out[16]:
{'a2': 0.39327059477457676,
 'ax': 0.39327059477457676,
 'mu1': -1.962722540352284,
 'mu2': 2.1307013636579843,
 'sig1': 0.49895469278155413,
 'sig2': 1.4592545828997032}
In [17]:
br.dict_to_tidy(kwd_params)
Out[17]:
name value
0 a2 0.393271
1 ax 0.393271
2 mu1 -1.962723
3 mu2 2.130701
4 sig1 0.498955
5 sig2 1.459255

This conversion is useful for calling a python function, passing argument values from a tidy DataFrame.

For example, here we use tidy_to_dict to easily plot the model distribution:

In [18]:
bins = np.arange(-5, 5.01, 0.25)
x = bins[:-1] + 0.5*(bins[1] - bins[0])
grid = sns.FacetGrid(ds.query('dataset < 6'), col='dataset', hue='dataset', col_wrap=3)
grid.map(plt.hist, 'data', bins=bins, normed=True);
for i, ax in enumerate(grid.axes):
    kw_pars = br.tidy_to_dict(dt.loc[dt.dataset == i], keys_exclude=['ax'])
    y = model_pdf(x, **kw_pars)
    ax.plot(x, y, lw=2, color='k')
_images/notebooks_pybroom-example-multi-datasets-minimize_30_0.png
Single-peak model

For the sake of the example we also fit the \(N\) datasets with a single Gaussian distribution:

In [19]:
def model_pdf1(x, mu, sig):
    return normpdf(x, mu, sig)

def log_likelihood_lmfit1(params, x):
    return -np.log(model_pdf1(x, **params.valuesdict())).sum()
In [20]:
params = lmfit.Parameters()
params.add('mu', 0, min=-5, max=5)
params.add('sig', 1, min=1e-6)

Results1 = [lmfit.minimize(log_likelihood_lmfit1, params, args=(di,),
                          nan_policy='omit', method='least_squares')
           for di in d]
In [21]:
dg1 = br.glance(Results)
dg1.drop('message', 1).head()
Out[21]:
method num_params num_data_points chisqr redchi AIC BIC num_func_eval success key
0 least_squares 6 1 3.136682e+06 -627336.400468 26.958676 14.958676 137 True 0
1 least_squares 6 1 3.060677e+06 -612135.352681 26.934147 14.934147 112 True 1
2 least_squares 6 1 3.260012e+06 -652002.425980 26.997241 14.997241 141 True 2
3 least_squares 6 1 3.096436e+06 -619287.111223 26.945762 14.945762 105 True 3
4 least_squares 6 1 3.097056e+06 -619411.157122 26.945962 14.945962 145 True 4
In [22]:
dt1 = br.tidy(Results1, var_names='dataset')
dt1.query('dataset == 0')
Out[22]:
name value min max vary expr stderr init_value dataset
0 mu -0.352842 -5.000000 5.000000 True NaN NaN 0.00000 0
1 sig 2.233056 0.000001 inf True NaN NaN 1.73205 0

Augment?

Pybroom's augment function extracts information that has the same size as the input dataset, for example the array of residuals. In this case, however, we performed a scalar minimization (the log-likelihood function returns a scalar), and therefore the MinimizerResult object does not contain any residual array or other data of the same size as the dataset.

Comparing fit results

Instead, we will compare the one-peak and two-peaks distributions using the results from the tidy function obtained in the previous section.

We start with the following plot:

In [23]:
dt['model'] = 'twopeaks'
dt1['model'] = 'onepeak'
dt_tot = pd.concat([dt, dt1], ignore_index=True)
In [24]:
bins = np.arange(-5, 5.01, 0.25)
x = bins[:-1] + 0.5*(bins[1] - bins[0])
grid = sns.FacetGrid(ds.query('dataset < 6'), col='dataset', hue='dataset', col_wrap=3)
grid.map(plt.hist, 'data', bins=bins, normed=True);
for i, ax in enumerate(grid.axes):
    kw_pars = br.tidy_to_dict(dt_tot.loc[(dt_tot.dataset == i) & (dt_tot.model == 'onepeak')])
    y1 = model_pdf1(x, **kw_pars)
    li1, = ax.plot(x, y1, lw=2, color='k', alpha=0.5)
    kw_pars = br.tidy_to_dict(dt_tot.loc[(dt_tot.dataset == i) & (dt_tot.model == 'twopeaks')], keys_exclude=['ax'])
    y = model_pdf(x, **kw_pars)
    li, = ax.plot(x, y, lw=2, color='k')
grid.add_legend(legend_data=dict(onepeak=li1, twopeaks=li),
                label_order=['onepeak', 'twopeaks'], title='model');
_images/notebooks_pybroom-example-multi-datasets-minimize_39_0.png

The problem is that FacetGrid only takes one DataFrame as input. In the previous example we provide the DataFrame of “experimental” data (ds) and use the .map method to plot histograms of the different datasets. The fitted distributions, instead, are plotted manually in the for loop.

We can invert the approach, and pass to FacetGrid the DataFrame of fitted parameters (dt_tot), while leaving the simple histogram for manual plotting. In this case we need to write a helper function (_plot) that knows how to plot a distribution given a set of parameters:

In [25]:
def _plot(names, values, x, label=None, color=None):
    df = pd.concat([names, values], axis=1)
    kw_pars = br.tidy_to_dict(df, keys_exclude=['ax'])
    func = model_pdf1 if label == 'onepeak' else model_pdf
    y = func(x, **kw_pars)
    plt.plot(x, y, lw=2, color=color, label=label)
In [26]:
bins = np.arange(-5, 5.01, 0.25)
x = bins[:-1] + 0.5*(bins[1] - bins[0])
grid = sns.FacetGrid(dt_tot.query('dataset < 6'), col='dataset', hue='model', col_wrap=3)
grid.map(_plot, 'name', 'value', x=x)
grid.add_legend()
for i, ax in enumerate(grid.axes):
    ax.hist(ds.query('dataset == %d' % i).data, bins=bins, histtype='stepfilled', normed=True,
            color='gray', alpha=0.5);
_images/notebooks_pybroom-example-multi-datasets-minimize_42_0.png

For an even better (i.e. simpler) example of plots of fit results see pybroom-example-multi-datasets.

PyBroom Example - Multiple Datasets - Scipy Robust Fit

This notebook is part of pybroom.

This notebook demonstrates using pybroom when fitting a set of curves (curve fitting) using robust fitting and scipy. We will show that pybroom greatly simplifies comparing, filtering and plotting fit results from multiple datasets. See pybroom-example-multi-datasets for an example using lmfit.Model instead of scipy directly.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'  # for hi-dpi displays
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pylab import normpdf
import seaborn as sns
from lmfit import Model
import lmfit
print('lmfit: %s' % lmfit.__version__)
lmfit: 0.9.5
In [2]:
sns.set_style('whitegrid')
In [3]:
import pybroom as br

Create Noisy Data

We start simulating N datasets which are identical except for the additive noise.

In [4]:
N = 200
In [5]:
x = np.linspace(-10, 10, 101)
In [6]:
peak1 = lmfit.models.GaussianModel(prefix='p1_')
peak2 = lmfit.models.GaussianModel(prefix='p2_')
model = peak1 + peak2
In [7]:
#params = model.make_params(p1_amplitude=1.5, p2_amplitude=1,
#                           p1_sigma=1, p2_sigma=1)
In [8]:
Y_data = np.zeros((N, x.size))
Y_data.shape, x.shape
Out[8]:
((200, 101), (101,))
In [9]:
for i in range(Y_data.shape[0]):
    Y_data[i] = model.eval(x=x, p1_center=-1, p2_center=2,
                           p1_sigma=0.5, p2_sigma=1,
                           p1_height=1, p2_height=0.5)
Y_data += np.random.randn(*Y_data.shape)/10

Add some outliers:

In [10]:
num_outliers = int(Y_data.size * 0.05)
idx_ol = np.random.randint(low=0, high=Y_data.size, size=num_outliers)
Y_data.reshape(-1)[idx_ol] = (np.random.rand(num_outliers) - 0.5)*4
In [11]:
plt.plot(x, Y_data.T, 'ok', alpha=0.1);
plt.title('%d simulated datasets, with outliers' % N);
_images/notebooks_pybroom-example-multi-datasets-scipy-robust-fit_13_0.png

Model Fitting

curve_fit()
In [12]:
import scipy.optimize as so
In [13]:
from collections import namedtuple
In [14]:
# Model PDF to be maximized
def model_pdf(x, a1, a2, mu1, mu2, sig1, sig2):
    return (a1 * normpdf(x, mu1, sig1) +
            a2 * normpdf(x, mu2, sig2))
In [15]:
result = so.curve_fit(model_pdf, x, Y_data[0])
/home/docs/checkouts/readthedocs.org/user_builds/pybroom/envs/stable/lib/python3.4/site-packages/scipy/optimize/minpack.py:715: OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)
In [16]:
type(result), type(result[0]), type(result[1])
Out[16]:
(tuple, numpy.ndarray, numpy.ndarray)
In [17]:
result[0]
Out[17]:
array([ 2.45302397,  0.68802988,  0.48958039,  1.87556378,  2.04035127,
        0.00274318])

Using a namedtuple is a clean way to assign names to an array of parameters:

In [18]:
Params = namedtuple('Params', 'a1 a2 mu1 mu2 sig1 sig2')
In [19]:
p = Params(*result[0])
p
Out[19]:
Params(a1=2.4530239688534445, a2=0.68802988315235913, mu1=0.4895803935552317, mu2=1.8755637799433695, sig1=2.040351270505754, sig2=0.0027431809705545929)

Unfortunately, curve_fit does not return much data: just a 2-element tuple containing:

  • the array of best-fit parameters
  • the estimated covariance matrix of the parameters

Therefore curve_fit is not very useful for detailed comparison of fit results. A better interface for curve fitting would be lmfit.Model (see this other notebook).

In the current notebook we keep exploring further options offered by scipy.optimize.

least_squares()

As an example, we use the least_squares function which supports robust loss functions and constraints.

We need to define the residuals:

In [20]:
def residuals(p, x, y):
    return y - model_pdf(x, *p)

Then, we fit the N datasets with different loss functions, storing the results in a dict of lists:

In [21]:
losses = ('linear', 'huber', 'cauchy')
Results = {}
for loss in losses:
    Results[loss] = [so.least_squares(residuals, (1,1,0,1,1,1), args=(x, y), loss=loss, f_scale=0.5)
                     for y in Y_data]
NOTE: For more details on robust fitting and on the different loss functions see Robust nonlinear regression in scipy.
In [22]:
# result = Results['cauchy'][0]
# for k in result.keys():
#     print(k, type(result[k]))

Tidying the results

Now we tidy the results, combining the results for the different loss functions in a single DataFrame.

We start with the glance function, which returns one row per fit result:

In [23]:
dg_tot = br.glance(Results, var_names=['loss', 'dataset'])
dg_tot.head()
Out[23]:
success cost optimality nfev njev message dataset loss
0 True 0.886438 0.000010 15 13 `ftol` termination condition is satisfied. 0 cauchy
1 True 1.427066 0.000009 13 12 `ftol` termination condition is satisfied. 1 cauchy
2 True 1.449007 0.000047 14 13 `ftol` termination condition is satisfied. 2 cauchy
3 True 2.366572 0.000025 14 13 `ftol` termination condition is satisfied. 3 cauchy
4 True 1.264164 0.000006 14 12 `ftol` termination condition is satisfied. 4 cauchy
In [24]:
dg_tot.success.all()
Out[24]:
False

Then we apply tidy, which returns one row per parameter.

Since the OptimizeResult object returned by scipy.optimize only contains an array of parameters, we need to pass the names as an additional argument:

In [25]:
pnames = 'a1 a2 mu1 mu2 sig1 sig2'
dt_tot = br.tidy(Results, var_names=['loss', 'dataset'], param_names=pnames)
dt_tot.head()
Out[25]:
name value grad active_mask dataset loss
0 a1 0.974333 -3.912972e-10 0.0 0 cauchy
1 a2 1.183553 -5.232544e-07 0.0 0 cauchy
2 mu1 -1.005827 -5.739109e-07 0.0 0 cauchy
3 mu2 1.817380 5.058517e-06 0.0 0 cauchy
4 sig1 0.503600 2.930292e-07 0.0 0 cauchy

Finally, we cannot apply the augment function, since the OptimizeResult object does not include much per-data-point information (though it may contain the array of residuals).

Plots

First we plot the peak position and sigmas distributions:

In [26]:
kws = dict(bins = np.arange(-2, 4, 0.1), histtype='step', lw=2)
for loss in losses:
    dt_tot.query('(name == "mu1" or name == "mu2") and loss == "%s"' % loss)['value'].hist(label=loss, **kws)
    kws['ax'] = plt.gca()
plt.title(' Distribution of peaks centers')
plt.legend();
_images/notebooks_pybroom-example-multi-datasets-scipy-robust-fit_39_0.png
In [27]:
kws = dict(bins = np.arange(0, 4, 0.1), histtype='step', lw=2)
for loss in losses:
    dt_tot.query('(name == "sig1" or name == "sig2") and loss == "%s"' % loss)['value'].hist(label=loss, **kws)
    kws['ax'] = plt.gca()
plt.title(' Distribution of peaks sigmas')
plt.legend();
_images/notebooks_pybroom-example-multi-datasets-scipy-robust-fit_40_0.png

A more complete overview for all the fit parameters can be obtained with a factorplot:

In [28]:
sns.factorplot(x='loss', y='value', data=dt_tot, col='name', hue='loss',
               col_wrap=4, kind='point', sharey=False);
_images/notebooks_pybroom-example-multi-datasets-scipy-robust-fit_42_0.png

From all the previous plots we see that, as expected, using robust fitting with stronger damping of outliers (i.e. cauchy vs huber or linear) gives more accurate fit results.

Finally, we can have a peek at the comparison of raw data and fitted models for a few datasets.

Since OptimizeResult does not include “augmented” data, we need to generate it by evaluating the model with the best-fit parameters. We use seaborn’s FacetGrid, passing a custom function _plot for model evaluation:

In [29]:
def _plot(names, values, x, label=None, color=None):
    df = pd.concat([names, values], axis=1)
    kw_pars = br.tidy_to_dict(df)
    y = model_pdf(x, **kw_pars)
    plt.plot(x, y, lw=2, color=color, label=label)
In [30]:
grid = sns.FacetGrid(dt_tot.query('dataset < 9'), col='dataset', hue='loss', col_wrap=3)
grid.map(_plot, 'name', 'value', x=x)
grid.add_legend()
for i, ax in enumerate(grid.axes):
    ax.plot(x, Y_data[i], 'o', ms=3, color='k')
plt.ylim(-1, 1.5)
Out[30]:
(-1, 1.5)
_images/notebooks_pybroom-example-multi-datasets-scipy-robust-fit_45_1.png

For comparison, the ModelResult object returned by lmfit contains not only the evaluated model but also the evaluation of the single components (each peak in this case). Therefore the above plot can be generated more straightforwardly using the “augmented” data. See the notebook pybroom-example-multi-datasets for an example.
