Welcome to pybroom’s documentation!¶
Latest Version: 0.2
Pybroom is a small Python 3+ library for converting collections of fit results (curve fitting or other optimizations) to Pandas DataFrames in tidy format (or long-form) (Wickham 2014). Once fit results are in tidy DataFrames, it is possible to leverage common patterns for tidy data analysis. Furthermore, powerful visual exploration using multi-facet plots becomes easy thanks to libraries like seaborn that natively support tidy DataFrames.
Installation¶
You can install pybroom from PyPI using the following command:
pip install pybroom
or from conda-forge using:
conda install -c conda-forge pybroom
Dependencies are python 3.4+, pandas and lmfit (0.9.5+, which in turn requires scipy). In addition, matplotlib and seaborn are strongly recommended (and necessary to run the example notebooks).
Why Pybroom?¶
The Problem¶
DataFrames in tidy format (or long-form) follow a simple rule: one “observation” per row and one “variable” per column. This simple structure makes it easy to process the data with clear and well-understood idioms (filtering, aggregation, etc.) and allows plot libraries to automatically generate complex plots in which many variables are compared. Plotting libraries supporting tidy DataFrames include seaborn, recent versions of matplotlib, bokeh and altair.
But, while data is oftentimes represented in tidy DataFrames, fit results are usually stored in a variety of custom objects and are harder to manipulate, compare and plot.
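For instance, here is a minimal sketch (with made-up numbers) of how pandas reshapes a “wide” table, where each model’s values occupy a separate column, into the tidy layout:
>>> import pandas as pd
>>> wide = pd.DataFrame({'x': [0, 1], 'model_A': [0.1, 0.2], 'model_B': [0.3, 0.4]})
>>> pd.melt(wide, id_vars='x', var_name='model', value_name='y')
   x    model    y
0  0  model_A  0.1
1  1  model_A  0.2
2  0  model_B  0.3
3  1  model_B  0.4
In the tidy version every row is one observation, so filtering, grouping and faceting by model become one-liners.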
Pybroom to the rescue!¶
Pybroom allows converting several types of fit results to tidy DataFrames and is particularly useful for handling collections of such fit results. Pybroom development was inspired by the R library broom. You can watch this video for details of the philosophy behind broom (and, by extension, pybroom).
Like the R library broom, pybroom provides 3 functions: glance, tidy and augment. The glance function returns fit statistics, one for each fit result (e.g. fit method, number of iterations, chi-square etc.). The tidy function returns data for each fitted parameter (e.g. fitted value, gradient, bounds, etc.). The augment function returns data with the same size as the fitted data points (evaluated best-fit model, residuals, etc.). Additionally, pybroom has two functions, tidy_to_dict and dict_to_tidy, for conversion between dictionaries and 2-column tidy DataFrames.
Collections of fit results can be in a list, a dict, or any nested dict/list combination. When a collection of fit results is used as input, pybroom functions return a DataFrame with additional “categorical” column(s) containing the dict keys or the list index.
Currently, the supported fit result objects are:
- scipy.optimize.OptimizeResult returned by several functions in scipy.optimize;
- lmfit.model.ModelResult (returned by lmfit.Model.fit());
- lmfit.minimizer.MinimizerResult (returned by lmfit.minimize()).
Note that the 3 functions (glance, tidy and augment) are implemented only for the fit-result objects for which they are relevant. For example, augment cannot process lmfit’s MinimizerResult or scipy’s OptimizeResult because these objects carry little or no per-data-point information.
Support for result objects from other libraries such as sklearn can be added based on user request (PR welcome!).
Pybroom’s Release Notes¶
Version 0.2¶
Improved support for scipy.optimize fit result.
In addition to lists of fit results, pybroom now supports:
- dicts of fit results,
- dicts of lists of fit results,
- any other nested combination.
When the input contains a dict, pybroom adds a “key” column of type pandas.Categorical. When the input contains a list, pybroom adds a “key” column (i.e. the list index) of type int64.
Updated and expanded documentation and notebooks.
Version 0.1¶
- Support lmfit and scipy.optimize fit results.
- Support lists of fit results.
Pybroom API Documentation¶
This module contains the 3 main pybroom functions:
These functions take one or multiple fit results as input and return a “tidy” (or long-form) DataFrame. The glance function returns fit statistics, one for each fit result (e.g. fit method, number of iterations, chi-square etc.). The tidy function returns data for each fitted parameter (e.g. fitted value, gradient, bounds, etc.). The augment function returns data with the same size as the fitted data points (evaluated best-fit model, residuals, etc.).
In the case of multiple fit results, pybroom functions accept a list, a dict or a nested structure of dicts and lists (for example a dict of lists of fit results). The example below shows some use cases.
Note
pybroom functions are particularly convenient when tidying a collection of fit results. The following examples are valid for all 3 pybroom functions. If results is a list of fit results (e.g. for different data replicates), the returned DataFrame will have an additional “index” column containing the index of the dataset in the list. If results is a dict of fit results (e.g. results from different fit methods or models on the same dataset), then the “index” column contains the keys of the dict (each key identifies a fit result). In the previous two examples, var_names should contain the name of the “index” column (a string). Nested structures are also possible. For example, when fitting a list of datasets with different methods, we can build a dict of lists of fit results where the dict keys are the method names and the items in the lists are fit results for the different datasets. In this case the returned DataFrame has two additional “index” columns: one with the dict keys and one with the list index. The tuple (key, list index) identifies each single fit result. In this case var_names should be a list of column names for the key and index columns respectively (a list of strings).
Example
The following example shows pybroom output when multiple fit results are used. The glance function is used as an example, but the same logic (and input arguments) also applies to tidy and augment.
Input is a list of fit results:
>>> results = [fit_res1, fit_res2, fit_res3]
>>> br.glance(results, var_names='dataset')
num_params num_data_points redchi AIC dataset
0 6 101 0.00911793 -468.634 0
1 6 101 0.00996431 -459.669 1
2 6 101 0.0109456 -450.183 2
Input is a dict of fit results:
>>> results = {'A': fit_res1, 'B': fit_res2, 'C': fit_res3}
>>> br.glance(results, var_names='function')
num_params num_data_points redchi AIC function
0 6 101 0.00911793 -468.634 A
1 6 101 0.00996431 -459.669 B
2 6 101 0.0109456 -450.183 C
Input is a dict of lists of fit results:
>>> results = {'A': [fit_res1, fit_res2], 'B': [fit_res3, fit_res4]}
>>> br.glance(results, var_names=['function', 'dataset'])
num_params num_data_points redchi AIC dataset function
0 6 101 0.00911793 -468.634 0 A
1 6 101 0.00996431 -459.669 1 A
2 6 101 0.0109456 -450.183 0 B
3 6 101 0.0176529 -401.908 1 B
Main Functions¶
The 3 high-level functions glance(), tidy() and augment() allow tidying one or more fit results.
These are pybroom’s most generic functions, accepting all the supported fit result objects, as well as a list/dict of such objects.
See also the examples at the beginning of this page and the example notebooks.
pybroom.glance(results, var_names='key', **kwargs)¶
Tidy DataFrame containing fit summaries from results.
A function to tidy any of the supported fit result objects (or a list of fit results). This function identifies the input type and calls the corresponding “specialized” tidying function. When the input is a list, the returned DataFrame contains data from all the fit results. Supported fit result objects are lmfit.ModelResult, lmfit.MinimizerResult and scipy.optimize.OptimizeResult.
Parameters: - result (fit result object or list) – one of the supported fit result objects or a list of supported fit result objects. When a list, all the elements need to be of the same type.
- var_names (string or list) – name(s) of the column(s) containing an “index” that is different for each element in the set of fit results.
- **kwargs – additional arguments passed to the underlying specialized tidying function.
Returns: A DataFrame with one row for each passed fit result. Columns include fit summaries such as reduced chi-square, number of function evaluations, successful convergence, AIC, BIC, etc. When a list of fit-result objects is passed, the column named by var_names (‘key’ by default) contains the index of the object in the list.
See also
For more details on the returned DataFrame and on additional arguments refer to the specialized tidying functions glance_lmfit_result() and glance_scipy_result().
pybroom.tidy(result, var_names='key', **kwargs)¶
Tidy DataFrame containing fitted parameter data from result.
A function to tidy any of the supported fit result objects (or a list of fit results). This function identifies the input type and calls the corresponding “specialized” tidying function. When the input is a list, the returned DataFrame contains data from all the fit results. Supported fit result objects are lmfit.ModelResult, lmfit.MinimizerResult and scipy.optimize.OptimizeResult.
Parameters: - result (fit result object or list) – one of the supported fit result objects or a list of supported fit result objects. When a list, all the elements need to be of the same type.
- var_names (string or list) – name(s) of the column(s) containing an “index” that is different for each element in the set of fit results.
- param_names (string or list of string) – names of the fitted parameters for fit results which don’t include parameters’ names (such as scipy’s OptimizeResult). It can either be a list of strings or a single string with space-separated names.
- **kwargs – additional arguments passed to the underlying specialized tidying function.
Returns: A DataFrame with one row for each fitted parameter. Columns include parameter properties such as best-fit value, standard error, bounds/constraints (if any), etc. When a list of fit-result objects is passed, the column named by var_names (‘key’ by default) contains the index of the object in the list.
See also
For more details on the returned DataFrame and on additional arguments refer to the specialized tidying functions tidy_lmfit_result() and tidy_scipy_result().
pybroom.augment(results, var_names='key', **kwargs)¶
Tidy DataFrame containing fit data from results.
A function to tidy any of the supported fit result objects (or a list of fit results). This function identifies the input type and calls the corresponding “specialized” tidying function. When the input is a list or a dict of fit results, the returned DataFrame contains data from all the fit results. In this case data from different fit results is identified by the values in the additional “index” (or categorical) column(s) whose name(s) are specified in var_names.
Parameters: - results (fit result object or list) – one of the supported fit result objects or a list of supported fit result objects. When a list, all the elements need to be of the same type.
- var_names (string or list) – name(s) of the column(s) containing an “index” that is different for each element in the set of fit results. See the example section below.
- **kwargs – additional arguments passed to the underlying specialized tidying function.
Returns: A DataFrame with one row for each data point used in the fit. It contains the input data, the model evaluated at the data points with the best-fit parameters, error ranges, etc. When a list of fit-result objects is passed, the column named by var_names (‘key’ by default) contains the index of the object in the list.
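A minimal usage sketch (fit_res1 and fit_res2 are placeholders for lmfit ModelResult objects, as in the examples at the top of this page; the column names shown are those produced for lmfit results):
>>> da = br.augment([fit_res1, fit_res2], var_names='dataset')
>>> da.query('dataset == 0').plot(x='x', y=['data', 'best_fit'])  # inspect one dataset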
Dictionary conversions¶
The two functions tidy_to_dict()
and dict_to_tidy()
provide
the ability to convert a tidy DataFrame to and from a python dictionary.
pybroom.tidy_to_dict(df, key='name', value='value', keys_exclude=None, cast_value=float)¶
Convert a tidy DataFrame into a dictionary.
This function converts two columns from an input tidy (or long-form) DataFrame into a dictionary. A typical use-case is passing parameters stored in a tidy DataFrame to a python function. The arguments key and value contain the names of the DataFrame columns containing the keys and the values of the dictionary.
Parameters: - df (pandas.DataFrame) – the “tidy” DataFrame containing the data. Two columns of this DataFrame should contain the keys and the values to construct the dictionary.
- key (string or scalar) – name of the DataFrame column containing the keys of the dictionary.
- value (string or scalar) – name of the DataFrame column containing the values of the dictionary.
- keys_exclude (iterable or None) – list of keys excluded when building the returned dictionary.
- cast_value (callable or None) – callable used to cast the value of each item in the dictionary. If None, no casting is performed and the resulting values are 1-element pandas.Series. Default is the python built-in float. Other typical values may be int or str.
Returns: A dictionary with keys and values extracted from the input (tidy) DataFrame.
See also: dict_to_tidy().
pybroom.dict_to_tidy(dc, key='name', value='value', keys_exclude=None)¶
Convert a dictionary into a tidy DataFrame.
This function converts a dictionary into a “tidy” (or long-form) DataFrame with two columns: one containing the keys and the other containing the values from the dictionary. The names of the columns can be specified with the key and value arguments.
Parameters: - dc (dict) – the input dictionary used to build the DataFrame.
- key (string or scalar) – name of the DataFrame column containing the keys of the dictionary.
- value (string or scalar) – name of the DataFrame column containing the values of the dictionary.
- keys_exclude (iterable or None) – list of keys excluded when building the returned DataFrame.
Returns: A two-column tidy DataFrame containing the data in the dictionary.
See also: tidy_to_dict().
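A round-trip sketch (made-up parameter values):
>>> df = br.dict_to_tidy({'mu': 0.5, 'sig': 1.2})  # columns: 'name', 'value'
>>> br.tidy_to_dict(df)  # values are cast to float by default
{'mu': 0.5, 'sig': 1.2}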
Specialized functions¶
These are the specialized (i.e. low-level) functions, each converting one specific object to a tidy DataFrame.
pybroom.glance_scipy_result(result)¶
Tidy summary statistics from scipy’s OptimizeResult.
Normally this function is not called directly but invoked by the general-purpose function glance().
Parameters: result (OptimizeResult) – the fit result object.
Returns: A DataFrame in tidy format with one row and several summary statistics as columns.
Note
Possible columns of the returned DataFrame include:
- success (bool): whether the fit succeeded
- cost (float): value of the cost function at the solution
- optimality (float): optimality parameter as returned by scipy.optimize.least_squares.
- nfev (int): number of objective function evaluations
- njev (int): number of jacobian function evaluations
- nit (int): number of iterations
- status (int): status returned by the fit routine
- message (string): message returned by the fit routine
pybroom.tidy_scipy_result(result, param_names, **kwargs)¶
Tidy parameters data from scipy’s OptimizeResult.
Normally this function is not called directly but invoked by the general-purpose function tidy(). Since OptimizeResult contains a raw array of fitted parameters but no parameter names, the names need to be passed via param_names.
Parameters: - result (OptimizeResult) – the fit result object.
- param_names (string or list of string) – names of the fitted parameters. It can either be a list of strings or a single string with space-separated names.
Returns: A DataFrame in tidy format with one row for each parameter.
Note
These two columns are always present in the returned DataFrame:
- name (string): name of the parameter.
- value (number): value of the parameter after the optimization.
Optional columns (depending on the type of result) are:
- grad (float): gradient for each parameter
- active_mask (int): whether each parameter is at a boundary (as returned by scipy.optimize.least_squares)
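A sketch of the param_names mechanism, using a toy linear model with made-up data (assumes pybroom is imported as br):
>>> import numpy as np
>>> import scipy.optimize as so
>>> x = np.array([0., 1., 2.])
>>> y = np.array([0.1, 1.1, 1.9])
>>> res = so.least_squares(lambda p: y - (p[0]*x + p[1]), x0=(1., 0.))
>>> br.tidy_scipy_result(res, param_names='slope intercept')  # one row per named parameter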
pybroom.glance_lmfit_result(result)¶
Tidy summary statistics from lmfit’s ModelResult or MinimizerResult.
Normally this function is not called directly but invoked by the general-purpose function glance().
Parameters: result (ModelResult or MinimizerResult) – the fit result object.
Returns: A DataFrame in tidy format with one row and several summary statistics as columns.
Note
The columns of the returned DataFrame are:
- model (string): model name (only for ModelResult)
- method (string): method used for the optimization (e.g. leastsq).
- num_params (int): number of varied parameters
- ndata (int): number of data points
- chisqr (float): chi-square statistics.
- redchi (float): reduced chi-square statistics.
- AIC (float): Akaike Information Criterion statistics.
- BIC (float): Bayes Information Criterion statistics.
- num_func_eval (int): number of evaluations of the objective function during the fit.
- num_data_points (int): number of data points (e.g. samples) used for the fit.
pybroom.tidy_lmfit_result(result)¶
Tidy parameters from lmfit’s ModelResult or MinimizerResult.
Normally this function is not called directly but invoked by the general-purpose function tidy().
Parameters: result (ModelResult or MinimizerResult) – the fit result object.
Returns: A DataFrame in tidy format with one row for each parameter.
Note
The (possible) columns of the returned DataFrame are:
- name (string): name of the parameter.
- value (number): value of the parameter after the optimization.
- init_value (number): initial value of the parameter before the optimization.
- min, max (numbers): bounds of the parameter
- vary (bool): whether the parameter has been varied during the optimization.
- expr (string): constraint expression for the parameter.
- stderr (float): standard error for the parameter.
pybroom._augment_lmfit_modelresult(result)¶
Tidy data values and fitted model from lmfit.model.ModelResult.
PyBroom Example - Simple¶
This notebook is part of pybroom.
This notebook shows the simplest usage of pybroom when performing a curve fit of a single dataset. Possible applications are only hinted at. For a more complex (and more interesting!) example using multiple datasets see pybroom-example-multi-datasets.
In [1]:
import numpy as np
from numpy import sqrt, pi, exp, linspace
from lmfit import Model
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina' # for hi-dpi displays
In [2]:
import lmfit
print('lmfit: %s' % lmfit.__version__)
lmfit: 0.9.5
In [3]:
import pybroom as br
Create Noisy Data¶
In [4]:
x = np.linspace(-10, 10, 101)
In [5]:
peak1 = lmfit.models.GaussianModel(prefix='p1_')
peak2 = lmfit.models.GaussianModel(prefix='p2_')
model = peak1 + peak2
In [6]:
params = model.make_params(p1_amplitude=1, p2_amplitude=1,
p1_sigma=1, p2_sigma=1)
In [7]:
y_data = model.eval(x=x, p1_center=-1, p2_center=2, p1_sigma=0.5, p2_sigma=1, p1_amplitude=1, p2_amplitude=2)
y_data.shape
Out[7]:
(101,)
In [8]:
y_data += np.random.randn(*y_data.shape)/10
In [9]:
plt.plot(x, y_data)
Out[9]:
[<matplotlib.lines.Line2D at 0x7fd5ee324d30>]
[Figure: plot of the simulated noisy data]
Model Fitting¶
In [10]:
params = model.make_params(p1_center=0, p2_center=3,
p1_sigma=0.5, p2_sigma=1,
p1_amplitude=1, p2_amplitude=2)
result = model.fit(y_data, x=x, params=params)
The fit result from an lmfit Model can be inspected with fit_report or params.pretty_print():
In [11]:
print(result.fit_report())
[[Model]]
(Model(gaussian, prefix='p1_') + Model(gaussian, prefix='p2_'))
[[Fit Statistics]]
# function evals = 74
# data points = 101
# variables = 6
chi-square = 0.948
reduced chi-square = 0.010
Akaike info crit = -459.550
Bayesian info crit = -443.859
[[Variables]]
p1_sigma: 0.50381801 +/- 0.041780 (8.29%) (init= 0.5)
p1_center: -0.94157635 +/- 0.041206 (4.38%) (init= 0)
p1_amplitude: 1.04915437 +/- 0.075032 (7.15%) (init= 1)
p2_sigma: 0.85357363 +/- 0.051774 (6.07%) (init= 1)
p2_center: 2.03505014 +/- 0.050235 (2.47%) (init= 3)
p2_amplitude: 1.88623359 +/- 0.097151 (5.15%) (init= 2)
p1_fwhm: 1.18640072 +/- 0.098384 (8.29%) == '2.3548200*p1_sigma'
p1_height: 0.83076041 +/- 0.057986 (6.98%) == '0.3989423*p1_amplitude/max(1.e-15, p1_sigma)'
p2_fwhm: 2.01001225 +/- 0.121920 (6.07%) == '2.3548200*p2_sigma'
p2_height: 0.88158577 +/- 0.044848 (5.09%) == '0.3989423*p2_amplitude/max(1.e-15, p2_sigma)'
[[Correlations]] (unreported correlations are < 0.100)
C(p1_sigma, p1_amplitude) = 0.600
C(p2_sigma, p2_amplitude) = 0.599
C(p1_sigma, p2_sigma) = -0.214
C(p1_amplitude, p2_sigma) = -0.203
C(p1_center, p2_sigma) = -0.163
C(p1_sigma, p2_amplitude) = -0.158
C(p1_amplitude, p2_amplitude) = -0.146
C(p1_sigma, p2_center) = 0.116
C(p1_center, p2_amplitude) = -0.116
C(p1_amplitude, p2_center) = 0.101
In [12]:
result.params.pretty_print()
Name Value Min Max Stderr Vary Expr
p1_amplitude 1.049 -inf inf 0.07503 True None
p1_center -0.9416 -inf inf 0.04121 True None
p1_fwhm 1.186 -inf inf 0.09838 False 2.3548200*p1_sigma
p1_height 0.8308 -inf inf 0.05799 False 0.3989423*p1_amplitude/max(1.e-15, p1_sigma)
p1_sigma 0.5038 0 inf 0.04178 True None
p2_amplitude 1.886 -inf inf 0.09715 True None
p2_center 2.035 -inf inf 0.05024 True None
p2_fwhm 2.01 -inf inf 0.1219 False 2.3548200*p2_sigma
p2_height 0.8816 -inf inf 0.04485 False 0.3989423*p2_amplitude/max(1.e-15, p2_sigma)
p2_sigma 0.8536 0 inf 0.05177 True None
These methods are convenient, but extracting the data from the lmfit object requires some work and knowledge of the lmfit object structure.
pybroom comes to help, extracting data from fit results and returning a pandas DataFrame in tidy format that can be much more easily manipulated, filtered and plotted.
Glance¶
Glancing at the fit results (dropping some verbose columns):
In [13]:
dg = br.glance(result)
dg.drop('model', 1).drop('message', 1)
Out[13]:
method | num_params | num_data_points | chisqr | redchi | AIC | BIC | num_func_eval | success | |
---|---|---|---|---|---|---|---|---|---|
0 | leastsq | 6 | 101 | 0.947728 | 0.009976 | -459.549628 | -443.858905 | 74 | True |
The glance function returns a DataFrame with one row per fit-result object.
Application Idea¶
If you fit N models to the same dataset you can compare fit statistics such as reduced-\(\chi^2\). Or, fitting several datasets with several methods, you can study the convergence properties using reduced-\(\chi^2\), the number of function evaluations and the success rate. A sketch of the first idea follows.
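For example (result1 and result2 stand in for fits of two candidate models to the same data):
dg = br.glance({'model1': result1, 'model2': result2}, var_names='candidate')
dg.sort_values('AIC')[['candidate', 'redchi', 'AIC', 'BIC']]  # lower AIC/BIC is better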
Tidy¶
Tidy fit results for all the parameters:
In [14]:
dt = br.tidy(result)
dt
Out[14]:
name | value | min | max | vary | expr | stderr | init_value | |
---|---|---|---|---|---|---|---|---|
0 | p1_amplitude | 1.049154 | -inf | inf | True | None | 0.075033 | 1.0 |
1 | p1_center | -0.941576 | -inf | inf | True | None | 0.041206 | 0.0 |
2 | p1_fwhm | 1.186401 | -inf | inf | False | 2.3548200*p1_sigma | 0.098385 | NaN |
3 | p1_height | 0.830760 | -inf | inf | False | 0.3989423*p1_amplitude/max(1.e-15, p1_sigma) | 0.057987 | NaN |
4 | p1_sigma | 0.503818 | 0.000000 | inf | True | None | 0.041780 | 0.5 |
5 | p2_amplitude | 1.886234 | -inf | inf | True | None | 0.097151 | 2.0 |
6 | p2_center | 2.035050 | -inf | inf | True | None | 0.050235 | 3.0 |
7 | p2_fwhm | 2.010012 | -inf | inf | False | 2.3548200*p2_sigma | 0.121920 | NaN |
8 | p2_height | 0.881586 | -inf | inf | False | 0.3989423*p2_amplitude/max(1.e-15, p2_sigma) | 0.044849 | NaN |
9 | p2_sigma | 0.853574 | 0.000000 | inf | True | None | 0.051775 | 1.0 |
The tidy function returns one row for each parameter.
In [15]:
dt.loc[dt.name == 'p1_center']
Out[15]:
name | value | min | max | vary | expr | stderr | init_value | |
---|---|---|---|---|---|---|---|---|
1 | p1_center | -0.941576 | -inf | inf | True | None | 0.041206 | 0.0 |
Augment¶
Tidy DataFrame with data as a function of the independent variable (‘x’). Columns include the data being fitted, the best fit, the best-fit components, residuals, etc.
In [16]:
da = br.augment(result)
da.head()
Out[16]:
x | data | best_fit | residual | Model(gaussian, prefix='p1_') | Model(gaussian, prefix='p2_') | |
---|---|---|---|---|---|---|
0 | -10.0 | -0.056887 | 5.979251e-44 | 0.056887 | 5.290582e-71 | 5.979251e-44 |
1 | -9.8 | 0.012225 | 1.583027e-42 | -0.012225 | 6.151539e-68 | 1.583027e-42 |
2 | -9.6 | -0.028639 | 3.967227e-41 | 0.028639 | 6.109787e-65 | 3.967227e-41 |
3 | -9.4 | -0.037673 | 9.411146e-40 | 0.037673 | 5.183588e-62 | 9.411146e-40 |
4 | -9.2 | 0.014174 | 2.113270e-38 | -0.014174 | 3.756617e-59 | 2.113270e-38 |
The augment function returns one row for each data point.
In [17]:
d = br.augment(result)
In [18]:
fig, ax = plt.subplots(2, 1, figsize=(7, 8))
ax[1].plot('x', 'data', data=d, marker='o', ls='None')
ax[1].plot('x', "Model(gaussian, prefix='p1_')", data=d, lw=2, ls='--')
ax[1].plot('x', "Model(gaussian, prefix='p2_')", data=d, lw=2, ls='--')
ax[1].plot('x', 'best_fit', data=d, lw=2)
ax[0].plot('x', 'residual', data=d);
[Figure: residuals (top) and data with best fit and components (bottom)]
Application Idea¶
Fitting N datasets with the same model (or N models with the same dataset) you can automatically build a panel plot with seaborn using the dataset (or the model) as a categorical variable. This example is illustrated in pybroom-example-multi-datasets.
PyBroom Example - Multiple Datasets¶
This notebook is part of pybroom.
This notebook demonstrates using pybroom when fitting a set of curves (curve fitting) using lmfit.Model. We will show that pybroom greatly simplifies comparing, filtering and plotting fit results from multiple datasets.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina' # for hi-dpi displays
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from lmfit import Model
import lmfit
print('lmfit: %s' % lmfit.__version__)
lmfit: 0.9.5
In [2]:
sns.set_style('whitegrid')
In [3]:
import pybroom as br
Create Noisy Data¶
We start by simulating N datasets which are identical except for the additive noise.
In [4]:
N = 20
In [5]:
x = np.linspace(-10, 10, 101)
In [6]:
peak1 = lmfit.models.GaussianModel(prefix='p1_')
peak2 = lmfit.models.GaussianModel(prefix='p2_')
model = peak1 + peak2
In [7]:
#params = model.make_params(p1_amplitude=1.5, p2_amplitude=1,
# p1_sigma=1, p2_sigma=1)
In [8]:
Y_data = np.zeros((N, x.size))
Y_data.shape
Out[8]:
(20, 101)
In [9]:
for i in range(Y_data.shape[0]):
Y_data[i] = model.eval(x=x, p1_center=-1, p2_center=2,
p1_sigma=0.5, p2_sigma=1.5,
p1_height=1, p2_height=0.5)
Y_data += np.random.randn(*Y_data.shape)/10
In [10]:
plt.plot(x, Y_data.T, '-k', alpha=0.1);
[Figure: the 20 simulated noisy datasets]
Model Fitting¶
Single-peak model¶
Define and fit a single Gaussian model to the \(N\) datasets:
In [11]:
model1 = lmfit.models.GaussianModel()
In [12]:
Results1 = [model1.fit(y, x=x) for y in Y_data]
Two-peaks model¶
Here, instead, we use a more appropriate Gaussian mixture model.
To fit the noisy data, the residuals (the difference between model and data) are minimized in the least-squares sense.
In [13]:
params = model.make_params(p1_center=0, p2_center=3,
p1_sigma=0.5, p2_sigma=1,
p1_amplitude=1, p2_amplitude=2)
In [14]:
Results = [model.fit(y, x=x, params=params) for y in Y_data]
Fit results from an lmfit Model can be inspected with fit_report or params.pretty_print():
In [15]:
#print(Results[0].fit_report())
#Results[0].params.pretty_print()
This is good for peeking at the results. However, extracting these data from lmfit objects is quite a chore and requires good knowledge of the lmfit object structure.
pybroom helps in this task: it extracts data from fit results and returns a familiar pandas DataFrame (in tidy format). Thanks to the tidy format these data can be much more easily manipulated, filtered and plotted.
Glance¶
A summary of the two-peaks model fit:
In [16]:
dg = br.glance(Results, var_names='dataset')
dg.drop('model', 1).drop('message', 1).head()
Out[16]:
method | num_params | num_data_points | chisqr | redchi | AIC | BIC | num_func_eval | success | dataset | |
---|---|---|---|---|---|---|---|---|---|---|
0 | leastsq | 6 | 101 | 1.029009 | 0.010832 | -451.238919 | -435.548196 | 108 | True | 0 |
1 | leastsq | 6 | 101 | 0.772442 | 0.008131 | -480.205143 | -464.514420 | 67 | True | 1 |
2 | leastsq | 6 | 101 | 0.679224 | 0.007150 | -493.194374 | -477.503651 | 108 | True | 2 |
3 | leastsq | 6 | 101 | 1.173722 | 0.012355 | -437.949027 | -422.258304 | 67 | True | 3 |
4 | leastsq | 6 | 101 | 0.899617 | 0.009470 | -464.811547 | -449.120824 | 108 | True | 4 |
A summary of the one-peak model fit:
In [17]:
dg1 = br.glance(Results1, var_names='dataset')
dg1.drop('model', 1).drop('message', 1).head()
Out[17]:
method | num_params | num_data_points | chisqr | redchi | AIC | BIC | num_func_eval | success | dataset | |
---|---|---|---|---|---|---|---|---|---|---|
0 | leastsq | 3 | 101 | 1.839862 | 0.018774 | -398.548436 | -390.703075 | 95 | True | 0 |
1 | leastsq | 3 | 101 | 1.734223 | 0.017696 | -404.520681 | -396.675319 | 51 | True | 1 |
2 | leastsq | 3 | 101 | 2.017742 | 0.020589 | -389.227306 | -381.381945 | 79 | True | 2 |
3 | leastsq | 3 | 101 | 2.475818 | 0.025263 | -368.563507 | -360.718145 | 71 | True | 3 |
4 | leastsq | 3 | 101 | 1.997141 | 0.020379 | -390.263766 | -382.418405 | 87 | True | 4 |
Tidy¶
Tidy fit results for all the parameters:
In [18]:
dt = br.tidy(Results, var_names='dataset')
Let’s see the results for a single dataset:
In [19]:
dt.query('dataset == 0')
Out[19]:
name | value | min | max | vary | expr | stderr | init_value | dataset | |
---|---|---|---|---|---|---|---|---|---|
0 | p1_amplitude | 1.058758 | -inf | inf | True | None | 0.101663 | 1.0 | 0 |
1 | p1_center | -0.919476 | -inf | inf | True | None | 0.055832 | 0.0 | 0 |
2 | p1_fwhm | 1.332050 | -inf | inf | False | 2.3548200*p1_sigma | 0.132812 | NaN | 0 |
3 | p1_height | 0.746697 | -inf | inf | False | 0.3989423*p1_amplitude/max(1.e-15, p1_sigma) | 0.057760 | NaN | 0 |
4 | p1_sigma | 0.565670 | 0.000000 | inf | True | None | 0.056400 | 0.5 | 0 |
5 | p2_amplitude | 0.843828 | -inf | inf | True | None | 0.137254 | 2.0 | 0 |
6 | p2_center | 2.098743 | -inf | inf | True | None | 0.216380 | 3.0 | 0 |
7 | p2_fwhm | 2.822411 | -inf | inf | False | 2.3548200*p2_sigma | 0.565785 | NaN | 0 |
8 | p2_height | 0.280867 | -inf | inf | False | 0.3989423*p2_amplitude/max(1.e-15, p2_sigma) | 0.040433 | NaN | 0 |
9 | p2_sigma | 1.198568 | 0.000000 | inf | True | None | 0.240267 | 1.0 | 0 |
or for a single parameter across datasets:
In [20]:
dt.query('name == "p1_center"').head()
Out[20]:
name | value | min | max | vary | expr | stderr | init_value | dataset | |
---|---|---|---|---|---|---|---|---|---|
1 | p1_center | -0.919476 | -inf | inf | True | None | 0.055832 | 0.0 | 0 |
11 | p1_center | -0.920288 | -inf | inf | True | None | 0.045847 | 0.0 | 1 |
21 | p1_center | -1.073648 | -inf | inf | True | None | 0.034499 | 0.0 | 2 |
31 | p1_center | -1.004293 | -inf | inf | True | None | 0.050047 | 0.0 | 3 |
41 | p1_center | -0.834925 | -inf | inf | True | None | 0.054239 | 0.0 | 4 |
In [21]:
dt.query('name == "p1_center"')['value'].std()
Out[21]:
0.061450737267381442
In [22]:
dt.query('name == "p2_center"')['value'].std()
Out[22]:
0.33795899893243669
Note that there is a much larger error in fitting p2_center than p1_center.
In [23]:
dt.query('name == "p1_center"')['value'].hist()
dt.query('name == "p2_center"')['value'].hist(ax=plt.gca());
[Figure: histograms of fitted p1_center and p2_center values]
Augment¶
Tidy DataFrame with data as a function of the independent variable (‘x’). Columns include the data being fitted, the best fit, the best-fit components, residuals, etc.
In [24]:
da = br.augment(Results, var_names='dataset')
In [25]:
da1 = br.augment(Results1, var_names='dataset')
In [26]:
r = Results[0]
Let’s see the results for a single dataset:
In [27]:
da.query('dataset == 0').head()
Out[27]:
x | data | best_fit | residual | Model(gaussian, prefix='p1_') | Model(gaussian, prefix='p2_') | dataset | |
---|---|---|---|---|---|---|---|
0 | -10.0 | 0.065218 | 2.099674e-23 | -0.065218 | 8.253463e-57 | 2.099674e-23 | 0 |
1 | -9.8 | 0.200365 | 1.115915e-22 | -0.200365 | 2.261483e-54 | 1.115915e-22 | 0 |
2 | -9.6 | -0.167772 | 5.767901e-22 | 0.167772 | 5.468405e-52 | 5.767901e-22 | 0 |
3 | -9.4 | -0.044383 | 2.899425e-21 | 0.044383 | 1.166912e-49 | 2.899425e-21 | 0 |
4 | -9.2 | -0.126514 | 1.417468e-20 | 0.126514 | 2.197484e-47 | 1.417468e-20 | 0 |
Plotting a single dataset is simplified compared to a manual plot:
In [28]:
da0 = da.query('dataset == 0')
In [29]:
plt.plot('x', 'data', data=da0, marker='o', ls='None')
plt.plot('x', "Model(gaussian, prefix='p1_')", data=da0, lw=2, ls='--')
plt.plot('x', "Model(gaussian, prefix='p2_')", data=da0, lw=2, ls='--')
plt.plot('x', 'best_fit', data=da0, lw=2);
plt.legend()
Out[29]:
<matplotlib.legend.Legend at 0x7f76ef370eb8>
[Figure: data, best-fit components and best fit for dataset 0]
But keep in mind that, for a single dataset, we could use the lmfit method as well (which is even simpler):
In [30]:
Results[0].plot_fit();
[Figure: lmfit plot_fit output for dataset 0]
However, things become much more interesting when we want to plot multiple datasets or models as in the next section.
Comparison of different datasets¶
In [31]:
grid = sns.FacetGrid(da.query('dataset < 6'), col="dataset", hue="dataset", col_wrap=3)
grid.map(plt.plot, 'x', 'data', marker='o', ls='None', ms=3, color='k')
grid.map(plt.plot, 'x', "Model(gaussian, prefix='p1_')", ls='--')
grid.map(plt.plot, 'x', "Model(gaussian, prefix='p2_')", ls='--')
grid.map(plt.plot, "x", "best_fit");
[Figure: facet plot of data and fits for datasets 0–5]
Comparison of one- or two-peaks models¶
Here we plot a comparison of the two fitted models (one or two peaks) for different datasets.
First we create a single tidy DataFrame with data from the two models:
In [32]:
da['model'] = 'twopeaks'
da1['model'] = 'onepeak'
da_tot = pd.concat([da, da1], ignore_index=True)
Then we perform a facet plot with seaborn:
In [33]:
grid = sns.FacetGrid(da_tot.query('dataset < 6'), col="dataset", hue="model", col_wrap=3)
grid.map(plt.plot, 'x', 'data', marker='o', ls='None', ms=3, color='k')
grid.map(plt.plot, "x", "best_fit")
grid.add_legend();
[Figure: facet plot comparing one-peak and two-peaks best fits]
Note that the “tidy” organization of data allows plot libraries such as seaborn to automatically infer most of the information needed to create complex plots with simple commands. Without tidy data, creating such plots manually would be a daunting task.
PyBroom Example - Multiple Datasets - Minimize¶
This notebook is part of pybroom.
This notebook demonstrates using pybroom when performing Maximum-Likelihood fitting (scalar minimization, as opposed to curve fitting) of a set of datasets with lmfit.minimize. We will show that pybroom greatly simplifies comparing, filtering and plotting fit results from multiple datasets. For an example using curve fitting see pybroom-example-multi-datasets.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina' # for hi-dpi displays
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pylab import normpdf
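# note: normpdf has since been removed from matplotlib;
# scipy.stats.norm.pdf(x, mu, sigma) is an equivalent replacement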
import seaborn as sns
from lmfit import Model
import lmfit
print('lmfit: %s' % lmfit.__version__)
lmfit: 0.9.5
In [2]:
sns.set_style('whitegrid')
In [3]:
import pybroom as br
Create Noisy Data¶
Simulate N datasets which are identical except for the additive noise.
In [4]:
N = 20 # number of datasets
n = 1000 # number of samples in each dataset
np.random.seed(1)
d1 = np.random.randn(20, int(0.6*n))*0.5 - 2
d2 = np.random.randn(20, int(0.4*n))*1.5 + 2
d = np.hstack((d1, d2))
In [5]:
ds = pd.DataFrame(data=d, columns=range(d.shape[1])).stack().reset_index()
ds.columns = ['dataset', 'sample', 'data']
ds.head()
Out[5]:
dataset | sample | data | |
---|---|---|---|
0 | 0 | 0 | -1.187827 |
1 | 0 | 1 | -2.305878 |
2 | 0 | 2 | -2.264086 |
3 | 0 | 3 | -2.536484 |
4 | 0 | 4 | -1.567296 |
In [6]:
kws = dict(bins = np.arange(-5, 5.1, 0.1), histtype='step',
lw=2, color='k', alpha=0.1)
for i in range(N):
ds.loc[ds.dataset == i, :].data.plot.hist(**kws)
[Figure: overlaid histograms of the 20 simulated datasets]
Model Fitting¶
Two-peaks model¶
Here, we use a Gaussian mixture distribution for fitting the data.
We fit the data using the Maximum-Likelihood method, i.e. we minimize the (negative) log-likelihood function:
In [7]:
# Model PDF to be maximized
def model_pdf(x, a2, mu1, mu2, sig1, sig2):
a1 = 1 - a2
return (a1 * normpdf(x, mu1, sig1) +
a2 * normpdf(x, mu2, sig2))
# Function to be minimized by lmfit
def log_likelihood_lmfit(params, x):
pnames = ('a2', 'mu1', 'mu2', 'sig1', 'sig2')
kws = {n: params[n] for n in pnames}
return -np.log(model_pdf(x, **kws)).sum()
We define the parameters and “fit” the \(N\) datasets by minimizing the (scalar) function log_likelihood_lmfit:
In [8]:
params = lmfit.Parameters()
params.add('a2', 0.5, min=0, max=1)
params.add('mu1', -1, min=-5, max=5)
params.add('mu2', 1, min=-5, max=5)
params.add('sig1', 1, min=1e-6)
params.add('sig2', 1, min=1e-6)
params.add('ax', expr='a2') # just a test for a derived parameter
Results = [lmfit.minimize(log_likelihood_lmfit, params, args=(di,),
nan_policy='omit', method='least_squares')
for di in d]
Fit results can be inspected with lmfit.fit_report() or params.pretty_print():
In [9]:
print(lmfit.fit_report(Results[0]))
print()
Results[0].params.pretty_print()
[[Fit Statistics]]
# function evals = 137
# data points = 1
# variables = 6
chi-square = 3136682.002
reduced chi-square = -627336.400
Akaike info crit = 26.959
Bayesian info crit = 14.959
[[Variables]]
a2: 0.39327059 (init= 0.5)
mu1: -1.96272254 (init=-1)
mu2: 2.13070136 (init= 1)
sig1: 0.49895469 (init= 1)
sig2: 1.45925458 (init= 1)
ax: 0.39327059 == 'a2'
[[Correlations]] (unreported correlations are < 0.100)
Name Value Min Max Stderr Vary Expr
a2 0.3933 0 1 None True None
ax 0.3933 -inf inf None False a2
mu1 -1.963 -5 5 None True None
mu2 2.131 -5 5 None True None
sig1 0.499 1e-06 inf None True None
sig2 1.459 1e-06 inf None True None
This is good for peeking at the results. However, extracting these data from lmfit objects is quite a chore and requires good knowledge of the lmfit object structure.
pybroom helps in this task: it extracts data from fit results and returns a familiar pandas DataFrame (in tidy format). Thanks to the tidy format these data can be much more easily manipulated, filtered and plotted.
Let’s use the glance and tidy functions:
In [10]:
dg = br.glance(Results)
dg.drop('message', 1).head()
Out[10]:
method | num_params | num_data_points | chisqr | redchi | AIC | BIC | num_func_eval | success | key | |
---|---|---|---|---|---|---|---|---|---|---|
0 | least_squares | 6 | 1 | 3.136682e+06 | -627336.400468 | 26.958676 | 14.958676 | 137 | True | 0 |
1 | least_squares | 6 | 1 | 3.060677e+06 | -612135.352681 | 26.934147 | 14.934147 | 112 | True | 1 |
2 | least_squares | 6 | 1 | 3.260012e+06 | -652002.425980 | 26.997241 | 14.997241 | 141 | True | 2 |
3 | least_squares | 6 | 1 | 3.096436e+06 | -619287.111223 | 26.945762 | 14.945762 | 105 | True | 3 |
4 | least_squares | 6 | 1 | 3.097056e+06 | -619411.157122 | 26.945962 | 14.945962 | 145 | True | 4 |
In [11]:
dt = br.tidy(Results, var_names='dataset')
dt.query('dataset == 0')
Out[11]:
name | value | min | max | vary | expr | stderr | init_value | dataset | |
---|---|---|---|---|---|---|---|---|---|
0 | a2 | 0.393271 | 0.000000 | 1.000000 | True | None | NaN | 0.000000 | 0 |
1 | ax | 0.393271 | -inf | inf | False | a2 | NaN | NaN | 0 |
2 | mu1 | -1.962723 | -5.000000 | 5.000000 | True | None | NaN | -0.201358 | 0 |
3 | mu2 | 2.130701 | -5.000000 | 5.000000 | True | None | NaN | 0.201358 | 0 |
4 | sig1 | 0.498955 | 0.000001 | inf | True | None | NaN | 1.732050 | 0 |
5 | sig2 | 1.459255 | 0.000001 | inf | True | None | NaN | 1.732050 | 0 |
Note that while glance returns one row per fit result, the tidy function returns one row per fitted parameter.
We can query the value of one parameter (peak position) across the multiple datasets:
In [12]:
dt.query('name == "mu1"').head()
Out[12]:
name | value | min | max | vary | expr | stderr | init_value | dataset | |
---|---|---|---|---|---|---|---|---|---|
2 | mu1 | -1.962723 | -5.0 | 5.0 | True | None | NaN | -0.201358 | 0 |
8 | mu1 | -2.003590 | -5.0 | 5.0 | True | None | NaN | -0.201358 | 1 |
14 | mu1 | -1.950642 | -5.0 | 5.0 | True | None | NaN | -0.201358 | 2 |
20 | mu1 | -2.015723 | -5.0 | 5.0 | True | None | NaN | -0.201358 | 3 |
26 | mu1 | -2.009203 | -5.0 | 5.0 | True | None | NaN | -0.201358 | 4 |
By computing the standard deviation of the peak positions:
In [13]:
dt.query('name == "mu1"')['value'].std()
Out[13]:
0.024538412206652392
In [14]:
dt.query('name == "mu2"')['value'].std()
Out[14]:
0.12309821388429468
we see that the estimate of mu1 has less error than the estimate of mu2.
This difference can also be observed in the histogram of the fitted values:
In [15]:
dt.query('name == "mu1"')['value'].hist()
dt.query('name == "mu2"')['value'].hist(ax=plt.gca());
[Figure: histograms of fitted mu1 and mu2 values]
We can also use pybroom’s tidy_to_dict and dict_to_tidy functions to convert a set of fitted parameters to a dict (and vice-versa):
In [16]:
kwd_params = br.tidy_to_dict(dt.loc[dt['dataset'] == 0])
kwd_params
Out[16]:
{'a2': 0.39327059477457676,
'ax': 0.39327059477457676,
'mu1': -1.962722540352284,
'mu2': 2.1307013636579843,
'sig1': 0.49895469278155413,
'sig2': 1.4592545828997032}
In [17]:
br.dict_to_tidy(kwd_params)
Out[17]:
name | value | |
---|---|---|
0 | a2 | 0.393271 |
1 | ax | 0.393271 |
2 | mu1 | -1.962723 |
3 | mu2 | 2.130701 |
4 | sig1 | 0.498955 |
5 | sig2 | 1.459255 |
This conversion is useful for calling a python function, passing argument values from a tidy DataFrame.
For example, here we use tidy_to_dict to easily plot the model distribution:
In [18]:
bins = np.arange(-5, 5.01, 0.25)
x = bins[:-1] + 0.5*(bins[1] - bins[0])
grid = sns.FacetGrid(ds.query('dataset < 6'), col='dataset', hue='dataset', col_wrap=3)
grid.map(plt.hist, 'data', bins=bins, normed=True);
for i, ax in enumerate(grid.axes):
kw_pars = br.tidy_to_dict(dt.loc[dt.dataset == i], keys_exclude=['ax'])
y = model_pdf(x, **kw_pars)
ax.plot(x, y, lw=2, color='k')
[Figure: data histograms with fitted two-peaks model PDFs, datasets 0–5]
Single-peak model¶
For the sake of the example we also fit the \(N\) datasets with a single Gaussian distribution:
In [19]:
def model_pdf1(x, mu, sig):
return normpdf(x, mu, sig)
def log_likelihood_lmfit1(params, x):
return -np.log(model_pdf1(x, **params.valuesdict())).sum()
In [20]:
params = lmfit.Parameters()
params.add('mu', 0, min=-5, max=5)
params.add('sig', 1, min=1e-6)
Results1 = [lmfit.minimize(log_likelihood_lmfit1, params, args=(di,),
nan_policy='omit', method='least_squares')
for di in d]
In [21]:
dg1 = br.glance(Results)  # note: this was likely meant to be br.glance(Results1), the single-peak fits
dg1.drop('message', 1).head()
Out[21]:
method | num_params | num_data_points | chisqr | redchi | AIC | BIC | num_func_eval | success | key | |
---|---|---|---|---|---|---|---|---|---|---|
0 | least_squares | 6 | 1 | 3.136682e+06 | -627336.400468 | 26.958676 | 14.958676 | 137 | True | 0 |
1 | least_squares | 6 | 1 | 3.060677e+06 | -612135.352681 | 26.934147 | 14.934147 | 112 | True | 1 |
2 | least_squares | 6 | 1 | 3.260012e+06 | -652002.425980 | 26.997241 | 14.997241 | 141 | True | 2 |
3 | least_squares | 6 | 1 | 3.096436e+06 | -619287.111223 | 26.945762 | 14.945762 | 105 | True | 3 |
4 | least_squares | 6 | 1 | 3.097056e+06 | -619411.157122 | 26.945962 | 14.945962 | 145 | True | 4 |
In [22]:
dt1 = br.tidy(Results1, var_names='dataset')
dt1.query('dataset == 0')
Out[22]:
name | value | min | max | vary | expr | stderr | init_value | dataset | |
---|---|---|---|---|---|---|---|---|---|
0 | mu | -0.352842 | -5.000000 | 5.000000 | True | NaN | NaN | 0.00000 | 0 |
1 | sig | 2.233056 | 0.000001 | inf | True | NaN | NaN | 1.73205 | 0 |
Augment?¶
Pybroom’s augment function extracts information that is the same size as the input dataset, for example the array of residuals. In this case, however, we performed a scalar minimization (the log-likelihood function returns a scalar) and therefore the MinimizerResult object does not contain any residual array or other data of the same size as the dataset.
Comparing fit results¶
Instead, we will compare the single-peak and two-peaks distributions using the results from the tidy function obtained in the previous section.
We start with the following plot:
In [23]:
dt['model'] = 'twopeaks'
dt1['model'] = 'onepeak'
dt_tot = pd.concat([dt, dt1], ignore_index=True)
In [24]:
bins = np.arange(-5, 5.01, 0.25)
x = bins[:-1] + 0.5*(bins[1] - bins[0])
grid = sns.FacetGrid(ds.query('dataset < 6'), col='dataset', hue='dataset', col_wrap=3)
grid.map(plt.hist, 'data', bins=bins, normed=True);
for i, ax in enumerate(grid.axes):
kw_pars = br.tidy_to_dict(dt_tot.loc[(dt_tot.dataset == i) & (dt_tot.model == 'onepeak')])
y1 = model_pdf1(x, **kw_pars)
li1, = ax.plot(x, y1, lw=2, color='k', alpha=0.5)
kw_pars = br.tidy_to_dict(dt_tot.loc[(dt_tot.dataset == i) & (dt_tot.model == 'twopeaks')], keys_exclude=['ax'])
y = model_pdf(x, **kw_pars)
li, = ax.plot(x, y, lw=2, color='k')
grid.add_legend(legend_data=dict(onepeak=li1, twopeaks=li),
label_order=['onepeak', 'twopeaks'], title='model');
[Figure: data histograms with one-peak and two-peaks fitted PDFs, datasets 0–5]
The problem is that FacetGrid only takes one DataFrame as input. In the previous example we provide the DataFrame of “experimental” data (ds) and use the .map method to plot histograms of the different datasets. The fitted distributions, instead, are plotted manually in the for loop.
We can invert the approach and pass FacetGrid the DataFrame of fitted parameters (dt_tot), while leaving the simple histogram for manual plotting. In this case we need to write a helper function (_plot) that knows how to plot a distribution given a set of parameters:
In [25]:
def _plot(names, values, x, label=None, color=None):
df = pd.concat([names, values], axis=1)
kw_pars = br.tidy_to_dict(df, keys_exclude=['ax'])
func = model_pdf1 if label == 'onepeak' else model_pdf
y = func(x, **kw_pars)
plt.plot(x, y, lw=2, color=color, label=label)
In [26]:
bins = np.arange(-5, 5.01, 0.25)
x = bins[:-1] + 0.5*(bins[1] - bins[0])
grid = sns.FacetGrid(dt_tot.query('dataset < 6'), col='dataset', hue='model', col_wrap=3)
grid.map(_plot, 'name', 'value', x=x)
grid.add_legend()
for i, ax in enumerate(grid.axes):
ax.hist(ds.query('dataset == %d' % i).data, bins=bins, histtype='stepfilled', normed=True,
color='gray', alpha=0.5);
[Figure: fitted PDFs per model with data histograms, datasets 0–5]
For an even better (i.e. simpler) example of plots of fit results see pybroom-example-multi-datasets.
PyBroom Example - Multiple Datasets - Scipy Robust Fit¶
This notebook is part of pybroom.
This notebook demonstrates using pybroom when fitting a set of curves (curve fitting) with robust fitting and scipy. We will show that pybroom greatly simplifies comparing, filtering and plotting fit results from multiple datasets. See pybroom-example-multi-datasets for an example using lmfit.Model instead of scipy directly.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina' # for hi-dpi displays
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pylab import normpdf
import seaborn as sns
from lmfit import Model
import lmfit
print('lmfit: %s' % lmfit.__version__)
lmfit: 0.9.5
In [2]:
sns.set_style('whitegrid')
In [3]:
import pybroom as br
Create Noisy Data¶
We start by simulating N datasets which are identical except for the additive noise.
In [4]:
N = 200
In [5]:
x = np.linspace(-10, 10, 101)
In [6]:
peak1 = lmfit.models.GaussianModel(prefix='p1_')
peak2 = lmfit.models.GaussianModel(prefix='p2_')
model = peak1 + peak2
In [7]:
#params = model.make_params(p1_amplitude=1.5, p2_amplitude=1,
# p1_sigma=1, p2_sigma=1)
In [8]:
Y_data = np.zeros((N, x.size))
Y_data.shape, x.shape
Out[8]:
((200, 101), (101,))
In [9]:
for i in range(Y_data.shape[0]):
Y_data[i] = model.eval(x=x, p1_center=-1, p2_center=2,
p1_sigma=0.5, p2_sigma=1,
p1_height=1, p2_height=0.5)
Y_data += np.random.randn(*Y_data.shape)/10
Add some outliers:
In [10]:
num_outliers = int(Y_data.size * 0.05)
idx_ol = np.random.randint(low=0, high=Y_data.size, size=num_outliers)
Y_data.reshape(-1)[idx_ol] = (np.random.rand(num_outliers) - 0.5)*4
In [11]:
plt.plot(x, Y_data.T, 'ok', alpha=0.1);
plt.title('%d simulated datasets, with outliers' % N);
[Figure: 200 simulated datasets with outliers]
Model Fitting¶
curve_fit()¶
In [12]:
import scipy.optimize as so
In [13]:
from collections import namedtuple
In [14]:
# Model PDF to be maximized
def model_pdf(x, a1, a2, mu1, mu2, sig1, sig2):
return (a1 * normpdf(x, mu1, sig1) +
a2 * normpdf(x, mu2, sig2))
In [15]:
result = so.curve_fit(model_pdf, x, Y_data[0])
/home/docs/checkouts/readthedocs.org/user_builds/pybroom/envs/stable/lib/python3.4/site-packages/scipy/optimize/minpack.py:715: OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
In [16]:
type(result), type(result[0]), type(result[1])
Out[16]:
(tuple, numpy.ndarray, numpy.ndarray)
In [17]:
result[0]
Out[17]:
array([ 2.45302397, 0.68802988, 0.48958039, 1.87556378, 2.04035127,
0.00274318])
Using a namedtuple is a clean way to assign names to an array of parameters:
In [18]:
Params = namedtuple('Params', 'a1 a2 mu1 mu2 sig1 sig2')
In [19]:
p = Params(*result[0])
p
Out[19]:
Params(a1=2.4530239688534445, a2=0.68802988315235913, mu1=0.4895803935552317, mu2=1.8755637799433695, sig1=2.040351270505754, sig2=0.0027431809705545929)
Unfortunately, not much data is returned by curve_fit: just a 2-element tuple with:
- the array of best-fit parameters
- the estimated covariance matrix of the parameters
Therefore curve_fit is not very useful for detailed comparison of fit results. A better interface for curve fitting would be lmfit.Model (see this other notebook).
In the current notebook we keep exploring further options offered by scipy.optimize.
least_squares()¶
As an example, we use the least_squares function, which supports robust loss functions and constraints.
We need to define the residuals:
In [20]:
def residuals(p, x, y):
return y - model_pdf(x, *p)
Then, we fit the N datasets with different loss functions, storing the results in a dict of lists:
In [21]:
losses = ('linear', 'huber', 'cauchy')
Results = {}
for loss in losses:
Results[loss] = [so.least_squares(residuals, (1,1,0,1,1,1), args=(x, y), loss=loss, f_scale=0.5)
for y in Y_data]
In [22]:
# result = Results['cauchy'][0]
# for k in result.keys():
# print(k, type(result[k]))
Tidying the results¶
Now we tidy the results, combining the results for the different loss functions in a single DataFrame.
We start with the glance function, which returns one row per fit result:
In [23]:
dg_tot = br.glance(Results, var_names=['loss', 'dataset'])
dg_tot.head()
Out[23]:
success | cost | optimality | nfev | njev | message | dataset | loss | |
---|---|---|---|---|---|---|---|---|
0 | True | 0.886438 | 0.000010 | 15 | 13 | `ftol` termination condition is satisfied. | 0 | cauchy |
1 | True | 1.427066 | 0.000009 | 13 | 12 | `ftol` termination condition is satisfied. | 1 | cauchy |
2 | True | 1.449007 | 0.000047 | 14 | 13 | `ftol` termination condition is satisfied. | 2 | cauchy |
3 | True | 2.366572 | 0.000025 | 14 | 13 | `ftol` termination condition is satisfied. | 3 | cauchy |
4 | True | 1.264164 | 0.000006 | 14 | 12 | `ftol` termination condition is satisfied. | 4 | cauchy |
In [24]:
dg_tot.success.all()
Out[24]:
False
Then we apply tidy, which returns one row per parameter.
Since the OptimizeResult object returned by scipy.optimize only contains an array of parameters, we need to pass the names as an additional argument:
In [25]:
pnames = 'a1 a2 mu1 mu2 sig1 sig2'
dt_tot = br.tidy(Results, var_names=['loss', 'dataset'], param_names=pnames)
dt_tot.head()
Out[25]:
name | value | grad | active_mask | dataset | loss | |
---|---|---|---|---|---|---|
0 | a1 | 0.974333 | -3.912972e-10 | 0.0 | 0 | cauchy |
1 | a2 | 1.183553 | -5.232544e-07 | 0.0 | 0 | cauchy |
2 | mu1 | -1.005827 | -5.739109e-07 | 0.0 | 0 | cauchy |
3 | mu2 | 1.817380 | 5.058517e-06 | 0.0 | 0 | cauchy |
4 | sig1 | 0.503600 | 2.930292e-07 | 0.0 | 0 | cauchy |
Finally, we cannot apply the augment function, since the OptimizeResult object does not include much per-data-point information (it may contain the array of residuals).
Plots¶
First we plot the distributions of the fitted peak centers and sigmas:
In [26]:
kws = dict(bins = np.arange(-2, 4, 0.1), histtype='step', lw=2)
for loss in losses:
dt_tot.query('(name == "mu1" or name == "mu2") and loss == "%s"' % loss)['value'].hist(label=loss, **kws)
kws['ax'] = plt.gca()
plt.title(' Distribution of peaks centers')
plt.legend();
[Figure: distribution of peak centers for each loss function]
In [27]:
kws = dict(bins = np.arange(0, 4, 0.1), histtype='step', lw=2)
for loss in losses:
dt_tot.query('(name == "sig1" or name == "sig2") and loss == "%s"' % loss)['value'].hist(label=loss, **kws)
kws['ax'] = plt.gca()
plt.title(' Distribution of peaks sigmas')
plt.legend();
[Figure: distribution of peak sigmas for each loss function]
A more complete overview of all the fit parameters can be obtained with a factorplot:
In [28]:
sns.factorplot(x='loss', y='value', data=dt_tot, col='name', hue='loss',
col_wrap=4, kind='point', sharey=False);
[Figure: factorplot of all fit parameters by loss function]
From all the previous plots we see that, as expected, robust fitting with stronger damping of outliers (i.e. cauchy vs huber or linear) results in more accurate fits.
Finally, we can have a peek at the comparison of raw data and fitted models for a few datasets.
Since OptimizeResult does not include “augmented” data, we need to generate it by evaluating the model with the best-fit parameters. We use seaborn’s FacetGrid, passing a custom function _plot for model evaluation:
In [29]:
def _plot(names, values, x, label=None, color=None):
df = pd.concat([names, values], axis=1)
kw_pars = br.tidy_to_dict(df)
y = model_pdf(x, **kw_pars)
plt.plot(x, y, lw=2, color=color, label=label)
In [30]:
grid = sns.FacetGrid(dt_tot.query('dataset < 9'), col='dataset', hue='loss', col_wrap=3)
grid.map(_plot, 'name', 'value', x=x)
grid.add_legend()
for i, ax in enumerate(grid.axes):
ax.plot(x, Y_data[i], 'o', ms=3, color='k')
plt.ylim(-1, 1.5)
Out[30]:
(-1, 1.5)
[Figure: data and fitted models, datasets 0–8, one line per loss function]
For comparison, the ModelResult object returned by lmfit contains not only the evaluated model but also the evaluation of the single components (each single peak in this case). Therefore the above plot can be generated more straightforwardly using the “augmented” data. See the notebook pybroom-example-multi-datasets for an example.