pyblock¶
pyblock implements the reblocking analysis (see, for example, the description given by Flyvbjerg and Petersen [1]), to remove serial correlation from a data set and hence obtain an improved estimate of the standard error. Functions for additional analysis, interpretation and manipulation of the resultant mean and standard error estimates are also provided.
A command-line interface is not currently provided, but the API is simple to use either interactively from a Python/IPython shell or from an application-specific script.
Installation¶
Dependencies¶
- numpy
- pandas (0.13 and later)
- matplotlib
pandas is only required for pyblock.pd_utils and pyblock.error, and matplotlib for pyblock.plot. Hence pandas and/or matplotlib need not be installed if those submodules are not required, in which case pyblock/__init__.py must be modified to stop pyblock.pd_utils and pyblock.error from being automatically imported.
Installation instructions¶
pyblock can be installed from PyPI:
$ pip install pyblock
or from the source package:
$ python setup.py install
Both pip and setup.py have options for installing in non-default locations, such as home directories. Add --help to the above commands for details.
Alternatively, pyblock can be used directly from source by adding the location of the
pyblock directory to the PYTHONPATH
environment variable.
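For example, if the source tree lives at a hypothetical ~/src/pyblock:

```shell
# Make the pyblock package importable without installing it; adjust the
# path to wherever the pyblock source tree actually lives.
export PYTHONPATH=~/src/pyblock:$PYTHONPATH
```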
pyblock API¶
pyblock is a python module for analysis of correlated data.
pyblock.blocking¶
Tools for reblocking of data to remove serial correlation from data sets.
pyblock.blocking.reblock(data, rowvar=1, ddof=None, weights=None)¶
Blocking analysis of correlated data.
Repeatedly average neighbouring data points in order to remove the effect of serial correlation on the estimate of the standard error of a data set, as described by Flyvbjerg and Petersen [Flyvbjerg]. The standard error is constant (within error bars) once the correlation has been removed.
If a weighting is provided then the weighted variance and standard error of each variable are calculated, as described in [Pozzi]. The Bessel correction is obtained using the “effective sample size” from [Madansky].
Parameters:
- data : numpy.ndarray
  1D or 2D array containing multiple variables and data points. See rowvar.
- rowvar : int
  If rowvar is non-zero (default) then each row represents a variable and each column a data point per variable. Otherwise the relationship is swapped. Only used if data is a 2D array.
- ddof : int
  If not None, then the standard error and covariance are normalised by \((N - \text{ddof})\), where \(N\) is the number of data points per variable. Otherwise, the numpy default is used (i.e. \((N - 1)\)).
- weights : numpy.array
  A 1D weighting of the data to be reblocked. For multidimensional data an identical weighting is applied to the data for each variable.
Returns:
- block_info : list of collections.namedtuple
  Statistics from each reblocking iteration. Each tuple contains:
  - block : int
    Blocking iteration. Each iteration successively averages neighbouring pairs of data points. The final data point is discarded if the number of data points is odd.
  - ndata : int
    Number of data points in the blocking iteration.
  - mean : numpy.ndarray
    Mean of each variable in the data set.
  - cov : numpy.ndarray
    Covariance matrix.
  - std_err : numpy.ndarray
    Standard error of each variable.
  - std_err_err : numpy.ndarray
    An estimate of the error in the standard error, assuming a Gaussian distribution.
References:
[Flyvbjerg] “Error estimates on averages of correlated data”, H. Flyvbjerg and H.G. Petersen, J. Chem. Phys. 91, 461 (1989).
[Pozzi] “Exponential smoothing weighted correlations”, F. Pozzi, T. Di Matteo and T. Aste, Eur. Phys. J. B 85, 175 (2012).
[Madansky] “An Analysis of WinCross, SPSS, and Mentor Procedures for Estimating the Variance of a Weighted Mean”, A. Madansky and H. G. B. Alexander, www.analyticalgroup.com/download/weighted_variance.pdf
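As a rough illustration of what reblock does internally, the pair-averaging loop can be sketched in plain numpy. This is a simplified, 1D sketch only, not pyblock's implementation: it omits 2D data, weights, the covariance and the estimate of the error in the standard error.

```python
import numpy as np

def reblock_sketch(data):
    """Repeatedly average neighbouring pairs of data points, recording the
    naive mean and standard error at each blocking level.  Simplified
    illustration of the reblocking idea (not pyblock's actual code)."""
    data = np.asarray(data, dtype=float)
    stats = []
    block = 0
    while len(data) >= 2:
        n = len(data)
        stats.append((block, n, data.mean(), data.std(ddof=1) / np.sqrt(n)))
        # Drop the final point if n is odd, then average neighbouring pairs.
        data = (data[0:2 * (n // 2):2] + data[1:2 * (n // 2):2]) / 2.0
        block += 1
    return stats
```

For uncorrelated data the recorded standard error stays roughly constant across blocking levels; for correlated data it grows until the correlation has been averaged out.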
pyblock.blocking.find_optimal_block(ndata, stats)¶
Find the optimal block length from a reblocking calculation.
Inspect a reblocking calculation and find the block length which minimises the stochastic error and removes the effect of correlation from the data set. This follows the procedures detailed by [Wolff] and [Lee] et al.
Parameters:
- ndata : int
  Number of data points (‘observations’) in the data set.
- stats : list of tuples
  Statistics in the format returned by pyblock.blocking.reblock().
Returns:
- list of int
  The optimal block index for each variable (i.e. the first block index at which the correlation has been removed). If NaN, the statistics provided were not sufficient to estimate the correlation length and more data should be collected.
[Wolff] (Eq 47) and [Lee] et al. (Eq 14) give the optimal block size to be
\[B^3 = 2 n n_{\text{corr}}^2,\]
where \(n\) is the number of data points in the data set, \(B\) is the number of data points in each ‘block’ (i.e. the data set has been divided into \(n/B\) contiguous blocks) and \(n_{\text{corr}}\) is the correlation length of the data. Following the scheme proposed by [Lee] et al., we hence look for the largest block size which satisfies
\[B^3 \geq 2 n n_{\text{corr}}^2.\]
From Eq 13 in [Lee] et al. (which they cast in terms of the variance):
\[n_{\text{err}} SE = SE_{\text{true}},\]
where the ‘error factor’, \(n_{\text{err}}\), is the square root of the estimated correlation length, \(SE\) is the standard error of the data set and \(SE_{\text{true}}\) is the true standard error once the correlation length has been taken into account. Hence the condition becomes:
\[B^3 \geq 2 n (SE(B) / SE(0))^4,\]
where \(SE(B)\) is the estimate of the standard error of the data divided into blocks of size \(B\).
I am grateful to Will Vigor for discussions and the initial implementation.
[Wolff] “Monte Carlo errors with less errors”, U. Wolff, Comput. Phys. Commun. 156, 143 (2004) and arXiv:hep-lat/0306017.
[Lee] “Strategies for improving the efficiency of quantum Monte Carlo calculations”, R. M. Lee, G. J. Conduit, N. Nemec, P. Lopez Rios and N. D. Drummond, Phys. Rev. E 83, 066706 (2011).
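The criterion above can be sketched in plain Python. This is a simplified, single-variable illustration under the assumption that blocking iteration \(i\) corresponds to block size \(B = 2^i\); pyblock's find_optimal_block operates on the full statistics returned by reblock and may differ in detail.

```python
def optimal_block_index(ndata, std_errs):
    """Scan from the largest block downwards and return the smallest
    blocking iteration in the topmost contiguous run satisfying
    B**3 >= 2*n*(SE(B)/SE(0))**4; NaN if no iteration qualifies.
    std_errs[i] is the standard error estimate at blocking iteration i
    (block size B = 2**i).  Simplified sketch, not pyblock's code."""
    se0 = std_errs[0]
    opt = float('nan')
    for i in reversed(range(len(std_errs))):
        B = 2 ** i
        if B ** 3 >= 2 * ndata * (std_errs[i] / se0) ** 4:
            opt = i
        else:
            break  # criterion fails; stop at the top contiguous run
    return opt
```

With a synthetic set of standard errors that plateaus, the sketch picks out the first iteration whose block size is large enough to satisfy the inequality.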
pyblock.error¶
Simple error propagation.
Note
We only implement the functions as we need them…
pyblock.error.ratio(stats_A, stats_B, cov_AB, data_len)¶
Calculate the mean and standard error of \(f(A,B) = A/B\).
Parameters:
- stats_A : pandas.Series or pandas.DataFrame
  Statistics (containing at least the ‘mean’ and ‘standard error’ fields) for variable \(A\). The rows contain different values of these statistics (e.g. from a reblocking analysis) if pandas.DataFrame are passed.
- stats_B : pandas.Series or pandas.DataFrame
  Similarly for variable \(B\).
- cov_AB : float or pandas.Series
  Covariance between variables \(A\) and \(B\). If stats_A and stats_B are pandas.DataFrame, then this must be a pandas.Series with the same index as stats_A and stats_B.
- data_len : int or pandas.Series
  Number of data points (‘observations’) used to obtain the statistics given in stats_A and stats_B. If stats_A and stats_B are pandas.DataFrame, then this must be a pandas.Series with the same index as stats_A and stats_B.
Returns:
- stats : pandas.Series or pandas.DataFrame
  Mean and standard error (and, if possible/relevant, optimal reblock iteration) for \(f(A,B)\). If stats_A and stats_B are pandas.DataFrame, this is a pandas.DataFrame with the same index; otherwise a pandas.Series is returned.
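For a single pair of scalar statistics, the propagation behind a ratio-type estimator can be sketched as standard first-order error propagation. This is a sketch, not pyblock's code, and it assumes the covariance term is normalised by the number of observations:

```python
import math

def ratio_stats(mean_a, se_a, mean_b, se_b, cov_ab, data_len):
    """First-order error propagation for f = A/B.  Sketch only: assumes
    the covariance contribution is cov_ab/data_len (i.e. the covariance
    of the *means* is the sample covariance over the observation count)."""
    mean = mean_a / mean_b
    rel_var = ((se_a / mean_a) ** 2 + (se_b / mean_b) ** 2
               - 2.0 * cov_ab / (data_len * mean_a * mean_b))
    return mean, abs(mean) * math.sqrt(rel_var)
```

With zero covariance the relative errors simply add in quadrature.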
pyblock.error.product(stats_A, stats_B, cov_AB, data_len)¶
Calculate the mean and standard error of \(f(A,B) = A \times B\).
See ratio() for the parameters and return values.
pyblock.error.subtraction(stats_A, stats_B, cov_AB, data_len)¶
Calculate the mean and standard error of \(f(A,B) = A - B\).
See ratio() for the parameters and return values.
pyblock.error.addition(stats_A, stats_B, cov_AB, data_len)¶
Calculate the mean and standard error of \(f(A,B) = A + B\).
See ratio() for the parameters and return values.
pyblock.error.pretty_fmt_err(val, err)¶
Pretty formatting of a value and associated error.
Parameters:
- val : number
  A (noisy) value.
- err : number
  Error associated with the value.
Returns:
- val_str : str
  Value to the number of significant digits known, with the error in the last digit in brackets.
>>> pretty_fmt_err(1.2345, 0.01)
'1.23(1)'
>>> pretty_fmt_err(12331, 40)
'12330(40)'
Rounding is handled with Python’s round function, which handles rounding numbers at the midpoint of a range (e.g. 5 if rounding to the nearest 10) in a slightly odd way. As we’re normally dealing with noisy data and rounding to remove more than just one significant figure, this is unlikely to affect us.
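The formatting logic can be sketched as follows. This is a hypothetical re-implementation for illustration (the function name fmt_err is invented; it is not pyblock's code): round the error to one significant figure and round the value to the same decimal place.

```python
import math

def fmt_err(val, err):
    """Format val with err rounded to one significant digit, shown in
    brackets after the last quoted digit.  Illustrative sketch only."""
    exp = int(math.floor(math.log10(abs(err))))   # decade of the error
    err_digit = int(round(err / 10 ** exp))       # leading error digit
    if err_digit == 10:                           # rounding bumped a decade
        exp += 1
        err_digit = 1
    if exp >= 0:
        val_r = int(round(val / 10 ** exp)) * 10 ** exp
        return f'{val_r}({err_digit * 10 ** exp})'
    return f'{val:.{-exp}f}({err_digit})'
```

This reproduces the doctest examples quoted above.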
pyblock.pd_utils¶
Pandas-based wrapper around pyblock.blocking.
pyblock.pd_utils.reblock(data, axis=0, weights=None)¶
Blocking analysis of correlated data.
Parameters:
- data : pandas.Series or pandas.DataFrame
  Data to be blocked. See axis for order.
- axis : int
  If non-zero, variables in data are in rows with the columns corresponding to the observation values. Blocking is then performed along the rows. Otherwise each column is a variable, the observations are in the columns and blocking is performed down the columns. Only used if data is a pandas.DataFrame.
- weights : pandas.Series or pandas.DataFrame
  A 1D weighting of the data to be reblocked. For multidimensional data an identical weighting is applied to the data for each variable.
Returns:
- data_len : pandas.Series
  Number of data points used in each reblocking iteration. Note some reblocking iterations discard a data point if there were an odd number of data points in the previous iteration.
- block_info : pandas.DataFrame
  Mean, standard error and estimated error in the standard error for each variable at each reblock step.
- covariance : pandas.DataFrame
  Covariance matrix at each reblock step.
See also:
pyblock.blocking.reblock(): numpy-based implementation; see for documentation and notes on the reblocking procedure. pyblock.pd_utils.reblock() is a simple wrapper around this.
pyblock.pd_utils.optimal_block(block_sub_info)¶
Get the optimal block value from the reblocking data.
Parameters:
- block_sub_info : pandas.DataFrame or pandas.Series
  Reblocking data (i.e. the first item of the tuple returned by reblock), or a subset thereof containing the statistics columns for one or more data items.
Returns:
- index : int
  Reblocking index corresponding to the reblocking iteration at which serial correlation has been removed (as estimated by the procedure in pyblock.blocking.find_optimal_block). If multiple data sets are passed in block_sub_info, this is the maximum index out of all data sets. Set to inf if an optimal block is not found for a data set.
Raises:
- ValueError
  If block_sub_info contains no Series or column in DataFrame named ‘optimal block’.
pyblock.pd_utils.reblock_summary(block_sub_info)¶
Get the data corresponding to the optimal block from the reblocking data.
Parameters:
- block_sub_info : pandas.DataFrame or pandas.Series
  Reblocking data (i.e. the first item of the tuple returned by reblock), or a subset thereof containing the statistics columns for one or more data items.
Returns:
- summary : pandas.DataFrame
  Mean, standard error and estimate of the error in the standard error corresponding to the optimal block size in the reblocking data (or the largest optimal size if multiple data sets are given). The index is labelled with the data name, if known. An empty DataFrame is returned if no optimal block size was found.
pyblock.plot¶
Helper for plotting reblocking data.
pyblock.plot.plot_reblocking(block_info, plotfile=None, plotshow=True)¶
Plot the reblocking data.
Parameters:
- block_info : pandas.DataFrame
  Reblocking data (i.e. the first item of the tuple returned by reblock).
- plotfile : string
  If not null, save the plot to the given filename. If ‘-’, then show the plot interactively. See also plotshow.
- plotshow : bool
  If plotfile is not given or is ‘-’, then show the plot interactively.
Returns:
- fig : matplotlib.figure.Figure
  Plot of the reblocking data.
pyblock.blocking implements the reblocking algorithm [1] and an algorithm [2], [3] for suggesting the most appropriate block size (and thus estimate of the standard error in the data set) for data contained within numpy arrays. pyblock.pd_utils provides a convenient wrapper around this using pandas, and it is highly recommended to use this wrapper if possible. pyblock.error contains functions for simple error propagation and for formatting a value and its associated error.
References¶
[1] “Error estimates on averages of correlated data”, H. Flyvbjerg and H.G. Petersen, J. Chem. Phys. 91, 461 (1989).
[2] “Monte Carlo errors with less errors”, U. Wolff, Comput. Phys. Commun. 156, 143 (2004) and arXiv:hep-lat/0306017.
[3] “Strategies for improving the efficiency of quantum Monte Carlo calculations”, R. M. Lee, G. J. Conduit, N. Nemec, P. Lopez Rios and N. D. Drummond, Phys. Rev. E 83, 066706 (2011).
pyblock tutorial¶
The estimate of the standard error of a set of data assumes that the data points are completely independent. If this is not true, then naively calculating the standard error of the entire data set can give a substantial underestimate of the true error. This arises in, for example, Monte Carlo simulations where the state at one step depends upon the state at the previous step. Data calculated from the stochastic state hence has serial correlations.
A simple way to remove these correlations is to repeatedly average neighbouring pairs of data points and calculate the standard error on the new data set. As no data is discarded in this process (assuming the data set contains \(2^n\) values), the error estimate should remain approximately constant if the data is truly independent.
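In numpy, one such pair-averaging step looks like this (a minimal illustration; pyblock performs this repeatedly and tracks the statistics at each level):

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 6.0, 4.0])
n = len(x)
# Discard the last point if the count is odd, then average neighbouring pairs.
blocked = (x[0:2 * (n // 2):2] + x[1:2 * (n // 2):2]) / 2.0
print(blocked)  # [2. 4.]
```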
pyblock is a python module for performing this reblocking analysis.
Normally, correlated data comes from an experiment or simulation, but here we’ll use randomly generated, serially correlated data in order to show how pyblock works.
import numpy
def corr_data(N, L):
    '''Generate random correlated data containing 2^N data points.

    Random data is convolved with a length-2^L kernel (scaled by 1/10) to give
    the correlated signal.'''
    return numpy.convolve(numpy.random.randn(2**N), numpy.ones(2**L)/10, 'same')
rand_data = corr_data(16, 6)
plot(rand_data);  # plot comes from matplotlib (e.g. via %pylab in IPython)

If we zoom in, we can clearly see that neighbouring data points do not immediately appear to be independent:
plot(rand_data[:1000]);
plot(rand_data[40000:41000]);

pyblock can perform a reblocking analysis to get a better estimate of the standard error of the data set:
import pyblock
reblock_data = pyblock.blocking.reblock(rand_data)
for reblock_iter in reblock_data:
print(reblock_iter)
BlockTuple(block=0, ndata=65536, mean=array(0.029729412388881667), cov=array(0.6337749708548472), std_err=array(0.0031097650382892594), std_err_err=array(8.589659075051008e-06))
BlockTuple(block=1, ndata=32768, mean=array(0.02972941238888195), cov=array(0.628821230364245), std_err=array(0.004380650753518188), std_err_err=array(1.711217811903889e-05))
BlockTuple(block=2, ndata=16384, mean=array(0.02972941238888187), cov=array(0.6213291012014514), std_err=array(0.006158158716248116), std_err_err=array(3.402038032828577e-05))
BlockTuple(block=3, ndata=8192, mean=array(0.0297294123888818), cov=array(0.6072256553888598), std_err=array(0.00860954270047692), std_err_err=array(6.726615807324491e-05))
BlockTuple(block=4, ndata=4096, mean=array(0.02972941238888184), cov=array(0.5804081640995564), std_err=array(0.0119038318174598), std_err_err=array(0.00013153606075677518))
BlockTuple(block=5, ndata=2048, mean=array(0.02972941238888185), cov=array(0.5242933503018304), std_err=array(0.01600008163891877), std_err_err=array(0.0002500623334367383))
BlockTuple(block=6, ndata=1024, mean=array(0.029729412388881837), cov=array(0.4126013837636545), std_err=array(0.02007314222616115), std_err_err=array(0.00044377470816715493))
BlockTuple(block=7, ndata=512, mean=array(0.029729412388881854), cov=array(0.255910704765131), std_err=array(0.02235677962597468), std_err_err=array(0.0006993326391359534))
BlockTuple(block=8, ndata=256, mean=array(0.029729412388881854), cov=array(0.15369260703074866), std_err=array(0.024502280428847067), std_err_err=array(0.0010849792138732355))
BlockTuple(block=9, ndata=128, mean=array(0.029729412388881844), cov=array(0.07649773732547502), std_err=array(0.02444664747680699), std_err_err=array(0.0015339190875488663))
BlockTuple(block=10, ndata=64, mean=array(0.02972941238888185), cov=array(0.0455635621966979), std_err=array(0.026682028770755133), std_err_err=array(0.002377024048671685))
BlockTuple(block=11, ndata=32, mean=array(0.029729412388881847), cov=array(0.025945495376626042), std_err=array(0.028474492629712717), std_err_err=array(0.003616264180239503))
BlockTuple(block=12, ndata=16, mean=array(0.02972941238888184), cov=array(0.012627881930728472), std_err=array(0.02809346224071589), std_err_err=array(0.0051291409958865745))
BlockTuple(block=13, ndata=8, mean=array(0.02972941238888184), cov=array(0.006785523206998811), std_err=array(0.029123708570078285), std_err_err=array(0.00778363852153464))
BlockTuple(block=14, ndata=4, mean=array(0.02972941238888184), cov=array(0.005573075663761713), std_err=array(0.037326517597285024), std_err_err=array(0.015238486998060912))
BlockTuple(block=15, ndata=2, mean=array(0.02972941238888184), cov=array(0.006933024981306826), std_err=array(0.05887709648626886), std_err_err=array(0.04163239418201536))
The standard error of the original data set is clearly around 8 times too small. Note that the standard error of the last few reblock iterations fluctuates substantially—this is simply because of the small number of data points at those iterations.
In addition to the mean and standard error at each iteration, the covariance and an estimate of the error in the standard error are also calculated. Each tuple also contains the number of data points used at the given reblock iteration.
pyblock.blocking can also suggest the reblock iteration at which the standard error has converged (i.e. the iteration at which the serial correlation has been removed and every data point is truly independent).
opt = pyblock.blocking.find_optimal_block(len(rand_data), reblock_data)
print(opt)
print(reblock_data[opt[0]])
[10]
BlockTuple(block=10, ndata=64, mean=array(0.02972941238888185), cov=array(0.0455635621966979), std_err=array(0.026682028770755133), std_err_err=array(0.002377024048671685))
Whilst the above uses just a single data set, pyblock is designed to work on multiple data sets at once (e.g. multiple outputs from the same simulation). In that case, different optimal reblock iterations might be found for each data set. The only assumption is that the original data sets are of the same length.
It is also possible to reblock weighted data sets. If the pyblock.blocking routine is supplied with an array of weights in addition to the data, the weighted variance and standard error of each data set are calculated.
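The weighted statistics can be sketched as follows. This is a sketch of the “effective sample size” approach for the Bessel correction that the API documentation cites ([Madansky]), not pyblock's exact implementation:

```python
import numpy as np

def weighted_stats(x, w):
    """Weighted mean and standard error, with the Bessel correction taken
    from the effective sample size n_eff = (sum w)**2 / sum(w**2).
    Illustrative sketch only."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    mean = np.sum(w * x) / np.sum(w)
    n_eff = np.sum(w) ** 2 / np.sum(w ** 2)
    var = np.sum(w * (x - mean) ** 2) / np.sum(w) * n_eff / (n_eff - 1)
    return mean, np.sqrt(var / n_eff)
```

With uniform weights, n_eff equals the number of points and this reduces to the ordinary sample mean and standard error.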
pandas integration¶
The core pyblock functionality is built upon numpy. However, it is more convenient to use the pandas-based wrapper around pyblock.blocking, not least because it makes working with multiple data sets more pleasant.
import pandas as pd
rand_data = pd.Series(rand_data)
rand_data.head()
0 -0.294901
1 -0.360847
2 -0.386010
3 -0.496183
4 -0.625507
dtype: float64
(data_length, reblock_data, covariance) = pyblock.pd_utils.reblock(rand_data)
# number of data points at each reblock iteration
data_length
reblock
0 65536
1 32768
2 16384
3 8192
4 4096
5 2048
6 1024
7 512
8 256
9 128
10 64
11 32
12 16
13 8
14 4
15 2
Name: data length, dtype: int64
# mean, standard error and estimate of the error in the standard error at each
# reblock iteration
# Note the suggested reblock iteration is already indicated.
# pyblock names the data series 'data' if no name is provided in the
# pandas.Series/pandas.DataFrame.
reblock_data
                                               data
             mean  standard error  standard error error  optimal block
reblock
0        0.029729        0.003110              0.000009
1        0.029729        0.004381              0.000017
2        0.029729        0.006158              0.000034
3        0.029729        0.008610              0.000067
4        0.029729        0.011904              0.000132
5        0.029729        0.016000              0.000250
6        0.029729        0.020073              0.000444
7        0.029729        0.022357              0.000699
8        0.029729        0.024502              0.001085
9        0.029729        0.024447              0.001534
10       0.029729        0.026682              0.002377           <---
11       0.029729        0.028474              0.003616
12       0.029729        0.028093              0.005129
13       0.029729        0.029124              0.007784
14       0.029729        0.037327              0.015238
15       0.029729        0.058877              0.041632

16 rows × 4 columns
# Covariance matrix is not so relevant for a single data set.
covariance
                  data
reblock
0        data  0.633775
1        data  0.628821
2        data  0.621329
3        data  0.607226
4        data  0.580408
5        data  0.524293
6        data  0.412601
7        data  0.255911
8        data  0.153693
9        data  0.076498
10       data  0.045564
11       data  0.025945
12       data  0.012628
13       data  0.006786
14       data  0.005573
15       data  0.006933

16 rows × 1 columns
We can also plot the convergence of the standard error estimate and obtain a summary of the suggested data to quote:
pyblock.plot.plot_reblocking(reblock_data);

The standard error clearly converges to ~0.022. The suggested reblock iteration (which uses a slightly conservative formula) is indicated by the arrow on the plot.
pyblock.pd_utils.reblock_summary(reblock_data)
            mean  standard error  standard error error
data  0.02972941      0.02668203           0.002377024

1 rows × 3 columns
pyblock.error also contains simple error propagation functions for combining multiple noisy data sets and can handle multiple data sets at once (contained either within a numpy array using pyblock.blocking or within a pandas.DataFrame).