PyMC User’s Guide¶
Contents:
Introduction¶
Date: 15 September 2015
Authors: Chris Fonnesbeck, Anand Patil, David Huard, John Salvatier
Contact: chris.fonnesbeck@vanderbilt.edu
Web site: http://github.com/pymc-devs/pymc
Copyright: This document has been placed in the public domain.
License: PyMC is released under the Academic Free License.
Version: 2.3.6
Purpose¶
PyMC is a python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo. Its flexibility and extensibility make it applicable to a large suite of problems. Along with core sampling functionality, PyMC includes methods for summarizing output, plotting, goodness-of-fit and convergence diagnostics.
Features¶
PyMC provides functionality to make Bayesian analysis as painless as possible. Here is a short list of some of its features:
- Fits Bayesian statistical models with Markov chain Monte Carlo and other algorithms.
- Includes a large suite of well-documented statistical distributions.
- Uses NumPy for numerics wherever possible.
- Includes a module for modeling Gaussian processes.
- Sampling loops can be paused and tuned manually, or saved and restarted later.
- Creates summaries including tables and plots.
- Traces can be saved to disk as plain text, Python pickles, SQLite or MySQL databases, or HDF5 archives.
- Several convergence diagnostics are available.
- Extensible: easily incorporates custom step methods and unusual probability distributions.
- MCMC loops can be embedded in larger programs, and results can be analyzed with the full power of Python.
What’s new in version 2¶
This second version of PyMC benefits from a major rewrite effort. It brings substantial improvements in code extensibility, user interface, and raw performance. Most notably, the PyMC 2 series provides:
- New flexible object model and syntax (not backward-compatible).
- Reduced redundant computations: only relevant log-probability terms are computed, and these are cached.
- Optimized probability distributions.
- New adaptive blocked Metropolis step method.
- New slice sampler method.
- Much more!
Usage¶
First, define your model in a file, say mymodel.py (with comments, of course!):
# Import relevant modules
import pymc
import numpy as np
# Some data
n = 5*np.ones(4,dtype=int)
x = np.array([-.86,-.3,-.05,.73])
# Priors on unknown parameters
alpha = pymc.Normal('alpha',mu=0,tau=.01)
beta = pymc.Normal('beta',mu=0,tau=.01)
# Arbitrary deterministic function of parameters
@pymc.deterministic
def theta(a=alpha, b=beta):
    """theta = logit^{-1}(a+b*x)"""
    return pymc.invlogit(a + b * x)
# Binomial likelihood for data
d = pymc.Binomial('d', n=n, p=theta, value=np.array([0., 1., 3., 5.]),
                  observed=True)
Save this file, then from a python shell (or another file in the same directory), call:
import pymc
import mymodel
S = pymc.MCMC(mymodel, db='pickle')
S.sample(iter=10000, burn=5000, thin=2)
pymc.Matplot.plot(S)
This example will run 10,000 iterations of the MCMC sampler, discarding the first 5,000 as a burn-in period and retaining every second sample thereafter, for a total of 2,500 stored posterior samples. The samples are stored in a Python serialization (pickle) database.
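Once the run completes, the results can be examined directly. A minimal sketch (continuing the session above; trace slicing and the stats() summary method are described in the tutorial below):
# Retained posterior samples for alpha, as a NumPy array
alpha_samples = S.trace('alpha')[:]
print(alpha_samples.mean())
# Tabular posterior summaries for all variables in the model
S.stats()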
History¶
PyMC began development in 2003, as an effort to generalize the process of building Metropolis-Hastings samplers, with an aim to making Markov chain Monte Carlo (MCMC) more accessible to non-statisticians (particularly ecologists). The choice to develop PyMC as a Python module, rather than a standalone application, allowed the use of MCMC methods in a larger modeling framework. By 2005, PyMC was reliable enough for version 1.0 to be released to the public. A small group of regular users, most associated with the University of Georgia, provided much of the feedback necessary for the refinement of PyMC to a usable state.
In 2006, David Huard and Anand Patil joined Chris Fonnesbeck on the development team for PyMC 2.0. This iteration of the software strives for more flexibility, better performance and a better end-user experience than any previous version of PyMC.
PyMC 2.2 was released in April 2012. It contains numerous bugfixes and optimizations, as well as a few new features, including improved output plotting, CSV table output, improved imputation syntax, and posterior predictive check plots. PyMC 2.3 was released on October 31, 2013. It included Python 3 compatibility, the addition of the half-Cauchy distribution, improved summary plots, and some important bug fixes.
This user guide has been updated for version 2.3.
Relationship to other packages¶
PyMC is one of many general-purpose MCMC packages. The most prominent among them is WinBUGS, which has made MCMC (and with it, Bayesian statistics) accessible to a huge user community. Unlike PyMC, WinBUGS is a stand-alone, self-contained application. This can be an attractive feature for users without much programming experience, but others may find it constraining. A related package is JAGS, which provides a more UNIX-like implementation of the BUGS language. Other packages include Hierarchical Bayes Compiler and, more recently, STAN.
It would be difficult to meaningfully benchmark PyMC against these other packages because of the unlimited variety in Bayesian probability models and flavors of the MCMC algorithm. However, it is possible to anticipate how it will perform in broad terms.
PyMC’s number-crunching is done using a combination of industry-standard libraries (NumPy and the linear algebra libraries on which it depends) and hand-optimized Fortran routines. For models that are composed of variables valued as large arrays, PyMC will spend most of its time in these fast routines. In that case, it will be roughly as fast as packages written entirely in C and faster than WinBUGS. For finer-grained models containing mostly scalar variables, it will spend most of its time in coordinating Python code. In that case, despite our best efforts at optimization, PyMC will be significantly slower than packages written in C and on par with or slower than WinBUGS. However, as fine-grained models are often small and simple, the total time required for sampling is often quite reasonable despite this poorer performance.
We have chosen to spend time developing PyMC rather than using an existing package primarily because it allows us to build and efficiently fit any model we like within a productive Python environment. We have emphasized extensibility throughout PyMC's design, so if it doesn't meet your needs out of the box, chances are you can make it do so with a relatively small amount of code. See the testimonials page on the wiki for reasons why other users have chosen PyMC.
Getting started¶
This guide provides all the information needed to install PyMC, code a Bayesian statistical model, run the sampler, save and visualize the results. In addition, it contains a list of the statistical distributions currently available. More examples and tutorials are available from the PyMC web site.
Installation¶
Date: 1 July 2014
Authors: Chris Fonnesbeck, Anand Patil, David Huard, John Salvatier
Contact: chris.fonnesbeck@vanderbilt.edu
Web site: http://github.com/pymc-devs/pymc
Copyright: This document has been placed in the public domain.
License: PyMC is released under the Academic Free License.
Version: 2.3.6
PyMC is known to run on Mac OS X, Linux and Windows, but in theory it should work on just about any platform for which Python, a Fortran compiler and the NumPy module are available. However, installing some extra dependencies can greatly improve PyMC's performance and versatility. The following describes the required and optional dependencies and takes you through the installation process.
Dependencies¶
PyMC requires some prerequisite packages to be present on the system. Fortunately, there are currently only a few hard dependencies, and all are freely available online.
- Python version 2.6 or later.
- NumPy (1.6 or newer): The fundamental scientific programming package, it provides a multidimensional array type and many useful functions for numerical analysis.
- Matplotlib (1.0 or newer): 2D plotting library which produces publication-quality figures in a variety of image formats and interactive environments.
- SciPy (optional): Library of algorithms for mathematics, science and engineering.
- pyTables (optional): Package for managing hierarchical datasets, designed to cope efficiently and easily with extremely large amounts of data. Requires the HDF5 library.
- pydot (optional): Python interface to Graphviz’s Dot language, it allows PyMC to create both directed and non-directed graphical representations of models. Requires the Graphviz library.
- IPython (optional): An enhanced interactive Python shell and an architecture for interactive parallel computing.
- nose (optional): A test discovery-based unittest extension (required to run the test suite).
There are prebuilt distributions that include all required dependencies. For Mac OS X and Windows users, we recommend the Anaconda Python distribution. Anaconda comes bundled with most of these prerequisites. Note that depending on the currency of these distributions, some packages may need to be updated manually.
If, instead of installing the prebuilt binaries, you prefer (or have) to build pymc yourself, make sure you have a Fortran and a C compiler. There are free compilers (gfortran, gcc) available on all platforms. Other compilers have not been tested with PyMC but may work nonetheless.
Installation using EasyInstall¶
The easiest way to install PyMC is to type in a terminal:
easy_install pymc
Provided EasyInstall (part of the setuptools module) is installed and in your path, this should fetch and install the package from the Python Package Index. Make sure you have the appropriate administrative privileges to install software on your computer.
Installing from pre-built binaries¶
Pre-built binaries are available for Windows XP and Mac OS X. These can be installed as follows:
1. Download the installer for your platform from PyPI or the GitHub download page.
2. Double-click the executable installation package, then follow the on-screen instructions.
For other platforms, you will need to build the package yourself from source. Fortunately, this should be relatively straightforward.
Anaconda¶
If you are running the Anaconda Python distribution you can install a PyMC binary from the Binstar package management service, using the conda utility:
conda install -c https://conda.binstar.org/pymc pymc
Compiling the source code¶
First download the source code from GitHub and unpack it. Then move into the unpacked directory and follow the platform specific instructions.
Windows¶
One way to compile PyMC on Windows is to install MinGW and MSYS. MinGW is the GNU Compiler Collection (GCC) augmented with Windows specific headers and libraries. MSYS is a POSIX-like console (bash) with UNIX command line tools. Download the Automated MinGW Installer and double-click on it to launch the installation process. You will be asked to select which components are to be installed: make sure the g77 compiler is selected and proceed with the instructions. Then download and install MSYS-1.0.exe, launch it and again follow the on-screen instructions.
Once this is done, launch the MSYS console, change into the PyMC directory and type:
python setup.py install
This will build the C and Fortran extensions and copy the libraries and Python modules into the site-packages directory of your Python distribution.
Some Windows users have reported problems building PyMC with MinGW, particularly under Enthought Python. An alternative approach in this case is to use the gcc and gfortran compilers that are bundled with EPD (located in the Scripts directory). In order to do this, you should add the EPD “Scripts” directory to your PATH environment variable (ensuring that it appears ahead of the MinGW binary directory, if it exists on your PATH). Then build PyMC using the install command above.
Alternatively, one may build the currently-available release of PyMC using pip.
Mac OS X or Linux¶
In a terminal, type:
python setup.py config_fc --fcompiler gfortran build
python setup.py install
The above syntax also assumes that you have gfortran installed and available.
The sudo command may be required to install PyMC into the Python site-packages directory if it has restricted privileges. In addition, the python-dev package may be required to install PyMC on Linux systems.
Installing from GitHub¶
You can check out PyMC from the GitHub repository:
git clone git://github.com/pymc-devs/pymc.git
Previous versions are available in the /tags directory.
Running the test suite¶
pymc comes with a set of tests that verify that the critical components of the code work as expected. To run these tests, users must have nose installed. The tests are launched from a Python shell:
import pymc
pymc.test()
If any tests fail, messages detailing the nature of the failures will appear. Should this happen (it shouldn't), please report the problems on the issue tracker (the Issues tab on the GitHub page), specifying the version you are using and your environment.
Bugs and feature requests¶
Report problems with the installation, bugs in the code or feature requests at the issue tracker. Comments and questions are welcome and should be addressed to PyMC's mailing list.
Tutorial¶
This tutorial will guide you through a typical PyMC application. Familiarity with Python is assumed, so if you are new to Python, books such as [Lutz2007] or [Langtangen2009] are the place to start. Plenty of online documentation can also be found on the Python documentation page.
An example statistical model¶
Consider the following dataset, which is a time series of recorded coal mining disasters in the UK from 1851 to 1962 [Jarrett1979].
Occurrences of disasters in the time series are thought to follow a Poisson process with a large rate parameter in the early part of the time series, and one with a smaller rate in the later part. We are interested in locating the change point in the series, which perhaps is related to changes in mining safety regulations.
We represent our conceptual model formally as a statistical model:
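\[
\begin{aligned}
(D_t \mid s, e, l) &\sim \text{Poisson}(r_t), \quad t \in [t_l, t_h] \\
r_t &= \begin{cases} e & t \le s \\ l & t > s \end{cases} \\
s &\sim \text{DiscreteUniform}(t_l, t_h) \\
e &\sim \text{Exponential}(r_e) \\
l &\sim \text{Exponential}(r_l)
\end{aligned}
\]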
The symbols are defined as:
- \(D_t\): The number of disasters in year \(t\).
- \(r_t\): The rate parameter of the Poisson distribution of disasters in year \(t\).
- \(s\): The year in which the rate parameter changes (the switchpoint).
- \(e\): The rate parameter before the switchpoint \(s\).
- \(l\): The rate parameter after the switchpoint \(s\).
- \(t_l\), \(t_h\): The lower and upper boundaries of year \(t\).
- \(r_e\), \(r_l\): The rate parameters of the priors of the early and late rates, respectively.
Because we have defined \(D\) by its dependence on \(s\), \(e\) and \(l\), the latter three are known as the “parents” of \(D\) and \(D\) is called their “child”. Similarly, the parents of \(s\) are \(t_l\) and \(t_h\), and \(s\) is the child of \(t_l\) and \(t_h\).
Two types of variables¶
At the model-specification stage (before the data are observed), \(D\), \(s\), \(e\), \(r\) and \(l\) are all random variables. Bayesian "random" variables have not necessarily arisen from a physical random process. The Bayesian interpretation of probability is epistemic, meaning random variable \(x\)'s probability distribution \(p(x)\) represents our knowledge and uncertainty about \(x\)'s value [Jaynes2003]. Candidate values of \(x\) for which \(p(x)\) is high are relatively more probable, given what we know. Random variables are represented in PyMC by the classes Stochastic and Deterministic.
The only Deterministic in the model is \(r\). If we knew the values of \(r\)'s parents (\(s\), \(l\) and \(e\)), we could compute the value of \(r\) exactly. A Deterministic like \(r\) is defined by a mathematical function that returns its value given values for its parents. Deterministic variables are sometimes called the systemic part of the model. The nomenclature is a bit confusing, because these objects usually represent random variables; since the parents of \(r\) are random, \(r\) is random also. A more descriptive (though more awkward) name for this class would be DeterminedByValuesOfParents.
On the other hand, even if the values of the parents of variables switchpoint, disasters (before observing the data), early_mean or late_mean were known, we would still be uncertain of their values. These variables are characterized by probability distributions that express how plausible their candidate values are, given values for their parents. The Stochastic class represents these variables. A more descriptive name for these objects might be RandomEvenGivenValuesOfParents.
We can represent model disaster_model in a file called disaster_model.py (the actual file can be found in pymc/examples/) as follows. First, we import the PyMC and NumPy namespaces:
from pymc import DiscreteUniform, Exponential, deterministic, Poisson, Uniform
import numpy as np
Notice that from pymc we have only imported a select few objects that are needed for this particular model, whereas the entire numpy namespace has been imported, and conveniently given a shorter name. Objects from NumPy are subsequently accessed by prefixing np. to the name. Either approach is acceptable.
Next, we enter the actual data values into an array:
disasters_array = \
np.array([ 4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
2, 2, 3, 4, 2, 1, 3, 2, 2, 1, 1, 1, 1, 3, 0, 0,
1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
3, 3, 1, 1, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
Note that you don't have to type in this entire array to follow along; the code is available in the source tree, in this example script. Next, we create the switchpoint variable switchpoint:
switchpoint = DiscreteUniform('switchpoint', lower=0, upper=110, doc='Switchpoint[year]')
DiscreteUniform is a subclass of Stochastic that represents uniformly-distributed discrete variables. Use of this distribution suggests that we have no preference a priori regarding the location of the switchpoint; all values are equally likely. Now we create the exponentially-distributed variables early_mean and late_mean for the early and late Poisson rates, respectively:
early_mean = Exponential('early_mean', beta=1.)
late_mean = Exponential('late_mean', beta=1.)
Next, we define the variable rate, which selects the early rate early_mean for times before switchpoint and the late rate late_mean for times after switchpoint. We create rate using the deterministic decorator, which converts the ordinary Python function rate into a Deterministic object:
@deterministic(plot=False)
def rate(s=switchpoint, e=early_mean, l=late_mean):
    ''' Concatenate Poisson means '''
    out = np.empty(len(disasters_array))
    out[:s] = e
    out[s:] = l
    return out
The last step is to define the number of disasters disasters. This is a stochastic variable, but unlike switchpoint, early_mean and late_mean we have observed its value. To express this, we set the argument observed to True (it is set to False by default). This tells PyMC that this object's value should not be changed:
disasters = Poisson('disasters', mu=rate, value=disasters_array, observed=True)
Why are data and unknown variables represented by the same object?¶
Since it is represented by a Stochastic object, disasters is defined by its dependence on its parent rate even though its value is fixed. This isn't just a quirk of PyMC's syntax; Bayesian hierarchical notation itself makes no distinction between random variables and data. The reason is simple: to use Bayes' theorem to compute the posterior \(p(e,s,l \mid D)\) of model disaster_model, we require the likelihood \(p(D \mid e,s,l)\). Even though disasters's value is known and fixed, we need to formally assign it a probability distribution as if it were a random variable. Remember, the likelihood and the probability function are essentially the same, except that the former is regarded as a function of the parameters and the latter as a function of the data.
This point can be counterintuitive at first, as many people's instinct is to
regard data as fixed a priori and unknown variables as dependent on the data.
One way to understand this is to think of statistical models like
disaster_model
as predictive models for data, or as models of the
processes that gave rise to data. Before observing the value of disasters, we
could have sampled from its prior predictive distribution \(p(D)\) (i.e.
the marginal distribution of the data) as follows:
- Sample early_mean, switchpoint and late_mean from their priors.
- Sample disasters conditional on these values.
Even after we observe the value of disasters, we need to use this process model to make inferences about early_mean, switchpoint and late_mean, because it's the only information we have about how the variables are related.
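As an illustration, here is a minimal sketch of that two-step process in PyMC (the name disasters_sim is hypothetical; each call to random() draws a new value for a variable given its parents' current values):
from pymc import Poisson
from pymc.examples import disaster_model

# Step 1: draw new parameter values from their priors
disaster_model.switchpoint.random()
disaster_model.early_mean.random()
disaster_model.late_mean.random()

# Step 2: simulate a disaster series conditional on those draws, via an
# unobserved Poisson variable that shares the rate parent
disasters_sim = Poisson('disasters_sim', mu=disaster_model.rate)
print(disasters_sim.value)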
Parents and children¶
We have above created a PyMC probability model, which is simply a linked collection of variables. To see the nature of the links, import or run disaster_model.py and examine switchpoint's parents attribute from the Python prompt:
>>> from pymc.examples import disaster_model
>>> disaster_model.switchpoint.parents
{'lower': 0, 'upper': 110}
The parents dictionary shows us the distributional parameters of switchpoint, which are constants. Now let's examine disasters's parents:
>>> disaster_model.disasters.parents
{'mu': <pymc.PyMCObjects.Deterministic 'rate' at 0x10623da50>}
We are using rate as a distributional parameter of disasters (i.e. rate is disasters's parent). disasters internally labels rate as mu, meaning rate plays the role of the rate parameter in disasters's Poisson distribution. Now examine rate's children attribute:
>>> disaster_model.rate.children
set([<pymc.distributions.Poisson 'disasters' at 0x10623da90>])
Because disasters considers rate its parent, rate considers disasters its child. Unlike parents, children is a set (an unordered collection of objects); variables do not associate their children with any particular distributional role. Try examining the parents and children attributes of the other parameters in the model.
The following directed acyclic graph is a visualization of the parent-child relationships in the model. Unobserved stochastic variables switchpoint, early_mean and late_mean are open ellipses, observed stochastic variable disasters is a filled ellipse and deterministic variable rate is a triangle. Arrows point from parent to child and display the label that the child assigns to the parent. See section Graphing models for more details.
As the examples above have shown, PyMC objects need to have a name assigned, such as switchpoint, early_mean or late_mean. These names are used for storage and post-processing:
- as keys in on-disk databases,
- as node labels in model graphs,
- as axis labels in plots of traces,
- as table labels in summary statistics.
A model instantiated with variables having identical names raises an error to avoid name conflicts in the database storing the traces. In general, however, PyMC uses references to the objects themselves, not their names, to identify variables.
Variables’ values and log-probabilities¶
All PyMC variables have an attribute called value that stores the current value of that variable. Try examining disasters's value, and you'll see the initial value we provided for it:
>>> disaster_model.disasters.value
array([4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6, 3, 3, 5, 4, 5, 3, 1,
4, 4, 1, 5, 5, 3, 4, 2, 5, 2, 2, 3, 4, 2, 1, 3, 2, 2, 1, 1, 1, 1, 3,
0, 0, 1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1, 0, 1, 0, 1, 0,
0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2, 3, 3, 1, 1, 2, 1, 1, 1, 1, 2, 4, 2,
0, 0, 1, 4, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
If you check the values of early_mean, switchpoint and late_mean, you'll see random initial values generated by PyMC:
>>> disaster_model.switchpoint.value
44
>>> disaster_model.early_mean.value
0.33464706250079584
>>> disaster_model.late_mean.value
2.6491936762267811
Of course, since these are Stochastic elements, your values will be different from these. If you check rate's value, you'll see an array whose first switchpoint elements are early_mean (here 0.33464706), and whose remaining elements are late_mean (here 2.64919368):
>>> disaster_model.rate.value
array([ 0.33464706, 0.33464706, 0.33464706, 0.33464706, 0.33464706,
0.33464706, 0.33464706, 0.33464706, 0.33464706, 0.33464706,
0.33464706, 0.33464706, 0.33464706, 0.33464706, 0.33464706,
0.33464706, 0.33464706, 0.33464706, 0.33464706, 0.33464706,
0.33464706, 0.33464706, 0.33464706, 0.33464706, 0.33464706,
0.33464706, 0.33464706, 0.33464706, 0.33464706, 0.33464706,
0.33464706, 0.33464706, 0.33464706, 0.33464706, 0.33464706,
0.33464706, 0.33464706, 0.33464706, 0.33464706, 0.33464706,
0.33464706, 0.33464706, 0.33464706, 0.33464706, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368,
2.64919368, 2.64919368, 2.64919368, 2.64919368, 2.64919368])
To compute its value, rate calls the function we used to create it, passing in the values of its parents.
Stochastic objects can evaluate their probability mass or density functions at their current values given the values of their parents. The logarithm of a stochastic object's probability mass or density can be accessed via the logp attribute. For vector-valued variables like disasters, the logp attribute returns the sum of the logarithms of the joint probability or density of all elements of the value. Try examining switchpoint's and disasters's log-probabilities and early_mean's and late_mean's log-densities:
>>> disaster_model.switchpoint.logp
-4.7095302013123339
>>> disaster_model.disasters.logp
-1080.5149888046033
>>> disaster_model.early_mean.logp
-0.33464706250079584
>>> disaster_model.late_mean.logp
-2.6491936762267811
Stochastic objects need to call an internal function to compute their logp attributes, as rate needed to call an internal function to compute its value. Just as we created rate by decorating a function that computes its value, it's possible to create custom Stochastic objects by decorating functions that compute their log-probabilities or densities (see chapter Building models). Users are thus not limited to the set of statistical distributions provided by PyMC.
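For instance, a custom distribution might look something like this minimal sketch (the name tri and the triangular density are invented for illustration; the decorator interface is covered in chapter Building models):
import numpy as np
import pymc

@pymc.stochastic
def tri(value=1.0):
    """A triangular log-density on (0, 2), peaked at 1."""
    if 0 < value < 2:
        return np.log(1.0 - abs(value - 1.0))
    return -np.inf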
Using Variables as parents of other Variables¶
Let's take a closer look at our definition of rate:
@deterministic(plot=False)
def rate(s=switchpoint, e=early_mean, l=late_mean):
    ''' Concatenate Poisson means '''
    out = np.empty(len(disasters_array))
    out[:s] = e
    out[s:] = l
    return out
The arguments switchpoint, early_mean and late_mean are Stochastic objects, not numbers. If that is so, why aren't errors raised when we attempt to slice array out up to a Stochastic object?
Whenever a variable is used as a parent for a child variable, PyMC replaces it with its value attribute when the child's value or log-probability is computed. When rate's value is recomputed, switchpoint.value is passed to the function as argument s. To see the values of the parents of rate all together, look at rate.parents.value.
Fitting the model with MCMC¶
PyMC provides several objects that fit probability models (linked collections of variables) like ours. The primary such object, MCMC, fits models with a Markov chain Monte Carlo algorithm [Gamerman1997]. To create an MCMC object to handle our model, import disaster_model.py and use it as an argument for MCMC:
>>> from pymc.examples import disaster_model
>>> from pymc import MCMC
>>> M = MCMC(disaster_model)
In this case M will expose variables switchpoint, early_mean, late_mean and disasters as attributes; that is, M.switchpoint will be the same object as disaster_model.switchpoint.
To run the sampler, call the MCMC object's sample() (or isample(), for interactive sampling) method with arguments for the number of iterations, burn-in length, and thinning interval (if desired):
>>> M.sample(iter=10000, burn=1000, thin=10)
After a few seconds, you should see that sampling has finished normally. The model has been fitted.
What does it mean to fit a model?¶
Fitting a model means characterizing its posterior distribution somehow. In this case, we are trying to represent the posterior \(p(s,e,l|D)\) by a set of joint samples from it. To produce these samples, the MCMC sampler randomly updates the values of switchpoint, early_mean and late_mean according to the Metropolis-Hastings algorithm [Gelman2004] over a specified number of iterations (iter).
As the number of samples grows sufficiently large, the MCMC distributions of switchpoint, early_mean and late_mean converge to their joint stationary distribution. In other words, their values can be considered as random draws from the posterior \(p(s,e,l|D)\). PyMC assumes that the burn parameter specifies a sufficiently large number of iterations for the algorithm to converge, so it is up to the user to verify that this is the case (see chapter Model checking and diagnostics). Consecutive values sampled from switchpoint, early_mean and late_mean are always serially dependent, since the sampler is a Markov chain. MCMC often results in strong autocorrelation among samples that can result in imprecise posterior inference. To circumvent this, it is useful to thin the sample by only retaining every \(k\)th sample, where \(k\) is an integer. This thinning interval is passed to the sampler via the thin argument.
If you are not sure ahead of time what values to choose for the burn and thin parameters, you may want to retain all the MCMC samples, that is, set burn=0 and thin=1, and then discard the burn-in period and thin the samples after examining the traces (the series of samples). See [Gelman2004] for general guidance.
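A minimal sketch of that post-hoc approach, continuing with the M object above:
M.sample(iter=10000, burn=0, thin=1)   # retain every sample
trace = M.trace('early_mean')[:]
posterior = trace[5000::2]             # then discard 5000 burn-in samples and thin by 2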
Accessing the samples¶
The output of the MCMC algorithm is a trace, the sequence of retained samples for each variable in the model. These traces can be accessed using the trace(name, chain=-1) method. For example:
>>> M.trace('switchpoint')[:]
array([41, 40, 40, ..., 43, 44, 44])
The trace slice [start:stop:step] works just like the NumPy array slice. By default, the returned trace array contains the samples from the last call to sample, that is, chain=-1, but the trace from previous sampling runs can be retrieved by specifying the corresponding chain index. To return the trace from all chains, simply use chain=None. [2]
Sampling output¶
You can examine the marginal posterior of any variable by plotting a histogram of its trace:
>>> from pylab import hist, show
>>> hist(M.trace('late_mean')[:])
(array([ 8, 52, 565, 1624, 2563, 2105, 1292, 488, 258, 45]),
array([ 0.52721865, 0.60788251, 0.68854637, 0.76921023, 0.84987409,
0.93053795, 1.01120181, 1.09186567, 1.17252953, 1.25319339]),
<a list of 10 Patch objects>)
>>> show()
You should see something like this:
PyMC has its own plotting functionality, via the optional matplotlib module, as noted in the installation notes. The Matplot module includes a plot function that takes the model (or a single parameter) as an argument:
>>> from pymc.Matplot import plot
>>> plot(M)
For each variable in the model, plot generates a composite figure, such as this one for the switchpoint in the disasters model:
The upper left-hand pane of this figure shows the temporal series of the samples from switchpoint, while below is an autocorrelation plot of the samples. The right-hand pane shows a histogram of the trace. The trace is useful for evaluating and diagnosing the algorithm's performance (see [Gelman1996]), while the histogram is useful for visualizing the posterior.
For a non-graphical summary of the posterior, simply call M.stats()
.
Imputation of Missing Data¶
As with most textbook examples, the models we have examined so far assume that the associated data are complete. That is, there are no missing values corresponding to any observations in the dataset. However, many real-world datasets have missing observations, usually due to some logistical problem during the data collection process. The easiest way of dealing with observations that contain missing values is simply to exclude them from the analysis. However, this results in loss of information if an excluded observation contains valid values for other quantities, and can bias results. An alternative is to impute the missing values, based on information in the rest of the model.
For example, consider a survey dataset for some wildlife species:
Count | Site | Observer | Temperature
---|---|---|---
15 | 1 | 1 | 15
10 | 1 | 2 | NA
6 | 1 | 1 | 11
Each row contains the number of individuals seen during the survey, along with three covariates: the site on which the survey was conducted, the observer that collected the data, and the temperature during the survey. If we are interested in modelling, say, population size as a function of the count and the associated covariates, it is difficult to accommodate the second observation because the temperature is missing (perhaps the thermometer was broken that day). Ignoring this observation will allow us to fit the model, but it wastes information that is contained in the other covariates.
In a Bayesian modelling framework, missing data are accommodated simply by treating them as unknown model parameters. Values for the missing data \(\tilde{y}\) are estimated naturally, using the posterior predictive distribution:
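\[p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) f(\theta \mid y) \, d\theta,\]
where \(f(\theta \mid y)\) is the posterior distribution of the model parameters \(\theta\).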
This describes additional data \(\tilde{y}\), which may either be considered unobserved data or potential future observations. We can use the posterior predictive distribution to model the likely values of missing data.
Consider the coal mining disasters data introduced previously. Assume that two years of data are missing from the time series; we indicate this in the data array by the use of an arbitrary placeholder value, None:
x = np.array([ 4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
2, 2, 3, 4, 2, 1, 3, None, 2, 1, 1, 1, 1, 3, 0, 0,
1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
3, 3, 1, None, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
To estimate these values in PyMC, we generate a masked array. These are specialised NumPy arrays that contain a matching True or False value for each element to indicate if that value should be excluded from any computation. Masked arrays can be generated using NumPy's ma.masked_equal function:
>>> masked_values = np.ma.masked_equal(x, value=None)
>>> masked_values
masked_array(data = [4 5 4 0 1 4 3 4 0 6 3 3 4 0 2 6 3 3 5 4 5 3 1 4 4 1 5 5 3
4 2 5 2 2 3 4 2 1 3 -- 2 1 1 1 1 3 0 0 1 0 1 1 0 0 3 1 0 3 2 2 0 1 1 1 0 1 0
1 0 0 0 2 1 0 0 0 1 1 0 2 3 3 1 -- 2 1 1 1 1 2 4 2 0 0 1 4 0 0 0 1 0 0 0 0 0 1
0 0 1 0 1],
mask = [False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False True False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False True
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False],
fill_value=?)
This masked array, in turn, can then be passed to one of PyMC’s data stochastic variables, which recognizes the masked array and replaces the missing values with Stochastic variables of the desired type. For the coal mining disasters problem, recall that disaster events were modeled as Poisson variates:
>>> from pymc import Poisson
>>> disasters = Poisson('disasters', mu=rate, value=masked_values, observed=True)
Here rate is an array of means for each year of data, allocated according to the location of the switchpoint. Each element in disasters is a Poisson Stochastic, irrespective of whether the observation was missing or not. The difference is that actual observations are data Stochastics (observed=True), while the missing values are non-data Stochastics. The latter are considered unknown, rather than fixed, and therefore estimated by the MCMC algorithm, just like unknown model parameters.
The entire model looks very similar to the original model:
# Switchpoint
switch = DiscreteUniform('switch', lower=0, upper=110)

# Early mean
early_mean = Exponential('early_mean', beta=1)

# Late mean
late_mean = Exponential('late_mean', beta=1)

@deterministic(plot=False)
def rate(s=switch, e=early_mean, l=late_mean):
    """Allocate appropriate mean to time series"""
    out = np.empty(len(disasters_array))
    # Early mean prior to switchpoint
    out[:s] = e
    # Late mean following switchpoint
    out[s:] = l
    return out

# The inefficient way, using the Impute function:
# D = Impute('D', Poisson, disasters_array, mu=r)
#
# The efficient way, using masked arrays:
# Generate masked array. Where the mask is true,
# the value is taken as missing.
masked_values = np.ma.masked_array(disasters_array, mask=disasters_array==-999)

# Pass masked array to data stochastic, and it does the right thing
disasters = Poisson('disasters', mu=rate, value=masked_values, observed=True)
Here, we have used the masked_array function, rather than masked_equal, and the value -999 as a placeholder for missing data. The result is the same.
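Fitting then proceeds exactly as before; a minimal sketch (assuming pymc has been imported; MCMC also accepts a list of nodes rather than a module):
M = pymc.MCMC([switch, early_mean, late_mean, rate, disasters])
M.sample(iter=10000, burn=5000)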
Fine-tuning the MCMC algorithm¶
MCMC objects handle individual variables via step methods, which determine how parameters are updated at each step of the MCMC algorithm. By default, step methods are automatically assigned to variables by PyMC. To see which step methods \(M\) is using, look at its step_method_dict attribute with respect to each parameter:
>>> M.step_method_dict[disaster_model.switchpoint]
[<pymc.StepMethods.DiscreteMetropolis object at 0x3e8cb50>]
>>> M.step_method_dict[disaster_model.early_mean]
[<pymc.StepMethods.Metropolis object at 0x3e8cbb0>]
>>> M.step_method_dict[disaster_model.late_mean]
[<pymc.StepMethods.Metropolis object at 0x3e8ccb0>]
The value of step_method_dict corresponding to a particular variable is a list of the step methods \(M\) is using to handle that variable.
You can force \(M\) to use a particular step method by calling M.use_step_method before telling it to sample. The following call will cause \(M\) to handle late_mean with a standard Metropolis step method, but with proposal standard deviation equal to \(2\):
>>> from pymc import Metropolis
>>> M.use_step_method(Metropolis, disaster_model.late_mean, proposal_sd=2.)
Another step method class, AdaptiveMetropolis, is better at handling highly-correlated variables. If your model mixes poorly, using AdaptiveMetropolis is a sensible first thing to try.
Beyond the basics¶
That was a brief introduction to basic PyMC usage. Many more topics are covered in the subsequent sections, including:
- Class Potential, another building block for probability models in addition to Stochastic and Deterministic
- Normal approximations
- Using custom probability distributions
- Object architecture
- Saving traces to the disk, or streaming them to the disk during sampling
- Writing your own step methods and fitting algorithms.
Also, be sure to check out the documentation for the Gaussian process extension, which is available on PyMC's download page.
[2] Note that the unknown variables switchpoint, early_mean, late_mean and rate will all accrue samples, but disasters will not because its value has been observed and is not updated. Hence disasters has no trace and calling M.trace('disasters')[:] will raise an error.
Building models¶
Bayesian inference begins with specification of a probability model relating unknown variables to data. PyMC provides three basic building blocks for Bayesian probability models: Stochastic, Deterministic and Potential.
A Stochastic object represents a variable whose value is not completely determined by its parents, and a Deterministic object represents a variable that is entirely determined by its parents. In object-oriented programming parlance, Stochastic and Deterministic are subclasses of the Variable class, which only serves as a template for other classes and is never actually implemented in models.
The third basic class, Potential, represents 'factor potentials' ([Lauritzen1990], [Jordan2004]), which are not variables but simply log-likelihood terms and/or constraints that are multiplied into joint distributions to modify them. Potential and Variable are subclasses of Node.
PyMC probability models are simply linked groups of Stochastic, Deterministic and Potential objects. These objects have very limited awareness of the models in which they are embedded and do not themselves possess methods for updating their values in fitting algorithms. Objects responsible for fitting probability models are described in chapter Fitting Models.
The Stochastic class¶
A stochastic variable has the following primary attributes:
- value: The variable's current value.
- logp: The log-probability of the variable's current value given the values of its parents.
A stochastic variable can optionally be endowed with a method called random, which draws a value for the variable given the values of its parents [1]. Stochastic objects have the following additional attributes:
- parents: A dictionary containing the variable's parents. The keys of the dictionary correspond to the names assigned to the variable's parents by the variable, and the values correspond to the actual parents. For example, the keys of \(s\)'s parents dictionary in model (disaster_model) would be 't_l' and 't_h'. Thanks to Python's dynamic typing, the actual parents (i.e. the values of the dictionary) may be of any class or type.
- children: A set containing the variable's children.
- extended_parents: A set containing all the stochastic variables on which the variable depends either directly or via a sequence of deterministic variables. If the value of any of these variables changes, the variable will need to recompute its log-probability.
- extended_children: A set containing all the stochastic variables and potentials that depend on the variable either directly or via a sequence of deterministic variables. If the variable's value changes, all of these variables will need to recompute their log-probabilities.
- observed: A flag (boolean) indicating whether the variable's value has been observed (is fixed).
- dtype: A NumPy dtype object (such as numpy.int) that specifies the type of the variable's value to fitting methods. If this is None (default) then no type is enforced.
Creation of stochastic variables¶
There are three main ways to create stochastic variables, called the automatic, decorator, and direct interfaces.
- Automatic
Stochastic variables with standard distributions provided by PyMC (see chapter Probability distributions) can be created in a single line using special subclasses of Stochastic. For example, the uniformly-distributed discrete variable switchpoint in (disaster_model) is created using the automatic interface as follows:
switchpoint = DiscreteUniform('switchpoint', lower=0, upper=110, doc='Switchpoint[year]')
In addition to the classes in chapter Probability distributions, scipy.stats.distributions' random variable classes are wrapped as Stochastic subclasses if SciPy is installed. These distributions are in the submodule pymc.SciPyDistributions.
Users can call the class factory stochastic_from_dist to produce Stochastic subclasses of their own from probability distributions not included with PyMC.
- Decorator
Uniformly-distributed discrete stochastic variable switchpoint in (disaster_model) could alternatively be created from a function that computes its log-probability as follows:
@pymc.stochastic(dtype=int)
def switchpoint(value=1900, t_l=1851, t_h=1962):
    """The switchpoint for the rate of disaster occurrence."""
    if value > t_h or value < t_l:
        # Invalid values
        return -np.inf
    else:
        # Uniform log-likelihood
        return -np.log(t_h - t_l + 1)
Note that this is a simple Python function preceded by a Python expression called a decorator [vanRossum2010], here called @stochastic. Generally, decorators enhance functions with additional properties or functionality. The Stochastic object produced by the @stochastic decorator will evaluate its log-probability using the function switchpoint. The value argument, which is required, provides an initial value for the variable. The remaining arguments will be assigned as parents of switchpoint (i.e. they will populate the parents dictionary).
To emphasize, the Python function decorated by @stochastic should compute the log-density or log-probability of the variable. That is why the return value in the example above is \(-\log(t_h-t_l+1)\) rather than \(1/(t_h-t_l+1)\).
The value and parents of stochastic variables may be any objects, provided the log-probability function returns a real number (float). PyMC and SciPy both provide implementations of several standard probability distributions that may be helpful for creating custom stochastic variables.
The decorator stochastic can take any of the arguments Stochastic.__init__ takes except parents, logp, random, doc and value. These arguments include trace, plot, verbose, dtype, rseed and name.
The decorator interface has a slightly more complex implementation which allows you to specify a random method for sampling the stochastic variable's value conditional on its parents:
@pymc.stochastic(dtype=int)
def switchpoint(value=1900, t_l=1851, t_h=1962):
    """The switchpoint for the rate of disaster occurrence."""

    def logp(value, t_l, t_h):
        if value > t_h or value < t_l:
            return -np.inf
        else:
            return -np.log(t_h - t_l + 1)

    def random(t_l, t_h):
        from numpy.random import random
        return np.round((t_h - t_l) * random()) + t_l
The stochastic variable again gets its name, docstring and parents from function switchpoint, but in this case it will evaluate its log-probability using the logp function. The random function will be used when switchpoint.random() is called. Note that random doesn't take a value argument, as it generates values itself.
- Direct
It's possible to instantiate Stochastic directly:
def switchpoint_logp(value, t_l, t_h):
    if value > t_h or value < t_l:
        return -np.inf
    else:
        return -np.log(t_h - t_l + 1)

def switchpoint_rand(t_l, t_h):
    from numpy.random import random
    return np.round((t_h - t_l) * random()) + t_l

switchpoint = Stochastic(logp = switchpoint_logp,
                doc = 'The switchpoint for the rate of disaster occurrence.',
                name = 'switchpoint',
                parents = {'t_l': 1851, 't_h': 1962},
                random = switchpoint_rand,
                trace = True,
                value = 1900,
                dtype = int,
                rseed = 1.,
                observed = False,
                cache_depth = 2,
                plot = True,
                verbose = 0)
Notice that the log-probability and random variate functions are specified externally and passed to Stochastic as arguments. This is a rather awkward way to instantiate a stochastic variable; consequently, such implementations should be rare.
A Warning: Don't update stochastic variables' values in-place
Stochastic objects' values should not be updated in-place. This confuses PyMC's caching scheme and corrupts the process used for accepting or rejecting proposed values in the MCMC algorithm. The only way a stochastic variable's value should be updated is using statements of the following form:
A.value = new_value
The following are in-place updates and should never be used:
- A.value += 3
- A.value[2,1] = 5
- A.value.attribute = new_attribute_value
This restriction becomes onerous if a step method proposes values for the elements of an array-valued variable separately. In this case, it may be preferable to partition the variable into several scalar-valued variables stored in an array or list.
Data¶
Data are represented by Stochastic objects whose observed attribute is set to True. If a stochastic variable's observed flag is True, its value cannot be changed, and it won't be sampled by the fitting method.
Declaring stochastic variables to be data¶
In each interface, an optional keyword argument observed can be set to True. In the decorator interface, this argument is added to the @stochastic decorator:
@pymc.stochastic(observed=True)
In the other interfaces, the observed=True argument is added to the instantiation of the Stochastic, or its subclass:
x = pymc.Binomial('x', n=n, p=p, observed=True)
Alternatively, in the decorator interface only, a Stochastic object's observed flag can be set to true by stacking an @observed decorator on top of the @stochastic decorator:
@pymc.observed(dtype=int)
def ...
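A minimal sketch of what such a declaration might look like, reusing the switchpoint log-probability from the decorator example above:
@pymc.observed(dtype=int)
def switchpoint(value=1900, t_l=1851, t_h=1962):
    """An observed switchpoint for the rate of disaster occurrence."""
    if value > t_h or value < t_l:
        return -np.inf
    return -np.log(t_h - t_l + 1)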
The Deterministic class¶
The Deterministic class represents variables whose values are completely determined by the values of their parents. For example, in model (disaster_model), rate is a deterministic variable. Recall it was defined by
\[r_t = \begin{cases} e & t \le s \\ l & t > s, \end{cases}\]
so rate's value can be computed exactly from the values of its parents early_mean, late_mean and switchpoint.
A deterministic variable's most important attribute is value, which gives the current value of the variable given the values of its parents. Like Stochastic's logp attribute, this attribute is computed on-demand and cached for efficiency.
A Deterministic variable has the following additional attributes:
- parents: A dictionary containing the variable's parents. The keys of the dictionary correspond to the names assigned to the variable's parents by the variable, and the values correspond to the actual parents.
- children: A set containing the variable's children, which must be nodes.
Deterministic variables have no methods.
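For instance, the Lambda class described below can create a simple deterministic in one line; a minimal sketch (double_mean is an invented name, and early_mean is the disaster-model variable):
double_mean = pymc.Lambda('double_mean', lambda e=early_mean: 2 * e)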
Creation of deterministic variables¶
Deterministic variables are less complicated than stochastic variables, and have similar automatic, decorator, and direct interfaces:
- Automatic
A handful of common functions have been wrapped in Deterministic objects. These are brief enough to list:
LinearCombination: Has two parents \(x\) and \(y\), both of which must be iterable (i.e. vector-valued). This function returns:
\[\sum_i x_i^T y_i.\]
Index: Has two parents \(x\) and index. \(x\) must be iterable, index must be valued as an integer. The value of an Index is:
\[x[\mathtt{index}]^T y[\mathtt{index}].\]
Index is useful for implementing dynamic models, in which the parent-child connections change.
Lambda: Converts an anonymous function (in Python, called lambda functions) to a Deterministic instance on a single line.
CompletedDirichlet: PyMC represents Dirichlet variables of length \(k\) by the first \(k-1\) elements; since they must sum to 1, the \(k^{th}\) element is determined by the others. CompletedDirichlet appends the \(k^{th}\) element to the value of its parent \(D\).
Logit, InvLogit, StukelLogit, StukelInvLogit: Common link functions for generalized linear models, and their inverses.
It's a good idea to use these classes when feasible in order to give hints to step methods.
- Elementary operations on variables
Certain elementary operations on variables create deterministic variables. For example:
>>> x = pymc.MvNormalCov('x', np.ones(3), np.eye(3))
>>> y = pymc.MvNormalCov('y', np.ones(3), np.eye(3))
>>> print x + y
<pymc.PyMCObjects.Deterministic '(x_add_y)' at 0x105c3bd10>
>>> print x[0]
<pymc.CommonDeterministics.Index 'x[0]' at 0x105c52390>
>>> print x[1] + y[2]
<pymc.PyMCObjects.Deterministic '(x[1]_add_y[2])' at 0x105c52410>
All the objects thus created have trace=False and plot=False by default. This convenient method of generating simple deterministics was inspired by [Kerman2004].
- Decorator
A deterministic variable can be created via a decorator in a way very similar to Stochastic's decorator interface:
@deterministic(plot=False)
def rate(s=switchpoint, e=early_mean, l=late_mean):
    ''' Concatenate Poisson means '''
    out = np.empty(len(disasters_array))
    out[:s] = e
    out[s:] = l
    return out
Notice that rather than returning the log-probability, as is the case for Stochastic objects, the function returns the value of the deterministic object, given its parents. This return value may be of any type, as is suitable for the problem at hand. Also notice that, unlike for Stochastic objects, there is no value argument passed, since the value is calculated deterministically by the function itself. Arguments' keys and values are converted into a parent dictionary as with Stochastic's short interface. The deterministic decorator can take trace, verbose and plot arguments, like the stochastic decorator [2].
decorator [2].- Direct
Deterministic
objects can also be instantiated directly:def rate_eval(switchpoint = s, early_rate = e, late_rate = l): value = zeros(N) value[:switchpoint] = early_rate value[switchpoint:] = late_rate return value rate = pymc.Deterministic(eval = rate_eval, name = 'rate', parents = {'switchpoint': switchpoint, 'early_mean': early_mean, 'late_mean': late_mean}, doc = 'The rate of disaster occurrence.', trace = True, verbose = 0, dtype=float, plot=False, cache_depth = 2)
Containers¶
In some situations it would be inconvenient to assign a unique label to each parent of some variable. Consider \(y\) in the following model:
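\[
\begin{aligned}
x_0 &\sim \text{Normal}(0, \tau_x) \\
x_{i+1} &\sim \text{Normal}(x_i, \tau_x), \quad i = 0, \ldots, N-2 \\
y &\sim \text{Normal}\!\left(\sum_{i=0}^{N-1} x_i^2, \, \tau_y\right)
\end{aligned}
\]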
Here, \(y\) depends on every element of the Markov chain \(x\), but we wouldn't want to manually enter \(N\) parent labels 'x_0', 'x_1', etc.
This situation can be handled naturally in PyMC:
N = 10
x_0 = pymc.Normal('x_0', mu=0, tau=1)

x = np.empty(N, dtype=object)
x[0] = x_0

for i in range(1, N):
    x[i] = pymc.Normal('x_%i' % i, mu=x[i-1], tau=1)

@pymc.observed
def y(value=1, mu=x, tau=100):
    return pymc.normal_like(value, np.sum(mu**2), tau)
PyMC automatically wraps array \(x\) in an appropriate Container class. The expression 'x_%i' % i labels each Normal object in the container with the appropriate index \(i\). For example, if i=1, the name of the corresponding element becomes 'x_1'.
Containers, like variables, have an attribute called value. This attribute returns a copy of the (possibly nested) iterable that was passed into the container function, but with each variable inside replaced with its corresponding value.
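A minimal sketch, continuing the example above (Container is the same wrapping that PyMC applies automatically):
xc = pymc.Container(x)
print(xc.value)   # an array holding each x_i's current value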
Containers can be constructed from lists, tuples, dictionaries, NumPy arrays, modules, sets or any object with a __dict__ attribute. Variables and non-variables can be freely mixed in these containers, and different types of containers can be nested [3]. Containers attempt to behave like the objects they wrap. All containers are subclasses of ContainerBase.
Containers have the following useful attributes in addition to value:
- variables
- stochastics
- potentials
- deterministics
- data_stochastics
- step_methods
Each of these attributes is a set containing all the objects of each type in a container, and within any containers in the container.
The Potential class¶
The joint density corresponding to model (disaster_model) can be written as follows:
\[p(D, s, e, l) = p(D \mid s, e, l)\, p(s)\, p(e)\, p(l).\]
Each factor in the joint distribution is a proper, normalized probability
distribution for one of the variables conditional on its parents. Such factors
are contributed by Stochastic
objects.
In some cases, it's nice to be able to modify the joint density by incorporating terms that don't correspond to probabilities of variables conditional on parents, for example a density known only up to a product of pairwise factors:
\[p(x_0, x_1, \ldots, x_{N-1}) \propto \prod_{i=0}^{N-2} \psi_i(x_i, x_{i+1}).\]
In other cases we may want to add probability terms to existing models. For example, suppose we want to constrain the difference between \(e\) and \(l\) in (disaster_model) to be less than 1, so that the joint density becomes
\[p(D, s, e, l) \propto p(D \mid s, e, l)\, p(s)\, p(e)\, p(l)\, I(|e-l| < 1).\]
It's possible to express this constraint by adding variables to the model, or by grouping \(e\) and \(l\) to form a vector-valued variable, but it's uncomfortable to do so.
Arbitrary factors such as \(\psi\) and the indicator function \(I(|e-l|<1)\) are implemented by objects of class Potential (from [Lauritzen1990] and [Jordan2004], who call these terms 'factor potentials'). Bayesian hierarchical notation (cf. model (disaster_model)) doesn't accommodate these potentials. They are often used in cases where there is no natural dependence hierarchy, such as the first example above (which is known as a Markov random field). They are also useful for expressing 'soft data' [Christakos2002], as in the second example below.
Potentials have one important attribute, logp, the log of their current probability or probability density value given the values of their parents. The only other additional attribute of interest is parents, a dictionary containing the potential's parents. Potentials have no methods. They have no trace attribute, because they are not variables. They cannot serve as parents of variables (for the same reason), so they have no children attribute.
An example of soft data¶
The role of potentials can be confusing, so we will provide another example: we have a dataset \(t\) consisting of the days on which several marked animals were recaptured. We believe that the probability \(S\) that an animal is not recaptured on any given day can be explained by a covariate vector \(x\). We model this situation as follows:

\[\begin{split}t_i \mid S_i &\sim \text{Geometric}(S_i), \quad i = 1, \ldots, N\\
S_i &= \text{logit}^{-1}(\beta x_i)\\
\beta &\sim \text{N}(\mu_\beta, V_\beta)\end{split}\]
Now suppose we have some knowledge from other related experiments, so we have a good idea of what \(S\) will be, independent of the value of \(\beta\). It's not obvious how to work this ‘soft data’ into the model, because as we've written it \(S\) is completely determined by \(\beta\). There are three options within the strict Bayesian hierarchical framework:
- Express the soft data as an informative prior on \(\beta\).
- Incorporate the data from the previous experiments explicitly into the model.
- Refactor the model so that \(S\) is at the bottom of the hierarchy, and assign the prior directly.
Factor potentials provide a convenient way to incorporate the soft data without the need for such major modifications. We can simply modify the joint distribution from

\[p(t \mid S(\beta))\, p(\beta)\]

to

\[\gamma(S)\, p(t \mid S(\beta))\, p(\beta),\]

where the value of \(\gamma\) is large if \(S\)'s value is plausible (based on our external information) and small otherwise. We do not know the normalizing constant for the new distribution, but we don't need it to use most popular fitting algorithms. It's a good idea to check the induced priors on \(S\) and \(\beta\) for sanity. This can be done in PyMC by fitting the model with the data \(t\) removed.
It's important to understand that \(\gamma\) is not a variable, so it does not have a value. That means, among other things, there will be no \(\gamma\) column in MCMC traces.
Creation of Potentials¶
There are two ways to create potentials:
- Decorator
  A potential can be created via a decorator in a way very similar to Deterministic's decorator interface:

@pymc.potential
def psi_i(x_lo=x[i], x_hi=x[i+1]):
    """A pair potential"""
    return -(x_lo - x_hi)**2

  The function supplied should return the potential's current log-probability or log-density as a NumPy float. The potential decorator can take verbose and cache_depth arguments like the stochastic decorator.
- Direct
  The same potential could be created directly as follows:

def psi_i_logp(x_lo=x[i], x_hi=x[i+1]):
    return -(x_lo - x_hi)**2

psi_i = pymc.Potential(logp=psi_i_logp,
                       name='psi_i',
                       parents={'x_lo': x[i], 'x_hi': x[i+1]},
                       doc='A pair potential',
                       verbose=0,
                       cache_depth=2)
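Either way, the potential's current log-density can then be queried via its logp attribute (continuing the example above):

print(psi_i.logp)   # log-density given the current values of x[i] and x[i+1]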
Graphing models¶
The function dag (or graph) in pymc.graph draws graphical representations of Model (Chapter Fitting Models) instances using GraphViz via the Python package PyDot. See [Lauritzen1990] and [Jordan2004] for more discussion of useful information that can be read off of graphical models [4].
The symbol for stochastic variables is an ellipse. Parent-child relationships are indicated by arrows. These arrows point from parent to child and are labeled with the names assigned to the parents by the children. PyMC's symbol for deterministic variables is a downward-pointing triangle. A graphical representation of model disastermodel is shown in the figure ‘Directed acyclic graph of the relationships in the coal mining disaster model example’. Note that \(D\) is shaded because it is flagged as data.
The symbol for factor potentials is a rectangle, as in the following model. Factor potentials are usually associated with undirected graphical models. In undirected representations, each parent of a potential is connected to every other parent by an undirected edge. The undirected representation of the model pictured above is much more compact. Directed or mixed graphical models can be represented in an undirected form by ‘moralizing’, which is done by the function pymc.graph.moral_graph.
Class LazyFunction and caching¶
This section gives an overview of how PyMC computes log-probabilities. This is advanced information that is not required in order to use PyMC.
The logp attributes of stochastic variables and potentials and the value attributes of deterministic variables are wrappers for instances of class LazyFunction. Lazy functions are wrappers for ordinary Python functions. A lazy function L could be created from a function fun as follows:
L = pymc.LazyFunction(fun, arguments)
The argument arguments is a dictionary container; fun must accept keyword arguments only. When L's get() method is called, the return value is the same as the call

fun(**arguments.value)

Note that no arguments need to be passed to L.get; lazy functions memorize their arguments.
Before calling fun, L will check the values of arguments.variables against an internal cache. This comparison is done by reference, not by value, and this is part of the reason why stochastic variables' values cannot be updated in-place. If arguments.variables' values match a frame of the cache, the corresponding output value is returned and fun is not called. If a call to fun is needed, arguments.variables' values and the return value replace the oldest frame in the cache. The depth of the cache can be set using the optional init argument cache_depth, which defaults to 2.
Caching is helpful in MCMC, because variables’ log-probabilities and values tend to be queried multiple times for the same parental value configuration. The default cache depth of 2 turns out to be most useful in Metropolis-Hastings-type algorithms involving proposed values that may be rejected.
Lazy functions are implemented in C using Pyrex, a language for writing Python extensions.
Footnotes

[1] Note that the random method does not provide a Gibbs sample unless the variable has no children.

[2] Note that deterministic variables have no observed flag. If a deterministic variable's value were known, its parents would be restricted to the inverse image of that value under the deterministic variable's evaluation function. This usage would be extremely difficult to support in general, but it can be implemented for particular applications at the StepMethod level.

[3] Nodes whose parents are containers make private shallow copies of those containers. This is done for technical reasons rather than to protect users from accidental misuse.

[4] Note that these authors do not consider deterministic variables.
Fitting Models¶
PyMC provides three objects that fit models:
- MCMC, which coordinates Markov chain Monte Carlo algorithms. The actual work of updating stochastic variables conditional on the rest of the model is done by StepMethod objects, which are described in this chapter.
- MAP, which computes maximum a posteriori estimates.
- NormApprox, which computes the ‘normal approximation’ [Gelman2004]: the joint distribution of all stochastic variables in a model is approximated as normal using local information at the maximum a posteriori estimate.
All three objects are subclasses of Model, which is PyMC's base class for fitting methods. MCMC and NormApprox, both of which can produce samples from the posterior, are subclasses of Sampler, which is PyMC's base class for Monte Carlo fitting methods. Sampler provides a generic sampling loop method and database support for storing large sets of joint samples. These base classes implement some basic methods that are inherited by the three implemented fitting methods, so they are documented at the end of this section.
Creating models¶
The first argument to any fitting method's __init__ method, including that of MCMC, is called input. The input argument can be just about anything; once you have defined the nodes that make up your model, you shouldn't even have to think about how to wrap them in a Model instance. Some examples of model instantiation using nodes a, b and c follow:
- M = MCMC(set([a,b,c]))
  This will create an MCMC model with \(a\), \(b\), and \(c\) as components, each of which will be exposed as attributes of M (e.g. M.a).
- M = MCMC({'a': a, 'd': [b,c]})
  In this case, \(M\) will expose \(a\) and \(d\) as attributes: M.a will be \(a\), and M.d will be [b,c].
- M = MAP([[a,b],c])
  This will create a MAP model with \(a\) and \(b\) as a Container object and \(c\) exposed on its own.
- If file MyModule contains the definitions of a, b and c:

import MyModule
M = Model(MyModule)

  In this case, \(M\) will expose \(a\), \(b\) and \(c\) as attributes.
Using a ‘model factory’ function:

def make_model(x):
    a = pymc.Exponential('a', beta=x, value=0.5)

    @pymc.deterministic
    def b(a=a):
        return 100 - a

    @pymc.stochastic
    def c(value=.5, a=a, b=b):
        return (value - a)**2 / b

    return locals()

M = pymc.MCMC(make_model(3))

In this case, \(M\) will also expose \(a\), \(b\) and \(c\) as attributes.
The Model class¶
This class serves as a container for probability models and as a base class for the classes responsible for model fitting, such as MCMC.

Model's __init__ method takes the following arguments:

input:
- Some collection of PyMC nodes defining a probability model. These may be stored in a list, set, tuple, dictionary, array, module, or any object with a __dict__ attribute.

verbose (optional):
- An integer controlling the verbosity of the model's output.
Models' useful methods are:

draw_from_prior():
- Sets all stochastic variables' values to new random values, which would be a sample from the joint distribution if all data and Potential instances' log-probability functions returned zero. If any stochastic variables lack a random() method, PyMC will raise an exception. (A sketch follows this list.)

seed():
- Same as draw_from_prior, but only stochastics whose rseed attribute is not None are changed.
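A minimal sketch of draw_from_prior, using the disaster model example (all of its stochastics are built-in distributions with random() methods):

from pymc.examples import disaster_model
M = pymc.Model(disaster_model)
M.draw_from_prior()          # set stochastics to random values drawn from the prior
print(M.early_mean.value)    # now holds a fresh prior draw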
As introduced in the previous chapter, the helper function graph.dag produces graphical representations of models (see [Jordan2004]).

Models have the following important attributes:
- variables
- nodes
- stochastics
- potentials
- deterministics
- observed_stochastics
- containers
- value
- logp

In addition, models expose each node they contain as an attribute. For instance, if model M were produced from model (disastermodel), M.s would return the switchpoint variable.
Though one may instantiate Model objects directly, most users should prefer to instantiate the Model subclass that they will be using to fit their model. These are each described below.
Maximum a posteriori estimates¶
The MAP class sets all stochastic variables to their maximum a posteriori values using functions in SciPy's optimize package; hence, SciPy must be installed to use it. MAP can only handle variables whose dtype is float, so it will not work, for example, on model (disastermodel). To fit the model in examples/gelman_bioassay.py using MAP, do the following:
>>> from pymc.examples import gelman_bioassay
>>> M = pymc.MAP(gelman_bioassay)
>>> M.fit()
This call will cause \(M\) to fit the model using modified Powell optimization, which does not require derivatives. The variables in gelman_bioassay have now been set to their maximum a posteriori values:
>>> M.alpha.value
array(0.8465892309923545)
>>> M.beta.value
array(7.7488499785334168)
In addition, the AIC and BIC of the model are now available:
>>> M.AIC
7.9648372671389458
>>> M.BIC
6.7374259893787265
MAP has two useful methods:

fit(method='fmin_powell', iterlim=1000, tol=.0001):
- The optimization method may be fmin, fmin_l_bfgs_b, fmin_ncg, fmin_cg, or fmin_powell. See the documentation of SciPy's optimize package for the details of these methods. The tol and iterlim parameters are passed to the optimization function under the appropriate names.

revert_to_max():
- If the values of the constituent stochastic variables change after fitting, this function will reset them to their maximum a posteriori values.
If you're going to use an optimization method that requires derivatives, MAP's __init__ method can take additional parameters eps and diff_order. diff_order, which must be an integer, specifies the order of the numerical approximation (see the SciPy function derivative). The step size for numerical derivatives is controlled by eps, which may be either a single value or a dictionary of values whose keys are variables (actual objects, not names).
The useful attributes of MAP are:

logp:
- The joint log-probability of the model.

logp_at_max:
- The maximum joint log-probability of the model.

AIC:
- Akaike's information criterion for this model ([Akaike1973], [Burnham2002]).

BIC:
- The Bayesian information criterion for this model [Schwarz1978].
One use of the MAP class is finding reasonable initial states for MCMC chains; a minimal sketch follows. Note that multiple Model subclasses can handle the same collection of nodes.
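A sketch of that workflow, using the gelman_bioassay example (the variable names are those of that module):

from pymc.examples import gelman_bioassay
M = pymc.MAP(gelman_bioassay)
M.fit()                      # set all stochastics to their MAP estimates
N = pymc.MCMC(M.variables)   # chains now start from those values
N.sample(5000, burn=1000)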
Normal approximations¶
The NormApprox class extends the MAP class by approximating the posterior covariance of the model using the Fisher information matrix, or the Hessian of the joint log probability at the maximum. To fit the model in examples/gelman_bioassay.py using NormApprox, do:
>>> N = pymc.NormApprox(gelman_bioassay)
>>> N.fit()
The approximate joint posterior mean and covariance of the variables are available via the attributes mu and C:
>>> N.mu[N.alpha]
array([ 0.84658923])
>>> N.mu[N.alpha, N.beta]
array([ 0.84658923, 7.74884998])
>>> N.C[N.alpha]
matrix([[ 1.03854093]])
>>> N.C[N.alpha, N.beta]
matrix([[ 1.03854093, 3.54601911],
[ 3.54601911, 23.74406919]])
As with MAP, the variables have been set to their maximum a posteriori values (which are also in the mu attribute) and the AIC and BIC of the model are available.

In addition, it's now possible to generate samples from the posterior as with MCMC:
>>> N.sample(100)
>>> N.trace('alpha')[::10]
array([-0.85001278, 1.58982854, 1.0388088 , 0.07626688, 1.15359581,
-0.25211939, 1.39264616, 0.22551586, 2.69729987, 1.21722872])
>>> N.trace('beta')[::10]
array([ 2.50203663, 14.73815047, 11.32166303, 0.43115426,
10.1182532 , 7.4063525 , 11.58584317, 8.99331152,
11.04720439, 9.5084239 ])
Any of the database backends can be used (chapter Saving and managing sampling results).
In addition to the methods and attributes of MAP, NormApprox provides the following methods:

sample(iter):
- Samples from the approximate posterior distribution are drawn and stored.

isample(iter):
- An ‘interactive’ version of sample(): sampling can be paused, returning control to the user.

draw:
- Sets all variables to random values drawn from the approximate posterior.

It provides the following additional attributes:

mu:
- A special dictionary-like object that can be keyed with multiple variables. N.mu[p1, p2, p3] would return the approximate posterior mean values of stochastic variables p1, p2 and p3, raveled and concatenated to form a vector.

C:
- Another special dictionary-like object. N.C[p1, p2, p3] would return the approximate posterior covariance matrix of stochastic variables p1, p2 and p3. As with mu, these variables' values are raveled and concatenated before their covariance matrix is constructed.
Markov chain Monte Carlo: the MCMC class¶
The MCMC class implements PyMC's core business: producing ‘traces’ for a model's variables which, with careful thinning, can be considered independent joint samples from the posterior. See Tutorial for an example of basic usage.

MCMC's primary job is to create and coordinate a collection of ‘step methods’, each of which is responsible for updating one or more variables. The available step methods are described below. Instructions on how to create your own step method are available in Extending PyMC.
MCMC provides the following useful methods:

sample(iter, burn, thin, tune_interval, tune_throughout, save_interval, ...):
- Runs the MCMC algorithm and produces the traces. The iter argument controls the total number of MCMC iterations. No tallying will be done during the first burn iterations; these samples will be forgotten. After this burn-in period, tallying will be done each thin iterations. Tuning will be done each tune_interval iterations. If tune_throughout=False, no more tuning will be done after the burn-in period. The model state will be saved every save_interval iterations, if given.

isample(iter, burn, thin, tune_interval, tune_throughout, save_interval, ...):
- An interactive version of sample. The sampling loop may be paused at any time, returning control to the user.

use_step_method(method, *args, **kwargs):
- Creates an instance of step method class method to handle some stochastic variables. The extra arguments are passed to the __init__ method of method. Assigning a step method to a variable manually will prevent the MCMC instance from automatically assigning one. However, you may handle a variable with multiple step methods.

goodness():
- Calculates goodness-of-fit (GOF) statistics according to [Brooks2000].

save_state():
- Saves the current state of the sampler, including all stochastics, to the database. This allows the sampler to be reconstituted at a later time to resume sampling. This is not supported yet for the sqlite backend.

restore_state():
- Restores the sampler to the state stored in the database.

stats():
- Generate summary statistics for all nodes in the model.

remember(trace_index):
- Set all variables' values from frame trace_index in the database.
MCMC samplers' step methods can be accessed via the step_method_dict attribute. M.step_method_dict[x] returns a list of the step methods M will use to handle the stochastic variable x.

After sampling, the information tallied by M can be queried via M.db.trace_names. In addition to the values of variables, tuning information for adaptive step methods is generally tallied. These ‘traces’ can be plotted to verify that tuning has in fact terminated.
You can produce ‘traces’ for arbitrary functions with zero arguments as well. If you issue the command M._funs_to_tally['trace_name'] = f before sampling begins, then each time the model variables' values are tallied, f will be called with no arguments, and the return value will be tallied. After sampling ends you can retrieve the trace as M.trace('trace_name').
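For instance, a minimal sketch using the disaster model's variables (the function name is illustrative):

def rate_ratio():
    return disaster_model.early_mean.value / disaster_model.late_mean.value

M._funs_to_tally['rate_ratio'] = rate_ratio
M.sample(10000, burn=5000)
ratios = M.trace('rate_ratio')[:]   # the tallied ratio values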
The Sampler class¶
MCMC is a subclass of a more general class called Sampler. Samplers fit models with Monte Carlo fitting methods, which characterize the posterior distribution by approximate samples from it. They are initialized as follows: Sampler(input=None, db='ram', name='Sampler', reinit_model=True, calc_deviance=False, verbose=0). The input argument is a module, list, tuple, dictionary, set, or object that contains all elements of the model; the db argument indicates which database backend should be used to store the samples (see chapter Saving and managing sampling results); reinit_model is a boolean flag that indicates whether the model should be re-initialised before running; and calc_deviance is a boolean flag indicating whether deviance should be calculated for the model at each iteration. Samplers have the following important methods:
sample(iter, length, verbose, ...):
- Samples from the joint distribution. The iter argument controls how many times the sampling loop will be run, and the length argument controls the initial size of the database that will be used to store the samples.

isample(iter, length, verbose, ...):
- The same as sample, but the sampling is done interactively: you can pause sampling at any point and be returned to the Python prompt to inspect progress and adjust fitting parameters. While sampling is paused, the following methods are useful:

icontinue():
- Continue interactive sampling.

halt():
- Truncate the database and clean up.

tally():
- Write all variables' current values to the database. The actual write operation depends on the specified database backend.

save_state():
- Saves the current state of the sampler, including all stochastics, to the database. This allows the sampler to be reconstituted at a later time to resume sampling. This is not supported yet for the sqlite backend.

restore_state():
- Restores the sampler to the state stored in the database.

stats():
- Generate summary statistics for all nodes in the model.

remember(trace_index):
- Set all variables' values from frame trace_index in the database. Note that the trace_index is different from the current iteration, since not all samples are necessarily saved due to burning and thinning.
In addition, the sampler attribute deviance is a deterministic variable valued as the model's deviance at its current state.
Step methods¶
Step method objects handle individual stochastic variables, or sometimes groups of them. They are responsible for making the variables they handle take single MCMC steps conditional on the rest of the model. Each subclass of StepMethod implements a method called step(), which is called by MCMC. Step methods with adaptive tuning parameters can optionally implement a method called tune(), which causes them to assess performance (based on the acceptance rates of proposed values for the variable) so far and adjust.

The major subclasses of StepMethod are Metropolis, AdaptiveMetropolis and Slicer. PyMC provides several flavors of the basic Metropolis steps. Gibbs sampling (Gibbs) steps exist, but they are not ready for use as of the current release; since it is feasible to write Gibbs step methods for particular applications, however, the Gibbs base class is documented here.
Metropolis step methods¶
Metropolis and subclasses implement Metropolis-Hastings steps. To tell an MCMC object \(M\) to handle a variable x with a Metropolis step method, you might do the following:
M.use_step_method(pymc.Metropolis, x, proposal_sd=1., proposal_distribution='Normal')
Metropolis itself handles float-valued variables, and subclasses DiscreteMetropolis and BinaryMetropolis handle integer- and boolean-valued variables, respectively. Subclasses of Metropolis must implement the following methods:
propose():
- Sets the values of the variables handled by the Metropolis step method to proposed values.

reject():
- If the Metropolis-Hastings acceptance test fails, this method is called to reset the values of the variables to their values before propose() was called.
Note that there is no accept() method; if a proposal is accepted, the variables' values are simply left alone. Subclasses that use proposal distributions other than symmetric random-walk may specify the ‘Hastings factor’ by changing the hastings_factor method. See Extending PyMC for an example.
Metropolis' __init__ method takes the following arguments:

stochastic:
- The variable to handle.

proposal_sd:
- A float or array of floats. This sets the proposal standard deviation if the proposal distribution is normal.

scale:
- A float, defaulting to 1. If the scale argument is provided but not proposal_sd, proposal_sd is computed as follows:

if all(self.stochastic.value != 0.):
    self.proposal_sd = ones(shape(self.stochastic.value)) * \
                       abs(self.stochastic.value) * scale
else:
    self.proposal_sd = ones(shape(self.stochastic.value)) * scale

proposal_distribution:
- A string indicating which distribution should be used for proposals. Current options are 'Normal' and 'Prior'.

verbose:
- An integer. By convention 0 indicates no output, 1 shows a progress bar only, 2 provides basic feedback about the current MCMC run, while 3 and 4 provide low and high debugging verbosity, respectively.
Although the proposal_sd attribute is fixed at creation, Metropolis step methods adjust their initial proposal standard deviations using an attribute called adaptive_scale_factor. When tune() is called, the acceptance ratio of the step method is examined, and this scale factor is updated accordingly. If the proposal distribution is normal, proposals will have standard deviation self.proposal_sd * self.adaptive_scale_factor.
By default, tuning will continue throughout the sampling loop, even after the burn-in period is over. This can be changed via the tune_throughout argument to MCMC.sample. If an adaptive step method's tally flag is set (the default for Metropolis), a trace of its tuning parameters will be kept. If you allow tuning to continue throughout the sampling loop, it is important to verify that the ‘Diminishing Tuning’ condition of [Roberts2007] is satisfied: the amount of tuning should decrease to zero, or tuning should become very infrequent.
If a Metropolis step method handles an array-valued variable, it proposes all
elements independently but simultaneously. That is, it decides whether to
accept or reject all elements together but it does not attempt to take the
posterior correlation between elements into account. The AdaptiveMetropolis
class (see below), on the other hand, does make correlated proposals.
The AdaptiveMetropolis class¶
The AdaptiveMetropolis (AM) step method works like a regular Metropolis step method, with the exception that its variables are block-updated using a multivariate jump distribution whose covariance is tuned during sampling. Although the chain is non-Markovian, it has correct ergodic properties (see [Haario2001]).
To tell an MCMC object \(M\) to handle variables x, y and z with an AdaptiveMetropolis instance, you might do the following:
M.use_step_method(pymc.AdaptiveMetropolis, [x,y,z], \
scales={'x':1, 'y':2, 'z':.5}, delay=10000)
AdaptiveMetropolis' __init__ method takes the following arguments:
stochastics:
- The stochastic variables to handle. These will be updated jointly.

cov (optional):
- An initial covariance matrix. Defaults to the identity matrix, adjusted according to the scales argument.

delay (optional):
- The number of iterations to delay before computing the empirical covariance matrix.

scales (optional):
- The initial covariance matrix will be diagonal, and its diagonal elements will be set to scales times the stochastics' values, squared.

interval (optional):
- The number of iterations between updates of the covariance matrix. Defaults to 1000.

greedy (optional):
- If True, only accepted jumps will be counted toward the delay before the covariance is first computed. Defaults to True.

verbose:
- An integer from 0 to 4 controlling the verbosity of the step method's printed output.

shrink_if_necessary (optional):
- Whether the proposal covariance should be shrunk if the acceptance rate becomes extremely small.
In this algorithm, jumps are proposed from a multivariate normal distribution with covariance matrix \(\Sigma\). The algorithm first iterates until delay samples have been drawn (if greedy is true, until delay jumps have been accepted). At this point, \(\Sigma\) is given the value of the empirical covariance of the trace so far and sampling resumes. The covariance is then updated each interval iterations throughout the entire sampling run [1]. It is this constant adaptation of the proposal distribution that makes the chain non-Markovian.
The DiscreteMetropolis class¶
This class is just like Metropolis, but specialized to handle Stochastic instances with dtype int. The jump proposal distribution can either be 'Normal', 'Prior' or 'Poisson' (the default). In the normal case, the proposed value is drawn from a normal distribution centered at the current value and then rounded to the nearest integer.
The BinaryMetropolis class¶
This class is specialized to handle Stochastic instances with dtype bool.

For array-valued variables, BinaryMetropolis can be set to propose from the prior by passing in dist="Prior". Otherwise, the argument p_jump of the __init__ method specifies how probable a change is. Like Metropolis' attribute proposal_sd, p_jump is tuned throughout the sampling loop via adaptive_scale_factor.

For scalar-valued variables, BinaryMetropolis behaves like a Gibbs sampler, since this requires no additional expense. The p_jump and adaptive_scale_factor parameters are not used in this case.
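A minimal sketch (the variable z is illustrative):

z = pymc.Bernoulli('z', p=0.5, size=10)   # array-valued boolean variable
M.use_step_method(pymc.BinaryMetropolis, z, p_jump=0.1)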
The Slicer class¶
The Slicer class implements slice sampling ([Neal2003]). To tell an MCMC object \(M\) to handle a variable x with a Slicer step method, you might do the following:

M.use_step_method(pymc.Slicer, x, w=10, m=10000, doubling=True)

Slicer's __init__ method takes the following arguments:
stochastics:
- The stochastic variables to handle. These will be updated jointly.

w (optional):
- The initial width of the horizontal slice (defaults to 1). This will be updated via either the stepping-out or the doubling procedure.

m (optional):
- The multiplier defining the maximum slice size as \(mw\) (defaults to 1000).

tune (optional):
- A flag indicating whether to tune the initial slice width (defaults to True).

doubling (optional):
- A flag for using the doubling procedure instead of stepping out (defaults to False).

tally (optional):
- Flag for recording values for the trace (defaults to True).

verbose:
- An integer from -1 to 4 controlling the verbosity of the step method's printed output (defaults to -1).
The slice sampler generates posterior samples by alternately drawing ‘slices’ from the vertical (y) and horizontal (x) planes of a distribution. It first samples from the conditional distribution for y given some current value of x, which is uniform over \((0, f(x))\). Conditional on this value for y, it then samples x, which is uniform on \(S = \{x : y < f(x)\}\); that is, the ‘slice’ defined by the y value. Hence, this algorithm automatically adapts its step size to the local characteristics of the posterior.
The steps required to perform a single iteration of the slice sampler to update the current value of \(x_i\) are as follows:
- Sample y uniformly on \((0, f(x_i))\).
- Use this value y to define a horizontal slice \(S = \{x : y < f(x)\}\).
- Establish an interval, \(I = (x_{a}, x_{b})\), around \(x_i\) that contains most of the slice.
- Sample \(x_{i+1}\) from the region of the slice overlapping \(I\).
Hence, slice sampling employs an auxiliary variable (y) that is not retained at the end of the iteration. Note that in practice one may operate on the log scale, with \(g(x) = \log(f(x))\), to avoid floating-point underflow. In this case, the auxiliary variable becomes \(z = \log(y) = g(x_i) - e\), where \(e \sim \text{Exp}(1)\), resulting in the slice \(S = \{x : z < g(x)\}\).
There are many ways of establishing and sampling from the interval \(I\), with the only restriction being that the resulting Markov chain leaves \(f(x)\) invariant. The objective is to include as much of the slice as possible, so that the potential step size can be large, but not (much) larger than the slice, so that the sampling of invalid points is minimized. Ideally, we would like the interval to be the slice itself, but this may not always be feasible to determine (and certainly not automatically).

One method for determining a sampling interval for \(x_{i+1}\) involves specifying an initial ‘guess’ at the slice width w, and iteratively moving the endpoints out (growing the interval) until either (1) the interval reaches a maximum pre-specified width or (2) y is less than \(f(x)\) evaluated at both the left and the right interval endpoints. This is the stepping-out method. The efficiency of stepping out depends largely on the ability to pick a reasonable initial width w. Otherwise, the doubling procedure may be preferable, since it expands the interval faster: it simply doubles the size of the interval until both endpoints are outside the slice.
Gibbs step methods¶
Gibbs step methods handle conjugate submodels. These models usually have two components: the ‘parent’ and the ‘children’. For example, a gamma-distributed variable serving as the precision of several normally-distributed variables is a conjugate submodel; the gamma variable is the parent and the normal variables are the children.
This section describes PyMC's current scheme for Gibbs step methods, several of which are in a semi-working state in the sandbox. It is meant to be as generic as possible to minimize code duplication, but it is admittedly complicated. Feel free to subclass StepMethod directly when writing Gibbs step methods if you prefer.
Gibbs step methods that subclass PyMC's Gibbs should define the following class attributes:

child_class:
- The class of the children in the submodels the step method can handle.

parent_class:
- The class of the parent.

parent_label:
- The label the children would apply to the parent in a conjugate submodel. In the gamma-normal example, this would be tau.

linear_OK:
- A flag indicating whether the children can use linear combinations involving the parent as their actual parent without destroying the conjugacy.
A subclass of Gibbs that defines these attributes only needs to implement a propose() method, which will be called by Gibbs.step(). The resulting step method will be able to handle both conjugate and ‘non-conjugate’ cases. The conjugate case corresponds to an actual conjugate submodel. In the non-conjugate case all the children are of the required class, but the parent is not. In this case the parent's value is proposed from the likelihood and accepted based on its prior. The acceptance rate in the non-conjugate case will be less than one.
The inherited class method Gibbs.competence will determine the new step method's ability to handle a variable x by checking whether:
- all of x's children are of class child_class, and either apply parent_label to x directly or (if linear_OK=True) to a LinearCombination object (Building models), one of whose parents contains x;
- x is of class parent_class.

If both conditions are met, pymc.conjugate_Gibbs_competence will be returned. If only the first is met, pymc.nonconjugate_Gibbs_competence will be returned.
Granularity of step methods: one-at-a-time vs. block updating¶
There is currently no way for a stochastic variable to compute individual terms of its log-probability; it is computed all together. This means that updating the elements of an array-valued variable individually would be inefficient, so all existing step methods update array-valued variables together, in a block update.
To update an array-valued variable’s elements individually, simply break it up into an array of scalar-valued variables. Instead of this:
A = pymc.Normal('A', value=zeros(100), mu=0., tau=1.)
do this:
A = [pymc.Normal('A_%i'%i, value=0., mu=0., tau=1.) for i in range(100)]
An individual step method will be assigned to each element of A in the latter case, and the elements will be updated individually. Note that A can be broken up into larger blocks if desired.
Automatic assignment of step methods¶
Every step method subclass (including user-defined ones) that does not require any __init__ arguments other than the stochastic variable to be handled adds itself to a list called StepMethodRegistry in the PyMC namespace. If a stochastic variable in an MCMC object has not been explicitly assigned a step method, each class in StepMethodRegistry is allowed to examine the variable.

To do so, each step method implements a class method called competence(stochastic), whose only argument is a single stochastic variable. These methods return values from 0 to 3; 0 meaning the step method cannot safely handle the variable and 3 meaning it will most likely perform well for variables like this. The MCMC object assigns the step method that returns the highest competence value to each of its stochastic variables.
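As an illustrative sketch (the class and its competence rule are hypothetical; the built-in step methods implement competence along similar lines):

import numpy as np
import pymc

class MyStep(pymc.Metropolis):

    @staticmethod
    def competence(stochastic):
        # Claim high competence for scalar float-valued variables,
        # and decline everything else.
        if stochastic.dtype == float and np.size(stochastic.value) == 1:
            return 3
        return 0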
Footnotes

[1] The covariance is estimated recursively from the previous value and the last interval samples, instead of computing it each time from the entire trace.
Saving and managing sampling results¶
Accessing Sampled Data¶
The recommended way to access data from an MCMC run, irrespective of the database backend, is to use the trace method:
>>> from pymc.examples import disaster_model
>>> from pymc import MCMC
>>> M = MCMC(disaster_model)
>>> M.sample(10)
Sampling: 100% [00000000000000000000000000000000000000000000000000] Iterations: 10
>>> M.trace('early_mean')[:]
array([ 2.28320992, 2.28320992, 2.28320992, 2.28320992, 2.28320992,
2.36982455, 2.36982455, 3.1669422 , 3.1669422 , 3.14499489])
M.trace('early_mean') returns a copy of the Trace instance associated with the tallyable object early_mean:
>>> M.trace('early_mean')
<pymc.database.ram.Trace object at 0x7fa4877a8b50>
Particular subsamples from the trace are obtained using the slice notation [], similar to NumPy arrays. By default, trace returns the samples from the last chain. To return the samples from all the chains, use chain=None:
>>> M.sample(5)
Sampling: 100% [000000000000000000000000000000000000000000000000000] Iterations: 5
>>> M.trace('early_mean', chain=None)[:]
array([ 2.28320992, 2.28320992, 2.28320992, 2.28320992, 2.28320992,
2.36982455, 2.36982455, 3.1669422 , 3.1669422 , 3.14499489,
3.14499489, 3.14499489, 3.14499489, 2.94672454, 3.10767686])
Output Summaries¶
PyMC samplers include a couple of methods that are useful for obtaining summaries of the model, or particular member nodes, rather than the entire trace. The summary method can be used to generate a pretty-printed summary of posterior quantities. For example, if we want a statistical snapshot of the early_mean node:
>>> M.early_mean.summary()
early_mean:
Mean SD MC Error 95% HPD interval
------------------------------------------------------------------
3.075 0.287 0.01 [ 2.594 3.722]
Posterior quantiles:
2.5 25 50 75 97.5
|---------------|===============|===============|---------------|
2.531 2.876 3.069 3.255 3.671
A method of the same name exists for the sampler, which yields summaries for every node in the model.
Alternatively, we may wish to write posterior statistics to a file, where they may be imported into a spreadsheet or plotting package. In that case, write_csv may be called to generate a comma-separated values (csv) file containing all available statistics for each node:

M.write_csv("disasters.csv", variables=["early_mean", "late_mean", "switchpoint"])

write_csv is called with a single mandatory argument, the name of the output file for the summary statistics, and several optional arguments, including a list of parameters for which summaries are desired (if not given, all model nodes are summarized) and an alpha level for calculating credible intervals (defaults to 0.05).
Saving Data to Disk¶
By default, the database backend selected by the MCMC sampler is the ram backend, which simply holds the data in memory. Now, we will create a sampler that instead will write data to a pickle file:
>>> M = MCMC(disaster_model, db='pickle', dbname='Disaster.pickle')
>>> M.db
<pymc.database.pickle.Database object at 0x7fa486623d90>
>>> M.sample(10)
>>> M.db.close()
Note that in this particular case, no data is written to disk before the call to db.close. The close method will flush data to disk and prepare the database for a potential session exit. Closing a Python session without calling close beforehand is likely to corrupt the database, making the data irretrievable. To simply flush data to disk without closing the database, use the commit method.
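For example, a minimal sketch of periodically flushing a long run to disk:

M = MCMC(disaster_model, db='pickle', dbname='Disaster.pickle')
for i in range(10):
    M.sample(1000)
    M.db.commit()   # flush the samples tallied so far without closing
M.db.close()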
Some backends not only have the ability to store the traces, but also the state of the step methods at the end of sampling. This is particularly useful when long warm-up periods are needed to tune the jump parameters. When the database is loaded in a new session, the step methods query the database to fetch the state they were in at the end of the last trace.
Remember to close() the database before ending the Python session.
Reloading a Database¶
To load a file created in a previous session, use the load function from the backend:
>>> db = pymc.database.pickle.load('Disaster.pickle')
>>> len(db.trace('early_mean')[:])
10
The db object also has a trace method identical to that of Sampler. You can hence inspect the results of a model, even when you don't have the model around.
To add a new trace to this file, we need to create an MCMC instance. This time, instead of setting db='pickle', we will pass the existing Database instance and sample as usual. A new trace will be appended to the first:
>>> M = MCMC(disaster_model, db=db)
>>> M.sample(5)
Sampling: 100% [000000000000000000000000000000000000000000000000000] Iterations: 5
>>> len(M.trace('early_mean', chain=None)[:])
15
>>> M.db.close()
The ram backend¶
Used by default, this backend simply holds a copy of the trace in memory, with no output written to disk. This is useful for short runs or testing. For long runs generating large amounts of data, using this backend may fill the available memory, forcing the OS to page memory out to disk and slowing down all other applications.
The no_trace backend¶
This backend simply does not store the trace. This may be useful for testing purposes.
The txt backend¶
With the txt backend, the data is written to disk in ASCII files. More precisely, the dbname argument is used to create a top directory into which chain directories, called Chain_<#>, are going to be created each time sample is called:
dbname/
Chain_0/
<object0 name>.txt
<object1 name>.txt
...
Chain_1/
<object0 name>.txt
<object1 name>.txt
...
...
In each one of these chain directories, files named <variable name>.txt are created, storing the values of the variable as rows of text:
# Variable: e
# Sample shape: (5,)
# Date: 2008-11-18 17:19:13.554188
3.033672373807017486e+00
3.033672373807017486e+00
...
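A previously written txt database can later be reloaded with the backend's load function (a sketch; 'dbname' is the top directory created above):

db = pymc.database.txt.load('dbname')
samples = db.trace('e')[:]   # values of variable e from the last chain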
While the txt backend makes it easy to load data using other applications and programming languages, it is not optimized for speed nor memory efficiency. If you plan on generating and handling large datasets, read on for better options.
The pickle backend¶
The pickle database relies on the cPickle module to save the traces. Use of this backend is appropriate for small-scale, short-lived projects. For longer term or larger projects, the pickle backend should be avoided, since generated files might be unreadable across different Python versions. The pickled file is a simple dump of a dictionary containing the NumPy arrays storing the traces, as well as the state of the Sampler's step methods.
The sqlite backend¶
The sqlite backend is based on the Python module sqlite3 (built in to Python versions greater than 2.4). It opens an SQL database named dbname, and creates one table per tallyable object. The rows of this table store a key, the chain index and the values of the objects:

key (INT), trace (INT), v1 (FLOAT), v2 (FLOAT), v3 (FLOAT) ...

The key is autoincremented each time a new row is added to the table, that is, each time tally is called by the sampler. Note that the save_state feature is not implemented; that is, the state of the step methods is not stored internally in the database.
The hdf5 backend¶
The hdf5 backend uses pyTables to save data in binary HDF5 format. The hdf5 database is fast and can store huge traces, far larger than the available RAM. Data can be compressed and decompressed on the fly to reduce the disk footprint. Another feature of this backend is that it can store arbitrary objects. Whereas the other backends are limited to numerical values, hdf5 can tally any object that can be pickled, opening the door for powerful and exotic applications (see pymc.gp).
The internal structure of an HDF5 file storing both numerical values and arbitrary objects is as follows:
/ (root)
/chain0/ (Group) 'Chain #0'
/chain0/PyMCSamples (Table(N,)) 'PyMC Samples'
/chain0/group0 (Group) 'Group storing objects.'
/chain0/group0/<object0 name> (VLArray(N,)) '<object0 name> samples.'
/chain0/group0/<object1 name> (VLArray(N,)) '<object1 name> samples.'
...
/chain1/ (Group) 'Chain #1'
...
All standard numerical values are stored in a Table, while objects are stored in individual VLArrays.
The hdf5 Database takes the following parameters:
- dbname (string): Name of the hdf5 file.
- dbmode (string): File mode: a: append, w: overwrite, r: read-only.
- dbcomplevel (int (0-9)): Compression level, 0: no compression.
- dbcomplib (string): Compression library (zlib, bzip2, lzo).
According to the pyTables manual, zlib ([Roelofs2010]) has fast decompression, relatively slow compression, and a good compression ratio; LZO ([Oberhumer2008]) has fast compression, but a low compression ratio; and bzip2 ([Seward2007]) has an excellent compression ratio but requires more CPU. Note that some of these compression algorithms require additional software to work (see the pyTables manual).
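As a minimal sketch, these parameters can be passed straight through the sampler's constructor (assuming pyTables is installed; the disaster model is used for illustration):

M = MCMC(disaster_model, db='hdf5', dbname='disaster.h5',
         dbmode='w', dbcomplevel=5, dbcomplib='zlib')
M.sample(10000, burn=5000)
M.db.close()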
Writing a New Backend¶
It is relatively easy to write a new backend for PyMC. The first step is to look at the database.base module, which defines barebones Database and Trace classes. This module contains documentation on the methods that should be defined to get a working backend.

Testing your new backend should be trivial, since the test_database module contains a generic test class that can easily be subclassed to check that the basic features required of all backends are implemented and working properly.
Model checking and diagnostics¶
Convergence Diagnostics¶
Valid inferences from sequences of MCMC samples are based on the assumption that the samples are derived from the true posterior distribution of interest. Theory guarantees this condition as the number of iterations approaches infinity. It is important, therefore, to determine the minimum number of samples required to ensure a reasonable approximation to the target posterior density. Unfortunately, no universal threshold exists across all problems, so convergence must be assessed independently each time MCMC estimation is performed. The procedures for verifying convergence are collectively known as convergence diagnostics.
One approach to analyzing convergence is analytical, whereby the variance of the sample at different sections of the chain are compared to that of the limiting distribution. These methods use distance metrics to analyze convergence, or place theoretical bounds on the sample variance, and though they are promising, they are generally difficult to use and are not prominent in the MCMC literature. More common is a statistical approach to assessing convergence. With this approach, rather than considering the properties of the theoretical target distribution, only the statistical properties of the observed chain are analyzed. Reliance on the sample alone restricts such convergence criteria to heuristics. As a result, convergence cannot be guaranteed. Although evidence for lack of convergence using statistical convergence diagnostics will correctly imply lack of convergence in the chain, the absence of such evidence will not guarantee convergence in the chain. Nevertheless, negative results for one or more criteria may provide some measure of assurance to users that their sample will provide valid inferences.
For most simple models, convergence will occur quickly, sometimes within the first several hundred iterations, after which all remaining samples of the chain may be used to calculate posterior quantities. For more complex models, convergence requires a significantly longer burn-in period; sometimes orders of magnitude more samples are needed. Frequently, lack of convergence will be caused by poor mixing (Figure 7.1). Recall that mixing refers to the degree to which the Markov chain explores the support of the posterior distribution. Poor mixing may stem from inappropriate proposals (if one is using the Metropolis-Hastings sampler) or from attempting to estimate models with highly correlated variables.
Informal Methods¶
The most straightforward approach for assessing convergence is based on simply plotting and inspecting traces and histograms of the observed MCMC sample. If the trace of values for each of the stochastics exhibits asymptotic behavior [1] over the last \(m\) iterations, this may be satisfactory evidence for convergence. A similar approach involves plotting a histogram for every set of \(k\) iterations (perhaps 50-100) beyond some burn-in threshold \(n\); if the histograms are not visibly different among the sample intervals, this is reasonable evidence for convergence. Note that such diagnostics should be carried out for each stochastic estimated by the MCMC algorithm, because convergent behavior by one variable does not imply evidence for convergence for other variables in the analysis. An extension of this approach can be taken when multiple parallel chains are run, rather than just a single, long chain. In this case, the final values of \(c\) chains run for \(n\) iterations are plotted in a histogram; just as above, this is repeated every \(k\) iterations thereafter, and the histograms of the endpoints are plotted again and compared to the previous histogram. This is repeated until consecutive histograms are indistinguishable.
Another ad hoc method for detecting lack of convergence is to examine the traces of several MCMC chains initialized with different starting values. Overlaying these traces on the same set of axes should (if convergence has occurred) show each chain tending toward the same equilibrium value, with approximately the same variance. Recall that the tendency for some Markov chains to converge to the true (unknown) value from diverse initial values is called ergodicity. This property is guaranteed by the reversible chains constructed using MCMC, and should be observable using this technique. Again, however, this approach is only a heuristic method, and cannot always detect lack of convergence, even though chains may appear ergodic.
A principal reason that evidence from informal techniques cannot guarantee convergence is a phenomenon called metastability. Chains may appear to have converged to the true equilibrium value, displaying excellent qualities by any of the methods described above. However, after some period of stability around this value, the chain may suddenly move to another region of the parameter space (Figure 7.2). This period of metastability can sometimes be very long, and therefore escape detection by these convergence diagnostics. Unfortunately, there is no statistical technique available for detecting metastability.
Formal Methods¶
Along with the ad hoc techniques described above, a number of more formal methods exist which are prevalent in the literature. These are considered more formal because they are based on existing statistical methods, such as time series analysis.
PyMC currently includes three formal convergence diagnostic methods. The first, proposed by [Geweke1992], is a time-series approach that compares the mean and variance of segments from the beginning and end of a single chain.
\[z = \frac{\bar{\theta}_a - \bar{\theta}_b}{\sqrt{\widehat{\text{Var}}(\theta_a) + \widehat{\text{Var}}(\theta_b)}}\]

where \(a\) is the early interval and \(b\) the late interval. If the z-scores (theoretically distributed as standard normal variates) of these two segments are similar, it can provide evidence for convergence. PyMC calculates z-scores of the difference between various initial segments along the chain, and the last 50% of the remaining chain. If the chain has converged, the majority of points should fall within 2 standard deviations of zero.
Diagnostic z-scores can be obtained by calling the geweke function. It accepts either (1) a single trace, (2) a Node or Stochastic object, or (3) an entire Model object:
geweke(x, first=0.1, last=0.5, intervals=20)
The arguments expected are the following:
- pymc_object: The object that is or contains the output trace(s).
- first (optional): First portion of chain to be used in Geweke diagnostic. Defaults to 0.1 (i.e. first 10% of chain).
- last (optional): Last portion of chain to be used in Geweke diagnostic. Defaults to 0.5 (i.e. last 50% of chain).
- intervals (optional): Number of sub-chains to analyze. Defaults to 20.
The resulting scores are best interpreted graphically, using the geweke_plot function. This displays the scores in series, in relation to the 2 standard deviation boundaries around zero. Hence, it is easy to see departures from the standard normal assumption.

geweke_plot takes either a single set of scores, or a dictionary of scores (output by geweke when an entire Sampler is passed) as its argument:
def geweke_plot(scores, name='geweke', format='png', suffix='-diagnostic',
path='./', fontmap = {1:10, 2:8, 3:6, 4:5, 5:4}, verbose=1)
The arguments are defined as:
- scores: The object that contains the Geweke scores. Can be a list (one set) or a dictionary (multiple sets).
- name (optional): Name used for output files. For multiple scores, the dictionary keys are used as names.
- format (optional): Graphic output file format (defaults to png).
- suffix (optional): Suffix to filename (defaults to -diagnostic).
- path (optional): The path for output graphics (defaults to working directory).
- fontmap (optional): Dictionary containing the font map for the labels of the graphic.
- verbose (optional): Verbosity level for output (defaults to 1).
To illustrate, consider the sample model gelman_bioassay that is used to instantiate an MCMC sampler. The sampler is then run for a given number of iterations:
>>> from pymc.examples import gelman_bioassay
>>> S = pymc.MCMC(gelman_bioassay)
>>> S.sample(10000, burn=5000)
It is easiest simply to pass the entire sampler S to the geweke function:
>>> scores = pymc.geweke(S, intervals=20)
>>> pymc.Matplot.geweke_plot(scores)
Alternatively, individual stochastics within S
can be analyzed for
convergence:
>>> trace = S.trace("alpha")[:]
>>> alpha_scores = pymc.geweke(trace, intervals=20)
>>> pymc.Matplot.geweke_plot(alpha_scores, "alpha")
An example of convergence and non-convergence of a chain using geweke_plot is given in Figure 7.3.
The second diagnostic provided by PyMC is the [Raftery1995a] procedure. This approach estimates the number of iterations required to reach convergence, along with the number of burn-in samples to be discarded and the appropriate thinning interval. A separate estimate of both quantities can be obtained for each variable in a given model.
As the criterion for determining convergence, the Raftery and Lewis approach uses the accuracy of estimation of a user-specified quantile. For example, we may want to estimate the quantile \(q = 0.975\) to within \(r = 0.005\) with probability \(s = 0.95\). In other words,

\[\Pr(|\hat{q} - q| \le r) = s.\]

From any sample of \(\theta\), one can construct a binary chain:

\[Z^{(j)} = I(\theta^{(j)} \le u_q),\]

where \(u_q\) is the quantile value and \(I\) is the indicator function. While \(\{\theta^{(j)}\}\) is a Markov chain, \(\{Z^{(j)}\}\) is not necessarily so. In any case, the serial dependency among \(Z^{(j)}\) decreases as the thinning interval \(k\) increases. A value of \(k\) is chosen to be the smallest value such that the first order Markov chain is preferable to the second order Markov chain.
This thinned sample is used to determine the number of burn-in samples. This is done by comparing the remaining samples from burn-in intervals of increasing length to the limiting distribution of the chain. An appropriate value is one for which the truncated sample's distribution is within \(\epsilon\) (arbitrarily small) of the limiting distribution. See [Raftery1995a] or [Gamerman1997] for computational details. Estimates for sample size tend to be conservative.
This diagnostic is best used on a short pilot run of a particular model, and the results used to parameterize a subsequent sample that is to be used for inference. Its calling convention is as follows:
raftery_lewis(x, q, r, s=.95, epsilon=.001, verbose=1)
The arguments are:
- pymc_object: The object that is or contains the output trace(s).
- q: Desired quantile to be estimated.
- r: Desired accuracy for quantile.
- s (optional): Probability of attaining the requested accuracy (defaults to 0.95).
- epsilon (optional): Half width of the tolerance interval required for the q-quantile (defaults to 0.001).
- verbose (optional): Verbosity level for output (defaults to 1).
The code for raftery_lewis is based on the FORTRAN program gibbsit ([Raftery1995b]).
For example, consider again a sampler S run for some model my_model:
>>> S = pymc.MCMC(my_model)
>>> S.sample(10000, burn=5000)
One can pass either the entire sampler S or any stochastic within S to the raftery_lewis function, along with suitable arguments. Here, we have chosen \(q = 0.025\) (the lower limit of the equal-tailed 95% interval) and error \(r = 0.01\):
>>> pymc.raftery_lewis(S, q=0.025, r=0.01)
This yields diagnostics as follows for each stochastic of S, as well as a dictionary containing the diagnostic quantities:
========================
Raftery-Lewis Diagnostic
========================
937 iterations required (assuming independence) to achieve 0.01 accuracy
with 95 percent probability.
Thinning factor of 1 required to produce a first-order Markov chain.
39 iterations to be discarded at the beginning of the simulation (burn-in).
11380 subsequent iterations required.
Thinning factor of 11 required to produce an independence chain.
The third convergence diagnostic provided by PyMC is the Gelman-Rubin statistic ([Gelman1992]). This diagnostic uses multiple chains to check for lack of convergence, and is based on the notion that if multiple chains have converged, by definition they should appear very similar to one another; if not, one or more of the chains has failed to converge.
The Gelman-Rubin diagnostic uses an analysis of variance approach to assessing convergence. That is, it calculates both the between-chain variance (B) and within-chain variance (W), and assesses whether they are different enough to worry about convergence. Assuming \(m\) chains, each of length \(n\), quantities are calculated by:

\[B = \frac{n}{m-1} \sum_{j=1}^{m} (\bar{\theta}_{.j} - \bar{\theta}_{..})^2\]

\[W = \frac{1}{m} \sum_{j=1}^{m} \left[ \frac{1}{n-1} \sum_{i=1}^{n} (\theta_{ij} - \bar{\theta}_{.j})^2 \right]\]
for each scalar estimand \(\theta\), where \(\bar{\theta}_{.j}\) is the mean of chain \(j\) and \(\bar{\theta}_{..}\) is the mean of these chain means. Using these values, an estimate of the marginal posterior variance of \(\theta\) can be calculated:

\[\hat{\text{Var}}(\theta \mid y) = \frac{n-1}{n} W + \frac{1}{n} B\]
Assuming \(\theta\) was initialized to arbitrary starting points in each chain, this quantity will overestimate the true marginal posterior variance. At the same time, \(W\) will tend to underestimate the within-chain variance early in the sampling run. However, in the limit as \(n \rightarrow \infty\), both quantities will converge to the true variance of \(\theta\). In light of this, the Gelman-Rubin statistic monitors convergence using the ratio:

\[\hat{R} = \sqrt{\frac{\hat{\text{Var}}(\theta \mid y)}{W}}\]
This is called the potential scale reduction, since it is an estimate of the potential reduction in the scale of \(\theta\) as the number of simulations tends to infinity. In practice, we look for values of \(\hat{R}\) close to one (say, less than 1.1) to be confident that a particular estimand has converged. In PyMC, the function gelman_rubin will calculate \(\hat{R}\) for each stochastic node in the passed model:
>>> pymc.gelman_rubin(S)
{'alpha': 1.0036389589627821,
'beta': 1.001503957313336,
'theta': [1.0013923468783055,
1.0274479503713816,
0.95365716267969636,
1.00267321019079]}
For the best results, each chain should be initialized to highly dispersed starting values for each stochastic node.
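As a sketch of how one might produce the required multiple chains with a single sampler (in PyMC, each call to sample stores a separate chain in the trace database; my_model is assumed, as above):

>>> S = pymc.MCMC(my_model)
>>> for i in range(3):
...     S.sample(10000, burn=5000)   # each call adds a new chain
>>> pymc.gelman_rubin(S)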
By default, when calling the summary_plot function on nodes with multiple chains, the \(\hat{R}\) values will be plotted alongside the posterior intervals.
Additional convergence diagnostics are available in the R statistical package ([R2010]), via the CODA module ([Plummer2008]). PyMC includes a method coda for exporting model traces in a format that may be directly read by coda:
>>> pymc.utils.coda(S)
Generating CODA output
==================================================
Processing deaths
Processing beta
Processing theta
Processing alpha
The lone argument is the PyMC sampler for which output is desired.
Calling coda yields a file containing raw trace values (suffix .out) and a file containing indices to the trace values (suffix .ind).
Autocorrelation Plots¶
Samples from MCMC algorithms are usually autocorrelated, due partly to the inherent Markovian dependence structure. The degree of autocorrelation can be quantified using the autocorrelation function:

\[\rho_k = \frac{\text{Cov}(X_t,\, X_{t+k})}{\sqrt{\text{Var}(X_t)\, \text{Var}(X_{t+k})}}\]
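For intuition, the lag-\(k\) autocorrelation of a one-dimensional trace can be estimated directly with NumPy. This is a minimal sketch of the estimator, not PyMC's own implementation:

import numpy as np

def acf(trace, k):
    """Estimate the lag-k autocorrelation of a 1-D trace."""
    x = np.asarray(trace) - np.mean(trace)
    # Ratio of the lag-k autocovariance to the variance
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)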
PyMC includes a function for plotting the autocorrelation function for each stochastic in the sampler (Figure 7.5). This allows users to examine the relationship among successive samples within sampled chains. Significant autocorrelation suggests that chains require thinning prior to use of the posterior statistics for inference.
autocorrelation(pymc_object, name, maxlag=100, format='png', suffix='-acf',
path='./', fontmap = {1:10, 2:8, 3:6, 4:5, 5:4}, verbose=1)
pymc_object
: The object that is or contains the output trace(s).

name
: Name used for output files.

maxlag
: The highest lag interval for which autocorrelation is calculated.

format (optional)
: Graphic output file format (defaults to png).

suffix (optional)
: Suffix to filename (defaults to -acf).

path (optional)
: The path for output graphics (defaults to the working directory).

fontmap (optional)
: Dictionary containing the font map for the labels of the graphic.

verbose (optional)
: Verbosity level for output (defaults to 1).
Autocorrelation plots can be obtained simply by passing the sampler to the autocorrelation function (within the Matplot module) directly:
>>> S = pymc.MCMC(gelman_bioassay)
>>> S.sample(10000, burn=5000)
>>> pymc.Matplot.autocorrelation(S)
Alternatively, variables within a model can be plotted individually. For example, the parameter beta that was estimated using sampler S for the gelman_bioassay model will yield a correlation plot as follows:
>>> pymc.Matplot.autocorrelation(S.beta)
Goodness of Fit¶
Checking for model convergence is only the first step in the evaluation of MCMC model outputs. It is possible for an entirely unsuitable model to converge, so additional steps are needed to ensure that the estimated model adequately fits the data. One intuitive way of evaluating model fit is to compare model predictions with the observations used to fit the model. In other words, the fitted model can be used to simulate data, and the distribution of the simulated data should resemble the distribution of the actual data.
Fortunately, simulating data from the model is a natural component of the Bayesian modelling framework. Recall, from the discussion on imputation of missing data, the posterior predictive distribution:

\[p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, f(\theta \mid y)\, d\theta\]
Here, \(\tilde{y}\) represents some hypothetical new data that would be expected, taking into account the posterior uncertainty in the model parameters. Sampling from the posterior predictive distribution is easy in PyMC: the code looks identical to the corresponding data stochastic, with two modifications: (1) the observed=True flag is omitted, so the node's values are free to vary, and (2) the node is given a new name. As an example, consider a simple dose-response model, where deaths are modeled as a binomial random variable for which the probability of death is a logit-linear function of the dose of a particular drug:
import pymc
import numpy as np

n = [5]*4
dose = np.array([-.86, -.3, -.05, .73])
x = [0, 1, 3, 5]

alpha = pymc.Normal('alpha', mu=0.0, tau=0.01)
beta = pymc.Normal('beta', mu=0.0, tau=0.01)

@pymc.deterministic
def theta(a=alpha, b=beta, d=dose):
    """theta = inv_logit(a + b*d)"""
    return pymc.invlogit(a + b*d)

# deaths ~ binomial(n, p)
deaths = pymc.Binomial('deaths', n=n, p=theta, value=x, observed=True)
The posterior predictive distribution of deaths uses the same functional form as the data likelihood, in this case a binomial stochastic. Here is the corresponding sample from the posterior predictive distribution:
deaths_sim = pymc.Binomial('deaths_sim', n=n, p=theta)
Notice that the observed Binomial stochastic deaths has been replaced with a stochastic node deaths_sim that is identical in every respect, except that its values are not fixed to be the observed data – they are left to vary according to the values of the fitted parameters.
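To obtain the simulated data, deaths_sim is simply included in the model when the sampler is built; its trace then holds draws from the posterior predictive distribution. A minimal sketch, using the variables defined above:

M = pymc.MCMC([alpha, beta, theta, deaths, deaths_sim])
M.sample(10000, burn=5000)

# Each row of the trace is a replicated dataset drawn from the
# posterior predictive distribution
deaths_rep = M.trace('deaths_sim')[:]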
The degree to which simulated data correspond to observations can be evaluated in at least two ways. First, these quantities can simply be compared visually. This allows for a qualitative comparison of model-based replicates and observations. If there is poor fit, the true value of the data may appear in the tails of the histogram of replicated data, while a good fit will tend to show the true data in high-probability regions of the posterior predictive distribution (Figure 7.6).
The Matplot package in PyMC provides an easy way of producing such plots, via the gof_plot function. To illustrate, consider a single data point x and an array of values x_sim sampled from the posterior predictive distribution. The histogram is generated by calling:
pymc.Matplot.gof_plot(x_sim, x, name='x')
A second approach for evaluating goodness of fit using samples from the posterior predictive distribution involves the use of a statistical criterion. For example, the Bayesian p-value [Gelman1996] uses a discrepancy measure that quantifies the difference between data (observed or simulated) and the expected value, conditional on some model. One such discrepancy measure is the Freeman-Tukey statistic [Brooks2000]:

\[D(x \mid \theta) = \sum_{j} \left(\sqrt{x_j} - \sqrt{e_j}\right)^2\]
where the \(x_j\) are data and \(e_j\) are the corresponding expected values, based on the model. Model fit is assessed by comparing the discrepancies from observed data to those from simulated data. On average, we expect the difference between them to be zero; hence, the Bayesian p value is simply the proportion of simulated discrepancies that are larger than their corresponding observed discrepancies:

\[p = Pr\left[ D(x^{\text{sim}} \mid \theta) > D(x^{\text{obs}} \mid \theta) \right]\]
If \(p\) is very large (e.g. \(>0.975\)) or very small (e.g. \(<0.025\)) this implies that the model is not consistent with the data, and thus is evidence of lack of fit. Graphically, data and simulated discrepancies plotted together should be clustered along a 45 degree line passing through the origin, as shown in Figure 7.7.
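The following sketch computes Freeman-Tukey discrepancies and the resulting Bayesian p value by hand for toy arrays, mirroring the calculation described above; the pymc.discrepancy function, introduced next, performs this computation for you:

import numpy as np

# Toy inputs: 4 observed counts, 1000 posterior draws of expected values,
# and one simulated dataset per draw
x = np.array([2., 1., 3., 5.])
x_exp = np.random.uniform(1, 5, size=(1000, 4))
x_sim = np.random.poisson(x_exp)

def freeman_tukey(data, expected):
    # D = sum_j (sqrt(x_j) - sqrt(e_j))^2, computed per MCMC draw
    return np.sum((np.sqrt(data) - np.sqrt(expected))**2, axis=-1)

D_obs = freeman_tukey(x, x_exp)       # observed discrepancies
D_sim = freeman_tukey(x_sim, x_exp)   # simulated discrepancies
p = np.mean(D_sim > D_obs)            # Bayesian p value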
The discrepancy function in the diagnostics package can be used to generate discrepancy statistics from arrays of data, simulated values, and expected values:
D = pymc.discrepancy(x, x_sim, x_exp)
For a dataset of size \(n\) and an MCMC chain of length \(r\), this implies that x is size (n,), x_sim is size (r,n) and x_exp is either size (r,) or (r,n). A call to this function returns two arrays of discrepancy values (simulated and observed), which can be passed to the discrepancy_plot function in the Matplot module to generate a scatter plot and, if desired, a p value:
pymc.Matplot.discrepancy_plot(D, name='D', report_p=True)
Additional optional arguments for discrepancy_plot are identical to those of other PyMC plotting functions.
Footnotes

[1] Asymptotic behaviour implies that the variance and the mean value of the sample stay relatively constant over some arbitrary period.
Extending PyMC¶
PyMC tries to make standard things easy, but keep unusual things possible. Its openness, combined with Python’s flexibility, invites extensions ranging from new step methods to exotic stochastic processes (see the Gaussian process module). This chapter briefly reviews the ways PyMC is designed to be extended.
Nonstandard Stochastics¶
The simplest way to create a Stochastic object with a nonstandard distribution is to use the medium or long decorator syntax; see Chapter Building models. If you want to create many stochastics with the same nonstandard distribution, the decorator syntax can become cumbersome. An actual subclass of Stochastic can be created using the class factory stochastic_from_dist. This function takes the following arguments:
- The name of the new class,
- A logp function,
- A random function,
- The NumPy datatype of the new class (for continuous distributions, this should be float; for discrete distributions, int; for variables valued as non-numerical objects, object),
- A flag indicating whether the resulting class represents a vector-valued variable.
The necessary parent labels are read from the logp function, and a docstring for the new class is automatically generated. Instances of the new class can be created in one line.
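For example, here is a sketch that builds a Laplace (double-exponential) Stochastic subclass. The distribution is merely illustrative, and stochastic_from_dist is assumed to take the arguments in the order listed above:

import numpy as np
import pymc

def laplace_logp(value, mu, tau):
    # Log-density of a Laplace distribution with location mu and rate tau
    return np.sum(np.log(tau / 2.) - tau * np.abs(value - mu))

def laplace_random(mu, tau):
    return np.random.laplace(loc=mu, scale=1. / tau)

# Parent labels 'mu' and 'tau' are read from laplace_logp's signature
Laplace = pymc.stochastic_from_dist('Laplace', logp=laplace_logp,
                                    random=laplace_random,
                                    dtype=float, mv=False)

w = Laplace('w', mu=0.0, tau=1.0)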
Full subclasses of Stochastic may be necessary to provide nonstandard behaviors (see gp.GP).
User-defined step methods¶
The StepMethod class is meant to be subclassed. There are an enormous number of MCMC step methods in the literature, whereas PyMC provides only about half a dozen. Most user-defined step methods will be either Metropolis-Hastings or Gibbs step methods, and these should subclass Metropolis or Gibbs respectively. More unusual step methods should subclass StepMethod directly.
Example: an asymmetric Metropolis step¶
Consider the probability model in examples/custom_step.py:
import numpy as np
import pymc

# 'data' is an array of observations defined in examples/custom_step.py
mu = pymc.Normal('mu', 0, .01, value=0)
tau = pymc.Exponential('tau', .01, value=1)
cutoff = pymc.Exponential('cutoff', 1, value=1.3)
D = pymc.Truncnorm('D', mu, tau, -np.inf, cutoff, value=data, observed=True)
The stochastic variable cutoff cannot be smaller than the largest element of \(D\); otherwise, \(D\)’s density would be zero. The standard Metropolis step method can handle this case without problems; it will propose illegal values occasionally, but these will be rejected.
Suppose we want to handle cutoff with a smarter step method that doesn’t propose illegal values. Specifically, we want to use the nonsymmetric proposal distribution:

\[x_p \mid x \sim \text{Truncnorm}(x, \sigma, l, \infty)\]

that is, a normal density centered at the current value \(x\), truncated below at \(l\) (the largest element of \(D\)).
We can implement this Metropolis-Hastings algorithm with the following step method class:
class TruncatedMetropolis(pymc.Metropolis):
    def __init__(self, stochastic, low_bound, up_bound, *args, **kwargs):
        self.low_bound = low_bound
        self.up_bound = up_bound
        pymc.Metropolis.__init__(self, stochastic, *args, **kwargs)

    # Propose a value from a truncated normal centered at the current value
    def propose(self):
        tau = 1./(self.adaptive_scale_factor * self.proposal_sd)**2
        self.stochastic.value = \
            pymc.rtruncnorm(self.stochastic.value, tau, self.low_bound, self.up_bound)

    # Log of the ratio of the 'backward' and 'forward' proposal densities
    def hastings_factor(self):
        tau = 1./(self.adaptive_scale_factor * self.proposal_sd)**2
        cur_val = self.stochastic.value
        last_val = self.stochastic.last_value

        lp_for = pymc.truncnorm_like(cur_val, last_val, tau, self.low_bound, self.up_bound)
        lp_bak = pymc.truncnorm_like(last_val, cur_val, tau, self.low_bound, self.up_bound)

        if self.verbose > 1:
            print self._id + ': Hastings factor %f' % (lp_bak - lp_for)
        return lp_bak - lp_for
The propose method sets the step method’s stochastic’s value to a new value, drawn from a truncated normal distribution. The precision of this distribution is computed from two factors: self.proposal_sd, which can be set with an input argument to Metropolis, and self.adaptive_scale_factor.
Metropolis step methods’ default tuning behavior is to reduce adaptive_scale_factor if the acceptance rate is too low, and to increase adaptive_scale_factor if it is too high. By incorporating adaptive_scale_factor into the proposal standard deviation, we avoid having to write our own tuning infrastructure. If we don’t want the proposal to tune, we don’t have to use adaptive_scale_factor.
The hastings_factor method adjusts for the asymmetric proposal distribution [Gelman2004]. It computes the log of the quotient of the ‘backward’ density and the ‘forward’ density. For symmetric proposal distributions, this quotient is 1, so its log is zero.
Having created our custom step method, we need to tell MCMC instances to use it to handle the variable cutoff. This is done in custom_step.py with the following line:
M.use_step_method(TruncatedMetropolis, cutoff, D.value.max(), np.inf)
This call causes \(M\) to pass the arguments cutoff, D.value.max(), and np.inf to a TruncatedMetropolis object’s __init__ method, and to use the object to handle cutoff.
It’s often convenient to get a handle to a custom step method instance directly for debugging purposes. M.step_method_dict[cutoff] returns a list of all the step methods \(M\) will use to handle cutoff:
>>> M.step_method_dict[cutoff]
[<custom_step.TruncatedMetropolis object at 0x3c91130>]
There may be more than one, and conversely step methods may handle more than one stochastic variable. To see which variables step method \(S\) is handling, try:
>>> S.stochastics
set([<pymc.distributions.Exponential 'cutoff' at 0x3cd6b90>])
General step methods¶
All step methods must implement the following methods:
step()
: Updates the values of self.stochastics.

tune()
: Tunes the jumping strategy based on performance so far. A default method is available that increases self.adaptive_scale_factor (see below) when the acceptance rate is high, and decreases it when the acceptance rate is low. This method should return True if additional tuning will be required later, and False otherwise.
otherwise.competence(s):
- A class method that examines stochastic variable \(s\) and returns a
- value from 0 to 3 expressing the step method’s ability to handle the
variable. This method is used by
MCMC
instances when automatically assigning step methods. Conventions are: - 0
- I cannot safely handle this variable.
- 1
- I can handle the variable about as well as the standard
Metropolis
step method. - 2
- I can do better than
Metropolis
. - 3
- I am the best step method you are likely to find for this variable in most cases.
For example, if you write a step method that can handle
MyStochasticSubclass
well, the competence method might look like this:class MyStepMethod(pymc.StepMethod): def __init__(self, stochastic, *args, **kwargs): ... @classmethod def competence(self, stochastic): if isinstance(stochastic, MyStochasticSubclass): return 3 else: return 0
Note that PyMC will not even attempt to assign a step method automatically if its __init__ method cannot be called with a single stochastic instance; that is, MyStepMethod(x) must be a legal call. The list of step methods that PyMC will consider assigning automatically is called pymc.StepMethodRegistry.

current_state()
: This method is easiest to explain by showing the code:

    state = {}
    for s in self._state:
        state[s] = getattr(self, s)
    return state
self._state should be a list containing the names of the attributes needed to reproduce the current jumping strategy. If an MCMC object writes its state out to a database, these attributes will be preserved. If an MCMC object restores its state from the database later, the corresponding step method will have these attributes set to their saved values.
Step methods should also maintain the following attributes:
_id
: A string that can identify each step method uniquely (usually something like <class_name>_<stochastic_name>).

adaptive_scale_factor
: An ‘adaptive scale factor’. This attribute is only needed if the default tune() method is used.

_tuning_info
: A list of strings giving the names of any tuning parameters. For Metropolis instances, this would be adaptive_scale_factor. This list is used to keep traces of tuning parameters in order to verify ‘diminishing tuning’ [Roberts2007].
All step methods have a property called loglike, which returns the sum of the log-probabilities of the union of the extended children of self.stochastics. This quantity is one term in the log of the Metropolis-Hastings acceptance ratio. The logp_plus_loglike property gives the sum of that and the log-probabilities of self.stochastics.
Metropolis-Hastings step methods¶
A Metropolis-Hastings step method only needs to implement the following methods, which are called by Metropolis.step():
reject()
: Usually just

    def reject(self):
        self.rejected += 1
        for s in self.stochastics:
            s.value = s.last_value
propose()
: Sets the values of all self.stochastics to new, proposed values. This method may use the adaptive_scale_factor attribute to take advantage of the standard tuning scheme.
Metropolis-Hastings step methods may also override the tune and competence methods.
Metropolis-Hastings step methods with asymmetric jumping distributions may implement a method called hastings_factor(), which returns the log of the ratio of the ‘reverse’ and ‘forward’ proposal probabilities. Note that no accept() method is needed or used.
By convention, Metropolis-Hastings step methods use attributes called accepted and rejected to log their performance.
Gibbs step methods¶
Gibbs step methods handle conjugate submodels. These models usually have two components: the ‘parent’ and the ‘children’. For example, a gamma-distributed variable serving as the precision of several normally-distributed variables is a conjugate submodel; the gamma variable is the parent and the normal variables are the children.
This section describes PyMC’s current scheme for Gibbs step methods, several of which are in a semi-working state in the sandbox directory. It is meant to be as generic as possible to minimize code duplication, but it is admittedly complicated. Feel free to subclass StepMethod directly when writing Gibbs step methods if you prefer.
Gibbs step methods that subclass PyMC’s Gibbs should define the following class attributes:
child_class
: The class of the children in the submodels the step method can handle.

parent_class
: The class of the parent.

parent_label
: The label the children would apply to the parent in a conjugate submodel. In the gamma-normal example, this would be tau.

linear_OK
: A flag indicating whether the children can use linear combinations involving the parent as their actual parent without destroying the conjugacy.
A subclass of Gibbs that defines these attributes only needs to implement a propose() method, which will be called by Gibbs.step(). The resulting step method will be able to handle both conjugate and ‘non-conjugate’ cases. The conjugate case corresponds to an actual conjugate submodel. In the non-conjugate case, all the children are of the required class, but the parent is not. In this case the parent’s value is proposed from the likelihood and accepted based on its prior. The acceptance rate in the non-conjugate case will be less than one.
The inherited class method Gibbs.competence will determine the new step method’s ability to handle a variable \(x\) by checking whether:

- all of \(x\)’s children are of class child_class, and either apply parent_label to \(x\) directly or (if linear_OK=True) to a LinearCombination object (chapter Building models), one of whose parents contains \(x\), and
- \(x\) is of class parent_class.

If both conditions are met, pymc.conjugate_Gibbs_competence will be returned. If only the first is met, pymc.nonconjugate_Gibbs_competence will be returned.
New fitting algorithms¶
PyMC provides a convenient platform for non-MCMC fitting algorithms in addition to MCMC. All fitting algorithms should be implemented by subclasses of Model. There are virtually no restrictions on fitting algorithms, but many of Model’s behaviors may be useful. See Chapter Fitting Models.
Monte Carlo fitting algorithms¶
Unless there is a good reason to do otherwise, Monte Carlo fitting algorithms should be implemented by subclasses of Sampler to take advantage of the interactive sampling feature and database backends. Subclasses using the standard sample() and isample() methods must define one of two methods:
draw()
: If it is possible to generate an independent sample from the posterior at every iteration, the draw method should do so. The default _loop method can be used in this case.

_loop()
: If it is not possible to implement a draw() method, but you want to take advantage of the interactive sampling option, you should override _loop(). This method is responsible for generating the posterior samples and calling tally() when it is appropriate to save the model’s state. In addition, _loop should monitor the sampler’s status attribute at every iteration and respond appropriately. The possible values of status are:

  'ready'
  : Ready to sample.

  'running'
  : Sampling should continue as normal.

  'halt'
  : Sampling should halt as soon as possible. _loop should call the halt() method and return control. _loop can set the status to 'halt' itself if appropriate (e.g. the database is full or a KeyboardInterrupt has been caught).

  'paused'
  : Sampling should pause as soon as possible. _loop should return, but should be able to pick up where it left off next time it’s called.
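As an illustrative sketch, consider a model with no observed stochastics, so that its prior and posterior coincide; such a sampler can generate independent joint samples by simply forward-sampling the model. The draw_from_prior method is assumed here to be inherited from Model (see the chapter Fitting Models):

import pymc

class ForwardSampler(pymc.Sampler):
    """Draws independent joint samples by forward-sampling the priors."""
    # Valid only for models containing no observed stochastics
    def draw(self):
        self.draw_from_prior()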
Samplers may alternatively want to override the default sample() method. In that case, they should call the tally() method whenever it is appropriate to save the current model state. Like custom _loop() methods, custom sample() methods should handle KeyboardInterrupts and call the halt() method when sampling terminates to finalize the traces.
A second warning: Don’t update stochastic variables’ values in-place¶
If you’re going to implement a new step method, fitting algorithm or unusual (non-numeric-valued) Stochastic subclass, you should understand the issues related to in-place updates of Stochastic objects’ values. Fitting methods should never update variables’ values in-place, for two reasons:
- In algorithms that involve accepting and rejecting proposals, the ‘pre-proposal’ value needs to be preserved uncorrupted. It would be possible to make a copy of the pre-proposal value and then allow in-place updates, but in PyMC we have chosen to store the pre-proposal value as Stochastic.last_value and require proposed values to be new objects. In-place updates would corrupt Stochastic.last_value, and this would cause problems.
- LazyFunction’s caching scheme checks variables’ current values against its internal cache by reference. That means if you update a variable’s value in-place, it or its child may miss the update and incorrectly skip recomputing its value or log-probability.
However, a Stochastic object’s value can make in-place updates to itself if the updates don’t change its identity. For example, the Stochastic subclass gp.GP is valued as a gp.Realization object. GP realizations represent random functions, which are infinite-dimensional stochastic processes, as literally as possible. The strategy they employ is to ‘self-discover’ on demand: when they are evaluated, they generate the required value conditional on previous evaluations and then make an internal note of it. This is an in-place update, but it is done to provide the same behavior as a single random function whose value everywhere has been determined since it was created.
Probability distributions¶
PyMC provides a large suite of built-in probability distributions. For each distribution, it provides:
- A function that evaluates its log-probability or log-density: normal_like().
- A function that draws random variables: rnormal().
- A function that computes the expectation associated with the distribution: normal_expval().
- A Stochastic subclass generated from the distribution: Normal.
This section describes the likelihood functions of these distributions.
Discrete distributions¶
Continuous distributions¶
Multivariate discrete distributions¶
Multivariate continuous distributions¶
Conclusion¶
MCMC is a surprisingly difficult and bug-prone algorithm to implement by hand. We find PyMC makes it much easier and less stressful. PyMC also makes our work more dynamic; getting hand-coded MCMCs working used to be so much work that we were reluctant to change anything, but with PyMC changing models is much easier.
We welcome new contributors at all levels. If you would like to contribute new code, improve documentation, share your results or provide ideas for new features, please introduce yourself on our mailing list. Our wiki page also hosts a number of tutorials and examples from users that could give you some ideas. We have taken great care to make the code easy to extend, whether by adding new database backends, step methods or entirely new sampling algorithms.
Acknowledgements¶
The authors would like to thank several users of PyMC who have been particularly helpful during the development of the 2.0 release. In alphabetical order, these are Mike Conroy, Abraham Flaxman, J. Miguel Marin, Aaron MacNeil, Nick Matsakis, John Salvatier, Andrew Straw and Thomas Wiecki.
Anand Patil’s work on PyMC has been supported since 2008 by the Malaria Atlas Project, principally funded by the Wellcome Trust.
David Huard’s early work on PyMC was supported by a scholarship from the Natural Sciences and Engineering Research Council of Canada.
Appendix: Markov Chain Monte Carlo¶
Monte Carlo Methods in Bayesian Analysis¶
Bayesian analysis often requires integration over multiple dimensions that is intractable via both analytic methods and standard methods of numerical integration. However, it is often possible to compute these integrals by simulating (drawing samples) from posterior distributions. For example, consider the expected value of a random variable \(\mathbf{x}\):

\[E[{\bf x}] = \int_{\mathbb{R}^k} {\bf x}\, f({\bf x})\, d{\bf x}\]
where \(k\) (the dimension of vector \(x\)) is perhaps very large. If we can produce a reasonable number of random vectors \(\{{\bf x_i}\}\), we can use these values to approximate the unknown integral. This process is known as Monte Carlo integration. In general, MC integration allows integrals against probability density functions:

\[I = \int h({\bf x})\, f({\bf x})\, d{\bf x}\]
to be estimated by finite sums:

\[\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h({\bf x}_i)\]
where \(\mathbf{x}_i\) is a sample from \(f\). This estimate is valid and useful because:
- By the strong law of large numbers, the estimate converges to the true value:

\[\hat{I} \rightarrow I \quad \text{with probability 1 as } n \rightarrow \infty\]

- Simulation error can be measured and controlled:

\[\text{Var}(\hat{I}) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left( h({\bf x}_i) - \hat{I} \right)^2\]
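A minimal numerical illustration: estimating \(E[x^2]\) for \(x \sim N(0,1)\) (true value 1), together with its simulation standard error:

import numpy as np

x = np.random.normal(size=100000)      # samples from f
h = x**2                               # h(x)
I_hat = h.mean()                       # Monte Carlo estimate of E[h(x)]
se = h.std(ddof=1) / np.sqrt(h.size)   # simulation standard error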
Why is this relevant to Bayesian analysis? If we replace \(f(\mathbf{x})\) with a posterior, \(f(\theta|d)\), and make \(h(\theta)\) an interesting function of the unknown parameter, the resulting expectation is that of the posterior of \(h(\theta)\):

\[E[h(\theta) \mid d] = \int h(\theta)\, f(\theta \mid d)\, d\theta\]
Rejection Sampling¶
Though Monte Carlo integration allows us to estimate integrals that are unassailable by analysis and standard numerical methods, it relies on the ability to draw samples from the posterior distribution. For known parametric forms, this is not a problem; probability integral transforms or bivariate techniques (e.g. the Box-Muller method) may be used to obtain samples from uniform pseudo-random variates generated by a computer. Often, however, we cannot readily generate random values from non-standard posteriors. In such instances, we can use rejection sampling to generate samples.
Posit a function \(f(x)\) that can be evaluated for any value on the support of \(x:S_x = [A,B]\), but may not be integrable or easily sampled from. If we can calculate the maximum value of \(f(x)\), we can then define a rectangle that is guaranteed to contain all possible values \((x,f(x))\). It is then trivial to generate points over the box and enumerate the values that fall under the curve (Figure Rejection sampling of a bounded form. Area is estimated by the ratio of accepted (open squares) to total points, multiplied by the rectangle area.).
This approach is useful, for example, in estimating the normalizing constant for posterior distributions.
If \(f(x)\) has unbounded support (i.e. infinite tails), such as a Gaussian distribution, a bounding box is no longer appropriate. We must specify a majorizing (or enveloping) function, \(g(x)\), which implies:

\[c g(x) \ge f(x) \quad \forall x \in (-\infty, \infty)\]
Having done this, we can now sample \({x_i}\) from \(g(x)\) and accept or reject each of these values based upon \(f(x_i)\). Specifically, for each draw \(x_i\), we also draw a uniform random variate \(u_i\) and accept \(x_i\) if \(u_i < f(x_i)/cg(x_i)\), where \(c\) is a constant (Figure Rejection sampling of an unbounded form using an enveloping distribution.). This approach is made more efficient by choosing an enveloping distribution that is “close” to the target distribution, thus maximizing the number of accepted points. Further improvement is gained by using optimized algorithms such as importance sampling which, as the name implies, samples more frequently from important areas of the distribution.
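A sketch of the unbounded case: sampling a standard normal target by rejection from a Cauchy envelope. The constant c = 1.6 is chosen so that \(cg(x) \ge f(x)\) everywhere (the maximum of \(f/g\) is about 1.52):

import numpy as np

def rejection_sample(n, c=1.6):
    """Sample a standard normal by rejection from a Cauchy envelope."""
    f = lambda x: np.exp(-x**2 / 2.) / np.sqrt(2 * np.pi)  # target density
    g = lambda x: 1. / (np.pi * (1. + x**2))               # envelope density
    out = []
    while len(out) < n:
        x = np.random.standard_cauchy()     # draw from g
        u = np.random.uniform()
        if u < f(x) / (c * g(x)):           # accept with probability f/(c*g)
            out.append(x)
    return np.array(out)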
Rejection sampling is usually subject to declining performance as the dimension of the parameter space increases, so it is used less frequently than MCMC for evaluation of posterior distributions [Gamerman1997].
Markov Chains¶
A Markov chain is a special type of stochastic process. The standard definition of a stochastic process is an ordered collection of random variables:

\[\{X_t : t \in T\}\]
where \(t\) is frequently (but not necessarily) a time index. If we think of \(X_t\) as a state \(X\) at time \(t\), and invoke the following dependence condition on each state:

\[Pr(X_{t+1} = x_{t+1} \mid X_t = x_t, X_{t-1} = x_{t-1}, \ldots, X_0 = x_0) = Pr(X_{t+1} = x_{t+1} \mid X_t = x_t)\]
then the stochastic process is known as a Markov chain. This conditioning specifies that the future depends on the current state, but not past states. Thus, the Markov chain wanders about the state space, remembering only where it has just been in the last time step. The collection of transition probabilities is sometimes called a transition matrix when dealing with discrete states, or more generally, a transition kernel.
In the context of Markov chain Monte Carlo, it is useful to think of the Markovian property as “mild non-independence”. MCMC allows us to indirectly generate independent samples from a particular posterior distribution.
Jargon-busting¶
Before we move on, it is important to define some general properties of Markov chains. They are frequently encountered in the MCMC literature, and some will help us decide whether MCMC is producing a useful sample from the posterior.
- Homogeneity:
A Markov chain is homogeneous at step \(t\) if the transition probabilities are independent of time \(t\).
- Irreducibility:
A Markov chain is irreducible if every state is accessible in one or more steps from any other state. That is, the chain contains no absorbing states. This implies that there is a non-zero probability of eventually reaching state \(k\) from any other state in the chain.
- Recurrence:
States which are visited repeatedly are recurrent. If the expected time to return to a particular state is bounded, this is known as positive recurrence, otherwise the recurrent state is null recurrent. Further, a chain is Harris recurrent when it visits all states \(X \in S\) infinitely often in the limit as \(t \to \infty\); this is an important characteristic when dealing with unbounded, continuous state spaces. Whenever a chain ends up in a closed, irreducible set of Harris recurrent states, it stays there forever and visits every state with probability one.
- Stationarity:
A stationary Markov chain produces the same marginal distribution when multiplied by the transition kernel. Thus, if \(P\) is some \(n \times n\) transition matrix:
\[{\bf \pi P} = {\bf \pi}\]
where \(\pi\) is the stationary distribution of the Markov chain. Thus, \(\pi\) is no longer subscripted, and is referred to as the limiting distribution of the chain. In MCMC, the chain explores the state space according to its limiting marginal distribution.
- Ergodicity:
Ergodicity is an emergent property of Markov chains which are irreducible, positive Harris recurrent and aperiodic. Ergodicity is defined as:
\[\lim_{n \to \infty} Pr^{(n)}(\theta_i \rightarrow \theta_j) = \pi(\theta_j) \quad \forall \theta_i, \theta_j \in \Theta\]
or in words, after many steps the marginal distribution of the chain is the same at one step as at all other steps. This implies that our Markov chain, which we recall is dependent, can generate samples that are independent if we wait long enough between samples. If it means anything to you, ergodicity is the analogue of the strong law of large numbers for Markov chains. For example, take values \(\theta_{i+1},\ldots,\theta_{i+n}\) from a chain that has reached an ergodic state. A statistic of interest can then be estimated by:
\[\frac{1}{n}\sum_{j=i+1}^{i+n} h(\theta_j) \approx \int f(\theta) h(\theta) d\theta\]
Why MCMC Works: Reversible Markov Chains¶
Markov chain Monte Carlo simulates a Markov chain for which some function of interest (e.g. the joint distribution of the parameters of some model) is the unique, invariant limiting distribution. An invariant distribution with respect to some Markov chain with transition kernel \(Pr(y \mid x)\) implies that:

\[\pi(y) = \int \pi(x)\, Pr(y \mid x)\, dx\]
Invariance is guaranteed for any reversible Markov chain. Consider a Markov chain in reverse sequence: \(\{\theta^{(n)},\theta^{(n-1)},...,\theta^{(0)}\}\). This sequence is still Markovian, because:

\[Pr(\theta^{(k)} = y \mid \theta^{(k+1)}, \theta^{(k+2)}, \ldots) = Pr(\theta^{(k)} = y \mid \theta^{(k+1)})\]
Forward and reverse transition probabilities may be related through Bayes theorem:

\[Pr(\theta^{(k)} = y \mid \theta^{(k+1)} = x) = \frac{Pr(\theta^{(k+1)} = x \mid \theta^{(k)} = y)\, \pi^{(k)}(y)}{\pi^{(k+1)}(x)}\]
Though not homogeneous in general, the reversed chain becomes homogeneous if either:

- \(n \rightarrow \infty\), or
- \(\pi^{(i)}=\pi\) for some \(i < k\)
If this chain is homogeneous it is called reversible, because it satisfies the detailed balance equation:

\[\pi(x)\, Pr(y \mid x) = \pi(y)\, Pr(x \mid y)\]
Reversibility is important because it has the effect of balancing movement through the entire state space. When a Markov chain is reversible, \(\pi\) is the unique, invariant, stationary distribution of that chain. Hence, if \(\pi\) is of interest, we need only find the reversible Markov chain for which \(\pi\) is the limiting distribution. This is what MCMC does!
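As a concrete check, consider a two-state chain with an illustrative transition matrix; its stationary distribution satisfies both invariance (\(\pi P = \pi\)) and detailed balance:

import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # transition matrix
pi = np.array([2 / 3., 1 / 3.])     # stationary distribution

assert np.allclose(pi.dot(P), pi)                     # pi P = pi
assert np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0])   # detailed balance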
Gibbs Sampling¶
The Gibbs sampler is the simplest and most prevalent MCMC algorithm. If a posterior has \(k\) parameters to be estimated, we may condition each parameter on current values of the other \(k-1\) parameters, and sample from the resultant distributional form (usually easier), and repeat this operation on the other parameters in turn. This procedure generates samples from the posterior distribution. Note that we have now combined Markov chains (conditional independence) and Monte Carlo techniques (estimation by simulation) to yield Markov chain Monte Carlo.
Here is a stereotypical Gibbs sampling algorithm:
Choose starting values for the states (parameters): \({\bf \theta} = [\theta_1^{(0)},\theta_2^{(0)},\ldots,\theta_k^{(0)}]\)

Initialize counter \(j=1\)

Draw the following values from each of the \(k\) conditional distributions:

\begin{eqnarray*} \theta_1^{(j)} &\sim& \pi(\theta_1 | \theta_2^{(j-1)},\theta_3^{(j-1)},\ldots,\theta_{k-1}^{(j-1)},\theta_k^{(j-1)}) \\ \theta_2^{(j)} &\sim& \pi(\theta_2 | \theta_1^{(j)},\theta_3^{(j-1)},\ldots,\theta_{k-1}^{(j-1)},\theta_k^{(j-1)}) \\ \theta_3^{(j)} &\sim& \pi(\theta_3 | \theta_1^{(j)},\theta_2^{(j)},\ldots,\theta_{k-1}^{(j-1)},\theta_k^{(j-1)}) \\ \vdots \\ \theta_{k-1}^{(j)} &\sim& \pi(\theta_{k-1} | \theta_1^{(j)},\theta_2^{(j)},\ldots,\theta_{k-2}^{(j)},\theta_k^{(j-1)}) \\ \theta_k^{(j)} &\sim& \pi(\theta_k | \theta_1^{(j)},\theta_2^{(j)},\theta_3^{(j)},\ldots,\theta_{k-2}^{(j)},\theta_{k-1}^{(j)}) \end{eqnarray*}

Increment \(j\) and repeat until convergence occurs.

As we can see from the algorithm, each distribution is conditioned on the last iteration of its chain values, constituting a Markov chain as advertised. The Gibbs sampler has all of the important properties outlined in the previous section: it is aperiodic, homogeneous and ergodic. Once the sampler converges, all subsequent samples are from the target distribution. This convergence occurs at a geometric rate.
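As a concrete (non-PyMC) sketch, consider Gibbs sampling a bivariate standard normal with correlation \(\rho\), where each full conditional is itself a univariate normal:

import numpy as np

def gibbs_bivariate_normal(rho, n_samples):
    """Gibbs sampler for a bivariate standard normal with correlation rho."""
    samples = np.empty((n_samples, 2))
    x, y = 0., 0.                       # starting values
    sd = np.sqrt(1. - rho**2)           # conditional standard deviation
    for j in range(n_samples):
        x = np.random.normal(rho * y, sd)   # x | y ~ N(rho*y, 1 - rho^2)
        y = np.random.normal(rho * x, sd)   # y | x ~ N(rho*x, 1 - rho^2)
        samples[j] = x, y
    return samples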
The Metropolis-Hastings Algorithm¶
The key to success in applying the Gibbs sampler to the estimation of Bayesian posteriors is being able to specify the form of the complete conditionals of \({\bf \theta}\). In fact, the algorithm cannot be implemented without them. Of course, the posterior conditionals cannot always be neatly specified. In contrast to the Gibbs algorithm, the Metropolis-Hastings algorithm generates candidate state transitions from an alternate distribution, and accepts or rejects each candidate probabilistically.
Let us first consider a simple Metropolis-Hastings algorithm for a single parameter, \(\theta\). We will use a standard sampling distribution, referred to as the proposal distribution, to produce candidate variables \(q_t(\theta^{\prime} | \theta)\). That is, the generated value, \(\theta^{\prime}\), is a possible next value for \(\theta\) at step \(t+1\). We also need to be able to calculate the probability of moving back to the original value from the candidate, or \(q_t(\theta | \theta^{\prime})\). These probabilistic ingredients are used to define an acceptance ratio:

\[a(\theta^{\prime},\theta) = \frac{q_t(\theta \mid \theta^{\prime})\, \pi(\theta^{\prime})}{q_t(\theta^{\prime} \mid \theta)\, \pi(\theta)}\]
The value of \(\theta^{(t+1)}\) is then determined by:

\[\theta^{(t+1)} = \begin{cases} \theta^{\prime} & \text{with probability } \min(a(\theta^{\prime},\theta),\, 1) \\ \theta^{(t)} & \text{otherwise} \end{cases}\]
This transition kernel implies that movement is not guaranteed at every step. It only occurs if the suggested transition is likely based on the acceptance ratio.
A single iteration of the Metropolis-Hastings algorithm proceeds as follows:
- Sample \(\theta^{\prime}\) from \(q(\theta^{\prime} | \theta^{(t)})\).
- Generate a Uniform[0,1] random variate \(u\).
- If \(a(\theta^{\prime},\theta) > u\) then \(\theta^{(t+1)} = \theta^{\prime}\), otherwise \(\theta^{(t+1)} = \theta^{(t)}\).

The original form of the algorithm specified by Metropolis required that \(q_t(\theta^{\prime} | \theta) = q_t(\theta | \theta^{\prime})\), which reduces \(a(\theta^{\prime},\theta)\) to \(\pi(\theta^{\prime})/\pi(\theta)\), but this is not necessary. In either case, the state moves to high-density points in the distribution with high probability, and to low-density points with low probability. After convergence, the Metropolis-Hastings algorithm describes the full target posterior density, so all points are recurrent.
Random-walk Metropolis-Hastings¶
A practical implementation of the Metropolis-Hastings algorithm makes use of a random-walk proposal. Recall that a random walk is a Markov chain that evolves according to:

\[\theta^{(t+1)} = \theta^{(t)} + \epsilon_t, \qquad \epsilon_t \sim f(\phi)\]

where \(f\) is some error distribution with parameters \(\phi\).
As applied to MCMC sampling, the random walk is used as a proposal distribution, whereby dependent proposals are generated according to:

\[\theta^{\prime} = \theta^{(t)} + \epsilon_t\]
Generally, the density generating \(\epsilon_t\) is symmetric about zero, resulting in a symmetric chain. Chain symmetry implies that \(q(\theta^{\prime} | \theta^{(t)}) = q(\theta^{(t)} | \theta^{\prime})\), which reduces the Metropolis-Hastings acceptance ratio to:

\[a(\theta^{\prime},\theta) = \frac{\pi(\theta^{\prime})}{\pi(\theta)}\]
The choice of the random walk distribution for \(\epsilon_t\) is frequently a normal or Student’s \(t\) density, but it may be any distribution that generates an irreducible proposal chain.
An important consideration is the specification of the scale parameter for the random walk error distribution. Large values produce random walk steps that are highly exploratory, but tend to produce proposal values in the tails of the target distribution, potentially resulting in very small acceptance rates. Conversely, small values tend to be accepted more frequently, since they tend to produce proposals close to the current parameter value, but may result in chains that mix very slowly. Some simulation studies suggest optimal acceptance rates in the range of 20-50%. It is often worthwhile to optimize the proposal variance by iteratively adjusting its value, according to observed acceptance rates early in the MCMC simulation [Gamerman1997].
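The following pure-Python sketch implements random-walk Metropolis for a univariate target known up to a constant, using the simplified acceptance ratio above; the standard normal target is merely illustrative:

import numpy as np

def rw_metropolis(logpi, theta0, proposal_sd, n_samples):
    """Random-walk Metropolis with a symmetric normal proposal."""
    theta = theta0
    trace = np.empty(n_samples)
    accepted = 0
    for t in range(n_samples):
        theta_p = theta + np.random.normal(0., proposal_sd)   # proposal
        # Accept with probability min(1, pi(theta_p) / pi(theta))
        if np.log(np.random.uniform()) < logpi(theta_p) - logpi(theta):
            theta = theta_p
            accepted += 1
        trace[t] = theta
    return trace, accepted / float(n_samples)

# Example: standard normal target; the acceptance rate should land in a
# healthy range when proposal_sd is near the target's scale
trace, rate = rw_metropolis(lambda th: -0.5 * th**2, 0., 1., 10000)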
List of References¶
[Akaike1973] H. Akaike. Information theory as an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki, editors, Second International Symposium on Information Theory. Akademiai Kiado, Budapest, 1973.

[Bernardo1992] J.M. Bernardo, J. Berger, A.P. Dawid, and J.F.M. Smith, editors. Bayesian Statistics 4. Oxford University Press, Oxford, 1992.

[Brooks2000] S.P. Brooks, E.A. Catchpole, and B.J.T. Morgan. Bayesian animal survival estimation. Statistical Science, 15:357–376, 2000.

[Burnham2002] K.P. Burnham and D.R. Anderson. Model Selection and Multi-Model Inference: A Practical, Information-theoretic Approach. Springer, New York, 2002.

[Christakos2002]

[Gamerman1997] D. Gamerman. Markov Chain Monte Carlo: Statistical Simulation for Bayesian Inference. Chapman and Hall, London, 1997.

[Gelman1992] A. Gelman and D.B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7:457–472, 1992.

[Gelman1996] A. Gelman, X.L. Meng, and H.S. Stern. Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6:733–807, 1996.

[Gelman2004] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis, second edition. Chapman and Hall/CRC, Boca Raton, 2004.

[Geweke1992] J. Geweke. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In J.M. Bernardo, J. Berger, A.P. Dawid, and J.F.M. Smith, editors, Bayesian Statistics 4. Oxford University Press, Oxford, 1992.

[Haario2001] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.

[Jarrett1979] R.G. Jarrett. A note on the intervals between coal mining disasters. Biometrika, 66:191–193, 1979.

[Jaynes2003] E.T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

[Jordan2004] M.I. Jordan. Graphical models. Statistical Science, 19(1):140–155, 2004.

[Kerman2004]

[Langtangen2009] Hans Petter Langtangen. Python Scripting for Computational Science. Springer-Verlag, 2009.

[Lauritzen1990] S.L. Lauritzen, A.P. Dawid, B.N. Larsen, and H.G. Leimer. Independence properties of directed Markov fields. Networks, 20:491–505, 1990.

[Lutz2007]

[Neal2003] R.M. Neal. Slice sampling. Annals of Statistics, 2003. doi:10.2307/3448413.

[Oberhumer2008] M.F.X.J. Oberhumer. LZO Real-Time Data Compression Library. 2008. URL http://www.oberhumer.com/opensource/lzo/.

[Plummer2008] M. Plummer, N. Best, K. Cowles, and K. Vines. coda: Output analysis and diagnostics for MCMC. R package, 2008.

[R2010] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010. ISBN 3-900051-07-0. URL http://www.R-project.org/.

[Raftery1995a] A.E. Raftery and S.M. Lewis. The number of iterations, convergence diagnostics and generic Metropolis algorithms. In W.R. Gilks, D.J. Spiegelhalter, and S. Richardson, editors, Practical Markov Chain Monte Carlo. Chapman and Hall, London, U.K., 1995.

[Raftery1995b] A.E. Raftery and S.M. Lewis. Gibbsit Version 2.0. 1995. URL http://lib.stat.cmu.edu/general/gibbsit/.

[Roelofs2010]

[Roberts2007] G.O. Roberts and J.S. Rosenthal. Implementing componentwise Hastings algorithms. Journal of Applied Probability, 44(2):458–475, 2007.

[Schwarz1978] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

[Seward2007]

[vanRossum2010]