Home¶
pomegranate is a Python package which implements fast, efficient, and extremely flexible probabilistic models ranging from probability distributions to Bayesian networks to mixtures of hidden Markov models. The most basic level of probabilistic modeling is a simple probability distribution. If we’re modeling language, this may be a simple distribution over the frequency of all possible words a person can say.
The next level up are probabilistic models which use the simple distributions in more complex ways. A Markov chain can extend a simple probability distribution to say that the probability of a certain word depends on the word(s) which have been said previously. A hidden Markov model may say that the probability of a certain word depends on the latent/hidden state of the previous word, such as a noun usually following an adjective.
 Markov Chains
 Bayes Classifiers and Naive Bayes
 General Mixture Models
 Hidden Markov Models
 Bayesian Networks
 Factor Graphs
The third level are stacks of probabilistic models which can model even more complex phenomena. If a single hidden Markov model can capture a dialect of a language (such as a certain person’s speech usage) then a mixture of hidden Markov models may fine-tune this to be situation specific. For example, a person may use more formal language at work and more casual language when speaking with friends. By modeling this as a mixture of HMMs, we represent the person’s language as a “mixture” of these dialects.
 GMMHMMs
 Mixtures of Models
 Bayesian Classifiers of Models
Installation¶
pomegranate is pip installable using `pip install pomegranate`. You can get the bleeding edge from GitHub using the following:
git clone https://github.com/jmschrei/pomegranate
cd pomegranate
python setup.py install
On Windows machines you may need to download a C++ compiler. For Python 2 this minimal version of Visual Studio 2008 works well. For Python 3 this version of the Visual Studio build tools has been reported to work.
No good project is done alone, and so I’d like to thank all the previous contributors to YAHMM and all the current contributors to pomegranate as well as the graduate students whom I have pestered with ideas. Contributions are eagerly accepted! If you would like to contribute a feature then fork the master branch and be sure to run the tests before changing any code. Let us know what you want to do on the issue tracker just in case we’re already working on an implementation of something similar. Also, please don’t forget to add tests for any new functions.
FAQ¶
How can I cite pomegranate?
I don’t currently have a research paper which can be cited, but the GitHub repository can be.
@misc{Schreiber2016,
author = {Jacob Schreiber},
title = {pomegranate},
year = {2016},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/jmschrei/pomegranate}},
commit = {enter commit that you used}
}
How does pomegranate compare to other packages?
A comparison of the features between pomegranate and others in the python ecosystem can be seen in the following two plots.
The plot on the left shows model stacks which are currently supported by pomegranate. The rows show each model, and the columns show which models those can fit in. Dark blue shows model stacks which currently are supported, and light blue shows model stacks which are currently being worked on and should be available soon. For example, all models use basic distributions as their main component. However, general mixture models (GMMs) can be fit into both Naive Bayes classifiers and hidden Markov models (HMMs). Conversely, HMMs can be fit into GMMs to form mixtures of HMMs. Soon pomegranate will support models like a mixture of Bayesian networks.
The plot on the right shows features compared to other packages in the Python ecosystem. Dark red indicates features which no other package supports (to my knowledge!) and orange shows areas where pomegranate has an expanded feature set compared to other packages. For example, both pomegranate and sklearn support Gaussian naive Bayes classifiers. However, pomegranate supports naive Bayes of arbitrary distributions and combinations of distributions, such as one feature being Gaussian, one being log normal, and one being exponential (useful to classify things like ionic current segments or audio segments). pomegranate also extends naive Bayes past its “naivety” to allow for features to be dependent on each other, and allows input to be more complex things like hidden Markov models and Bayesian networks. There’s no rule that each of the inputs to naive Bayes has to be the same type though, allowing you to do things like compare a Markov chain to an HMM. No other package supports an HMM Naive Bayes! Packages like hmmlearn support the GMMHMM, but for them GMM strictly means Gaussian mixture model, whereas in pomegranate it can be a Gaussian mixture model, but it can also be an arbitrary mixture model of any types of distributions. Lastly, no other package supports mixtures of HMMs despite their prominent use in things like audio decoding and biological sequence analysis.
Models can be stacked more than once, though. For example, a “naive” Bayes classifier can be used to compare multiple mixtures of HMMs to each other, or compare an HMM with GMM emissions to one without GMM emissions. You can also create mixtures of HMMs with GMM emissions, and so the most stacking currently supported is a “naive” Bayes classifier of mixtures of HMMs with GMM emissions, or four levels of stacking.
How can pomegranate be faster than numpy?
pomegranate has been shown to be faster than numpy at updating univariate and multivariate Gaussians. One of the reasons is that when you use numpy you have to use `numpy.mean(X)` and `numpy.cov(X)`, which requires two full passes of the data. pomegranate uses additive sufficient statistics to reduce a dataset down to a fixed set of numbers which can be used to get an exact update. This allows pomegranate to calculate both mean and covariance in a single pass of the dataset. In addition, one of the reasons that numpy is so fast is its use of BLAS. pomegranate also uses BLAS, but uses the Cython-level calls to BLAS so that the data doesn’t have to pass between Cython and Python multiple times.
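The single-pass idea can be sketched in plain numpy. This is only an illustration of additive sufficient statistics, not pomegranate's Cython implementation; the helper names `summarize` and `from_summaries` are borrowed from the method names used elsewhere in this documentation, and the batch size of 1000 is arbitrary.

```python
import numpy

def summarize(X, stats=None):
    """Add one batch to the running statistics (n, sum of x, sum of x x^T)."""
    X = numpy.asarray(X, dtype=float)
    if stats is None:
        d = X.shape[1]
        stats = [0.0, numpy.zeros(d), numpy.zeros((d, d))]
    stats[0] += X.shape[0]
    stats[1] += X.sum(axis=0)
    stats[2] += X.T @ X
    return stats

def from_summaries(stats):
    """Turn the accumulated statistics into the MLE mean and covariance."""
    n, sum_x, sum_xxT = stats
    mean = sum_x / n
    cov = sum_xxT / n - numpy.outer(mean, mean)
    return mean, cov

rng = numpy.random.RandomState(0)
X = rng.normal(3, 5, size=(5000, 3))

# Each sample is visited exactly once, one batch at a time.
stats = None
for start in range(0, len(X), 1000):
    stats = summarize(X[start:start + 1000], stats)
mean, cov = from_summaries(stats)
```

The result agrees with `X.mean(axis=0)` and `numpy.cov(X, rowvar=False, bias=True)` up to floating point error. Note that this textbook formula can lose precision when the mean is very large relative to the spread of the data.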
Does pomegranate support parallelization?
Yes! pomegranate supports parallelized model fitting and model predictions, both in a data-parallel manner. Since the backend is written in Cython, the global interpreter lock (GIL) can be released and multithreaded training can be supported via joblib. This means that when parallelization is utilized, time isn’t spent piping data from one process to another, nor are multiple copies of the model made.
Does pomegranate support GPUs?
Currently pomegranate does not support GPUs.
Does pomegranate support distributed computing?
Currently pomegranate is not set up for a distributed environment, though the pieces are currently there to make this possible.
Out of Core¶
Sometimes datasets which we’d like to train on can’t fit in memory but we’d still like to get an exact update. pomegranate supports out-of-core training to allow this, by allowing models to summarize batches of data into sufficient statistics and then later on using these sufficient statistics to get an exact update for model parameters. These are done through the methods `model.summarize` and `model.from_summaries`. Let’s see an example of using it to update a normal distribution.
>>> from pomegranate import *
>>> import numpy
>>>
>>> a = NormalDistribution(1, 1)
>>> b = NormalDistribution(1, 1)
>>> X = numpy.random.normal(3, 5, size=(5000,))
>>>
>>> a.fit(X)
>>> a
{
"frozen" :false,
"class" :"Distribution",
"parameters" :[
3.012692830297519,
4.972082359070984
],
"name" :"NormalDistribution"
}
>>> for i in range(5):
...     b.summarize(X[i*1000:(i+1)*1000])
...
>>> b.from_summaries()
>>> b
{
"frozen" :false,
"class" :"Distribution",
"parameters" :[
3.01269283029752,
4.972082359070983
],
"name" :"NormalDistribution"
}
This is a simple example with a simple distribution, but all models and model stacks support this type of learning. Let’s next look at a simple Bayesian network.
We can see that before fitting to any data, the distribution in one of the states is equal for both. After fitting the first distribution they become different as would be expected. After fitting the second one through summarize the distributions become equal again, showing that it is recovering an exact update.
It’s easy to see how one could use this to update models which don’t use Expectation Maximization (EM) to train. Because EM is an iterative algorithm that makes repeated passes over the data, models which use EM to train have a `fit` wrapper which will allow you to load up batches of data from a numpy memory map to train on automatically.
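To make the batch-wise pattern concrete, here is a minimal sketch that streams batches from a numpy memory map into a summarize/from_summaries style object. `RunningNormal` is a hypothetical stand-in written for this example, not a pomegranate class, and the temporary file path and batch size are arbitrary assumptions.

```python
import math
import os
import tempfile

import numpy

class RunningNormal:
    """Hypothetical stand-in mimicking the summarize/from_summaries pattern."""

    def __init__(self):
        self.mu, self.sigma = 0.0, 1.0
        self.n = self.x_sum = self.x2_sum = 0.0

    def summarize(self, batch):
        # Fold one batch into the stored sufficient statistics.
        batch = numpy.asarray(batch, dtype=float)
        self.n += batch.size
        self.x_sum += batch.sum()
        self.x2_sum += (batch ** 2).sum()

    def from_summaries(self):
        # Update the parameters from the statistics, then clear them.
        self.mu = self.x_sum / self.n
        var = self.x2_sum / self.n - self.mu ** 2
        self.sigma = math.sqrt(max(var, 0.0))
        self.n = self.x_sum = self.x2_sum = 0.0

# Write some data to disk so that it can be memory mapped.
path = os.path.join(tempfile.mkdtemp(), "X.dat")
data = numpy.random.RandomState(0).normal(3, 5, size=5000)
data.tofile(path)

# Only the slice being summarized needs to be resident in memory.
X = numpy.memmap(path, dtype="float64", mode="r", shape=(5000,))
model = RunningNormal()
for start in range(0, X.shape[0], 1000):
    model.summarize(X[start:start + 1000])
model.from_summaries()
```

The fitted `model.mu` and `model.sigma` match `data.mean()` and `data.std()` computed on the full in-memory array, which is what makes the update exact rather than approximate.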
Probability Distributions¶
While probability distributions are frequently used as components of more complex models such as mixtures and hidden Markov models, they can also be used by themselves. Many data science tasks require fitting a distribution to data or generating samples under a distribution. pomegranate has a large library of both univariate and multivariate distributions which can be used with an intuitive interface.
Univariate Distributions
UniformDistribution 
A uniform distribution between two values. 
BernoulliDistribution 
A Bernoulli distribution describing the probability of a binary variable. 
NormalDistribution 
A normal distribution based on a mean and standard deviation. 
LogNormalDistribution 
Represents a lognormal distribution over nonnegative floats. 
ExponentialDistribution 
Represents an exponential distribution on nonnegative floats. 
PoissonDistribution 
A discrete probability distribution which expresses the probability of a number of events occurring in a fixed time window. 
BetaDistribution 
This distribution represents a beta distribution, parameterized using alpha/beta, which are both shape parameters. 
GammaDistribution 
This distribution represents a gamma distribution, parameterized in the alpha/beta (shape/rate) parameterization. 
DiscreteDistribution 
A discrete distribution, made up of characters and their probabilities, assuming that these probabilities will sum to 1.0. 
Kernel Densities
GaussianKernelDensity 
A quick way of storing points to represent a Gaussian kernel density in one dimension. 
UniformKernelDensity 
A quick way of storing points to represent a uniform kernel density in one dimension. 
TriangleKernelDensity 
A quick way of storing points to represent a triangle kernel density in one dimension. 
Multivariate Distributions
IndependentComponentsDistribution 
Allows you to create a multivariate distribution, where each distribution is independent of the others. 
MultivariateGaussianDistribution 
A multivariate Gaussian distribution, parameterized by a mean vector and a covariance matrix. 
DirichletDistribution 
A Dirichlet distribution, usually a prior for the multinomial distributions. 
ConditionalProbabilityTable 
A conditional probability table, which is dependent on values from at least one previous distribution but up to as many as you want to encode for. 
JointProbabilityTable 
A joint probability table. 
While there is a large variety of univariate distributions, multivariate distributions can be made from univariate distributions by using `IndependentComponentsDistribution` with the assumption that each column of data is independent from the other columns (instead of being related by a covariance matrix, like in multivariate Gaussians). Here is an example:
d1 = NormalDistribution(5, 2)
d2 = LogNormalDistribution(1, 0.3)
d3 = ExponentialDistribution(4)
d = IndependentComponentsDistribution([d1, d2, d3])
Initialization¶
Initializing a distribution is simple and done just by passing in the distribution parameters. For example, the parameters of a normal distribution are the mean (mu) and the standard deviation (sigma). We can initialize it as follows:
from pomegranate import *
a = NormalDistribution(5, 2)
However, frequently we don’t know the parameters of the distribution beforehand or would like to directly fit this distribution to some data. We can do this through the from_samples class method.
b = NormalDistribution.from_samples([3, 4, 5, 6, 7])
If we want to fit the model to weighted samples, we can just pass in an array of the relative weights of each sample as well.
b = NormalDistribution.from_samples([3, 4, 5, 6, 7], weights=[0.5, 1, 1.5, 1, 0.5])
Probability¶
Distributions are typically used to calculate the probability of some sample. This can be done using either the probability or log_probability methods.
a = NormalDistribution(5, 2)
a.log_probability(8)
-2.737085713764219
a.probability(8)
0.064758797832971712
b = NormalDistribution.from_samples([3, 4, 5, 6, 7], weights=[0.5, 1, 1.5, 1, 0.5])
b.log_probability(8)
-4.437779569430167
These methods work for univariate distributions, kernel densities, and multivariate distributions all the same. For a multivariate distribution you’ll have to pass in an array for the full sample.
d1 = NormalDistribution(5, 2)
d2 = LogNormalDistribution(1, 0.3)
d3 = ExponentialDistribution(4)
d = IndependentComponentsDistribution([d1, d2, d3])
X = [6.2, 0.4, 0.9]
d.log_probability(X)
-23.205411733352875
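Because the components are assumed independent, the joint log probability is simply the sum of the per-dimension log probabilities. The following sketch reproduces that calculation with the standard library only. It assumes that the normal distribution is parameterized by mean and standard deviation, the log normal by the mean and standard deviation of the log, and the exponential by its rate; the helper functions are written for this example and are not part of pomegranate.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal with mean mu and standard deviation sigma."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def lognormal_logpdf(x, mu, sigma):
    """Log density of a log normal whose logarithm is normal(mu, sigma)."""
    return normal_logpdf(math.log(x), mu, sigma) - math.log(x)

def exponential_logpdf(x, rate):
    """Log density of an exponential with the given rate."""
    return math.log(rate) - rate * x

X = [6.2, 0.4, 0.9]

# Independence means the joint density factorizes, so log densities add.
logp = (normal_logpdf(X[0], 5, 2)
        + lognormal_logpdf(X[1], 1, 0.3)
        + exponential_logpdf(X[2], 4))
print(logp)  # roughly -23.2054
```

Almost all of the total comes from the second feature: 0.4 lies far in the lower tail of that log normal distribution, which a single joint number hides.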
Fitting¶
We may wish to fit the distribution to new data, either overriding the previous parameters completely or moving the parameters to match the dataset more closely through inertia. Distributions are updated using maximum likelihood estimates (MLE). Kernel densities will either discard previous points or downweight them if inertia is used.
d = NormalDistribution(5, 2)
d.fit([1, 5, 7, 3, 2, 4, 3, 5, 7, 8, 2, 4, 6, 7, 2, 4, 5, 1, 3, 2, 1])
d
{
"frozen" :false,
"class" :"Distribution",
"parameters" :[
3.9047619047619047,
2.13596776114341
],
"name" :"NormalDistribution"
}
Training can be done on weighted samples by passing an array of weights in along with the data for any of the training functions, like the following:
d = NormalDistribution(5, 2)
d.fit([1, 5, 7, 3, 2, 4], weights=[0.5, 0.75, 1, 1.25, 1.8, 0.33])
d
{
"frozen" :false,
"class" :"Distribution",
"parameters" :[
3.538188277087034,
1.954149818564894
],
"name" :"NormalDistribution"
}
Training can also be done with inertia, where the updated value will be some percentage of the old value and some percentage of the new value, used like d.fit([5, 7, 8], inertia=0.5) to indicate a 50-50 split between old and new values.
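The weighted update and the inertia blend can be written out by hand. The sketch below assumes that the weighted maximum likelihood estimate of a normal distribution is the weighted mean and the square root of the weighted variance, and that inertia blends linearly between the old parameter and the new estimate; `weighted_normal_mle` and `blend` are illustrative helpers, not pomegranate functions.

```python
import math

def weighted_normal_mle(xs, weights):
    """Weighted maximum likelihood estimate of a normal's mean and std."""
    w_sum = sum(weights)
    mu = sum(w * x for w, x in zip(weights, xs)) / w_sum
    var = sum(w * (x - mu) ** 2 for w, x in zip(weights, xs)) / w_sum
    return mu, math.sqrt(var)

def blend(old_param, new_param, inertia):
    """inertia=0 keeps only the new estimate, inertia=1 keeps the old one."""
    return old_param * inertia + new_param * (1 - inertia)

# Same data and weights as the weighted fitting example.
mu, sigma = weighted_normal_mle([1, 5, 7, 3, 2, 4],
                                [0.5, 0.75, 1, 1.25, 1.8, 0.33])
print(mu, sigma)  # roughly 3.5382 and 1.9541

# Start from a mean of 5 and move halfway toward the estimate from [5, 7, 8].
new_mu, _ = weighted_normal_mle([5, 7, 8], [1, 1, 1])
blended_mu = blend(5.0, new_mu, inertia=0.5)
```

Here `new_mu` is 20/3, so the blended mean lands halfway between 5 and 20/3.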
API Reference¶

class pomegranate.distributions.Distribution¶
A probability distribution.
Represents a probability distribution over the defined support. This is the base class which must be subclassed to specific probability distributions. All distributions have the below methods exposed.
Parameters: Varies on distribution.
Attributes
name (str) The name of the type of distribution.
summaries (list) Sufficient statistics to store the update.
frozen (bool) Whether or not the distribution will be updated during training.
d (int) The dimensionality of the data. Univariate distributions are all 1, while multivariate distributions are > 1.
clear_summaries()¶
Clear the summary statistics stored in the object.
Parameters: None
Returns: None

copy()¶
Return a deep copy of this distribution object.
This object will not be tied to any other distribution or connected in any form.
Returns: distribution : Distribution
A copy of the distribution with the same parameters.

from_json()¶
Read in a serialized distribution and return the appropriate object.
Parameters: s : str
A JSON formatted string containing the file.
Returns: model : object
A properly initialized and baked model.

from_samples()¶
Fit a distribution to some data without pre-specifying it.

from_summaries()¶
Fit the distribution to the stored sufficient statistics.
Parameters: inertia : double, optional
The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.
Returns: None

log_probability()¶
Return the log probability of the given X under this distribution.
Parameters: X : double
The X to calculate the log probability of (overridden for DiscreteDistributions)
Returns: logp : double
The log probability of that point under the distribution.

marginal()¶
Return the marginal of the distribution.
Parameters: *args : optional
Arguments to pass in to specific distributions
**kwargs : optional
Keyword arguments to pass in to specific distributions
Returns: distribution : Distribution
The marginal distribution. If this is a multivariate distribution then this method is filled in. Otherwise returns self.

plot()¶
Plot the distribution by sampling from it.
This function will plot a histogram of samples drawn from a distribution on the current open figure.
Parameters: n : int, optional
The number of samples to draw from the distribution. Default is 1000.
**kwargs : arguments, optional
Arguments to pass to matplotlib’s histogram function.
Returns: None

summarize()¶
Summarize a batch of data into sufficient statistics for a later update.
Parameters: items : array-like, shape (n_samples, n_dimensions)
This is the data to train on. Each row is a sample, and each column is a dimension to train on. For univariate distributions an array is used, while for multivariate distributions a 2d matrix is used.
weights : array-like, shape (n_samples,), optional
The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
Returns: None

to_json()¶
Serialize the distribution to a JSON.
Parameters: separators : tuple, optional
The two separators to pass to the json.dumps function for formatting. Default is (',', ' : ').
indent : int, optional
The indentation to use at each level. Passed to json.dumps for formatting. Default is 4.
Returns: json : str
A properly formatted JSON object.

General Mixture Models¶
General Mixture Models (GMMs) are unsupervised probabilistic models composed of multiple distributions (commonly referred to as components) and corresponding weights. This allows you to model more complex distributions corresponding to a single underlying phenomenon. For a full tutorial on what a mixture model is and how to use them, see the above tutorial.
Initialization¶
General Mixture Models can be initialized in two ways depending on whether or not you know the initial parameters of the model: (1) passing in a list of pre-initialized distributions, or (2) running the from_samples class method on data. In the first case, the distributions can either form a prespecified model that is ready to be used for prediction, or serve as the initialization for expectation-maximization. In the second case, k-means is used to initialize the distributions. The distributions passed for each component don’t have to be the same type, and if an IndependentComponentsDistribution object is passed in, then the dimensions don’t need to be modeled by the same distribution.
Here is an example of a traditional multivariate Gaussian mixture where we pass in pre-initialized distributions. We can also pass in the weight of each component, which serves as the prior probability of a sample belonging to that component when doing predictions.
from pomegranate import *
d1 = MultivariateGaussianDistribution([1, 6, 3], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
d2 = MultivariateGaussianDistribution([2, 8, 4], [[1, 0, 0], [0, 1, 0], [0, 0, 2]])
d3 = MultivariateGaussianDistribution([0, 4, 8], [[2, 0, 0], [0, 3, 0], [0, 0, 1]])
model = GeneralMixtureModel([d1, d2, d3], weights=[0.25, 0.60, 0.15])
Alternatively, if we want to model each dimension differently, then we can replace the multivariate Gaussian distributions with IndependentComponentsDistribution objects.
from pomegranate import *
d1 = IndependentComponentsDistribution([NormalDistribution(5, 2), ExponentialDistribution(1), LogNormalDistribution(0.4, 0.1)])
d2 = IndependentComponentsDistribution([NormalDistribution(3, 1), ExponentialDistribution(2), LogNormalDistribution(0.8, 0.2)])
model = GeneralMixtureModel([d1, d2], weights=[0.66, 0.34])
If we do not know the parameters of our distributions beforehand and want to learn them entirely from data, then we can use the from_samples class method. This method will run k-means to initialize the components, using the returned clusters to initialize all parameters of the distributions, i.e. both mean and covariances for multivariate Gaussian distributions. Afterwards, expectation-maximization is used to refine the parameters of the model, iterating until convergence.
from pomegranate import *
model = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, n_components=3, X=X)
If we want to model each dimension using a different distribution, then we can pass in a list of callables and they will be initialized using k-means as well.
from pomegranate import *
model = GeneralMixtureModel.from_samples([NormalDistribution, ExponentialDistribution, LogNormalDistribution], n_components=5, X=X)
Probability¶
The probability of a point is the sum of its probability under each of the components, each multiplied by the weight of that component: \(P = \sum\limits_{i \in M} P(D|M_{i})P(M_{i})\). The probability method returns the probability of each sample under the entire mixture, and the log_probability method returns the log of that value.
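As a worked instance of this sum, the sketch below evaluates the log probability of the point 5 under a mixture of Normal(5, 2) and Normal(1, 1). It assumes equal component weights of 0.5 each and performs the sum in log space with a log-sum-exp for numerical stability; the helper functions are written for this example and are not pomegranate's implementation.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal with mean mu and standard deviation sigma."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def logsumexp(values):
    """Stable log(sum(exp(v))) that avoids underflow for very negative v."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_log_probability(x, components, weights):
    """log of sum_i P(x | M_i) P(M_i) for univariate normal components."""
    return logsumexp([math.log(w) + normal_logpdf(x, mu, sigma)
                      for w, (mu, sigma) in zip(weights, components)])

components = [(5, 2), (1, 1)]   # (mean, standard deviation) pairs
weights = [0.5, 0.5]            # assumed equal prior weights

logp = mixture_log_probability(5, components, weights)
print(logp)  # roughly -2.3046
```

Working in log space matters once a point is far from a component, since the raw density can underflow to zero long before its logarithm becomes unrepresentable.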
Prediction¶
The common prediction tasks involve predicting which component a new point falls under. This is done using Bayes’ rule \(P(M|D) = \frac{P(D|M)P(M)}{P(D)}\) to determine the posterior probability \(P(M|D)\) as opposed to simply the likelihood \(P(D|M)\). Bayes’ rule indicates that it isn’t simply the likelihood function which makes this prediction but the likelihood function multiplied by the probability that that distribution generated the sample. For example, if you have a distribution which has 100x as many samples fall under it, you would naively think that there is a ~99% chance that any random point would be drawn from it. Your belief would then be updated based on how well the point fit each distribution, but the proportion of points generated by each distribution is important as well.
We can get the component label assignments using model.predict(data), which will return an array of indexes corresponding to the maximally likely component. If what we want is the full matrix of \(P(M|D)\), then we can use model.predict_proba(data), which will return a matrix with each row being a sample, each column being a component, and each cell being the probability that that component generated that sample. If we want log probabilities instead we can use model.predict_log_proba(data).
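The posterior calculation can be sketched for a two-component univariate mixture of Normal(5, 2) and Normal(1, 1). Equal prior weights are an assumption of this example, and `posterior` and `predict` are illustrative helpers rather than pomegranate functions.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal with mean mu and standard deviation sigma."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def posterior(x, components, weights):
    """P(M_i | x) for each component, via Bayes' rule in log space."""
    joint = [math.log(w) + normal_logpdf(x, mu, sigma)
             for w, (mu, sigma) in zip(weights, components)]
    m = max(joint)
    log_norm = m + math.log(sum(math.exp(j - m) for j in joint))  # log P(x)
    return [math.exp(j - log_norm) for j in joint]

def predict(x, components, weights):
    """Index of the component with the highest posterior probability."""
    probs = posterior(x, components, weights)
    return max(range(len(probs)), key=probs.__getitem__)

components = [(5, 2), (1, 1)]   # (mean, standard deviation) pairs
weights = [0.5, 0.5]            # assumed equal prior weights

p5 = posterior(5, components, weights)   # roughly [0.99933, 0.00067]
p1 = posterior(1, components, weights)   # roughly [0.06338, 0.93662]
labels = [predict(x, components, weights) for x in (5, 7, 1)]
```

Even at x = 1, which sits exactly on the mean of the second component, the first component keeps about 6% of the posterior mass because its larger standard deviation spreads density that far.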
Fitting¶
Training GMMs faces the classic chicken-and-egg problem that most unsupervised learning algorithms face. If we knew which component a sample belonged to, we could use maximum likelihood estimates to update the component. And if we knew the parameters of the components we could predict which sample belonged to which component. This problem is solved using expectation-maximization, which iterates between the two until convergence. In essence, an initialization point is chosen which usually is not a very good start, but through successive iteration steps, the parameters converge to a good ending.
These models are fit using model.fit(data). A maximum number of iterations can be specified as well as a stopping threshold for the improvement ratio. See the API reference for full documentation.
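The alternation between the two steps can be sketched for a univariate Gaussian mixture using only the standard library. This is a minimal illustration of expectation-maximization under stated assumptions, not pomegranate's implementation: the function name `em_fit`, the variance floor of 1e-6, and the starting parameters are all choices made for this example.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal with mean mu and standard deviation sigma."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def em_fit(data, components, weights, stop_threshold=0.1, max_iterations=100):
    """Run EM on (mean, std) components; returns the log probability history."""
    history = []
    for _ in range(max_iterations):
        # E-step: posterior responsibility of every component for every point.
        responsibilities, total = [], 0.0
        for x in data:
            joint = [math.log(w) + normal_logpdf(x, mu, sigma)
                     for w, (mu, sigma) in zip(weights, components)]
            norm = logsumexp(joint)
            total += norm
            responsibilities.append([math.exp(j - norm) for j in joint])
        history.append(total)
        # Stop once the improvement in log probability is below the threshold.
        if len(history) > 1 and history[-1] - history[-2] < stop_threshold:
            break
        # M-step: responsibility-weighted maximum likelihood estimates.
        new_components, new_weights = [], []
        for k in range(len(components)):
            r = [row[k] for row in responsibilities]
            r_sum = sum(r)
            mu = sum(ri * x for ri, x in zip(r, data)) / r_sum
            var = sum(ri * (x - mu) ** 2 for ri, x in zip(r, data)) / r_sum
            new_components.append((mu, math.sqrt(max(var, 1e-6))))
            new_weights.append(r_sum / len(data))
        components, weights = new_components, new_weights
    return components, weights, history

components, weights, history = em_fit(
    [1, 5, 7, 8, 2], [(5, 2), (1, 1)], [0.5, 0.5], stop_threshold=1e-6)
```

On this small dataset the two means settle near the centers of the groups {5, 7, 8} and {1, 2}, and the recorded log probability never decreases from one iteration to the next, which is the property that makes a stopping threshold on improvement meaningful.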
API Reference¶

class pomegranate.gmm.GeneralMixtureModel¶
A General Mixture Model.
This mixture model can be a mixture of any distribution as long as they are all of the same dimensionality. Any object can serve as a distribution as long as it has fit(X, weights), log_probability(X), and summarize(X, weights)/from_summaries() methods if out of core training is desired.
Parameters: distributions : array-like, shape (n_components,) or callable
The components of the model. If array, corresponds to the initial distributions of the components. If callable, must also pass in the number of components and k-means++ will be used to initialize them.
weights : array-like, optional, shape (n_components,)
The prior probabilities corresponding to each component. Does not need to sum to one, but will be normalized to sum to one internally. Defaults to None.
Examples
>>> from pomegranate import *
>>> clf = GeneralMixtureModel([
...     NormalDistribution(5, 2),
...     NormalDistribution(1, 1)])
>>> clf.log_probability(5)
-2.304562194038089
>>> clf.predict_proba([[5], [7], [1]])
array([[ 0.99932952, 0.00067048],
       [ 0.99999995, 0.00000005],
       [ 0.06337894, 0.93662106]])
>>> clf.fit([[1], [5], [7], [8], [2]])
>>> clf.predict_proba([[5], [7], [1]])
array([[ 1. , 0. ],
       [ 1. , 0. ],
       [ 0.00004383, 0.99995617]])
>>> clf.distributions
array([ {
"frozen" :false,
"class" :"Distribution",
"parameters" :[
6.6571359101390755,
1.2639830514274502
],
"name" :"NormalDistribution"
},
{
"frozen" :false,
"class" :"Distribution",
"parameters" :[
1.498707696758334,
0.4999983303277837
],
"name" :"NormalDistribution"
}], dtype=object)
Attributes
distributions (array-like, shape (n_components,)) The component distribution objects.
weights (array-like, shape (n_components,)) The learned prior weight of each object.
clear_summaries()¶
Remove the stored sufficient statistics.
Parameters: None
Returns: None

copy()¶
Return a deep copy of this distribution object.
This object will not be tied to any other distribution or connected in any form.
Parameters: None
Returns: distribution : Distribution
A copy of the distribution with the same parameters.

fit()¶
Fit the model to new data using EM.
This method fits the components of the model to new data using the EM method. It will iterate until either max iterations has been reached, or the stop threshold has been passed.
This is a sklearn wrapper for the train method.
Parameters: X : array-like, shape (n_samples, n_dimensions)
This is the data to train on. Each row is a sample, and each column is a dimension to train on.
weights : array-like, shape (n_samples,), optional
The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
inertia : double, optional
The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.
stop_threshold : double, optional, positive
The threshold at which EM will terminate for the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Default is 0.1.
max_iterations : int, optional, positive
The maximum number of iterations to run EM for. If this limit is hit then it will terminate training, regardless of how well the model is improving per iteration. Default is 1e8.
pseudocount : double, optional, positive
A pseudocount to add to the emission of each distribution. This effectively smoothes the states to prevent 0. probability symbols if they don’t happen to occur in the data. Only affects mixture models defined over discrete distributions. Default is 0.
verbose : bool, optional
Whether or not to print out improvement information over iterations. Default is False.
Returns: improvement : double
The total improvement in log probability P(D|M)

freeze()¶
Freeze the distribution, preventing updates from occurring.

from_samples()¶
Create a mixture model directly from the given dataset.
First, k-means will be run using the given initializations, in order to define initial clusters for the points. These clusters are used to initialize the distributions used. Then, EM is run to refine the parameters of these distributions.
A homogeneous mixture can be defined by passing in a single distribution callable as the first parameter and specifying the number of components, while a heterogeneous mixture can be defined by passing in a list of callables of the appropriate type.
Parameters: distributions : array-like, shape (n_components,) or callable
The components of the model. If array, corresponds to the initial distributions of the components. If callable, must also pass in the number of components and k-means++ will be used to initialize them.
n_components : int
If a callable is passed into distributions then this is the number of components to initialize using the k-means++ algorithm.
X : array-like, shape (n_samples, n_dimensions)
This is the data to train on. Each row is a sample, and each column is a dimension to train on.
weights : array-like, shape (n_samples,), optional
The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
n_init : int, optional
The number of initializations of k-means to do before choosing the best. Default is 1.
init : str, optional
The initialization algorithm to use for the initial k-means clustering. Must be one of ‘first-k’, ‘random’, ‘kmeans++’, or ‘kmeans||’. Default is ‘kmeans++’.
max_kmeans_iterations : int, optional
The maximum number of iterations to run k-means for in the initialization step. Default is 1.
inertia : double, optional
The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.
stop_threshold : double, optional, positive
The threshold at which EM will terminate for the improvement of the model. If the model does not improve its fit of the data by a log probability of 0.1 then terminate. Default is 0.1.
max_iterations : int, optional, positive
The maximum number of iterations to run EM for. If this limit is hit then it will terminate training, regardless of how well the model is improving per iteration. Default is 1e8.
pseudocount : double, optional, positive
A pseudocount to add to the emission of each distribution. This effectively smoothes the states to prevent 0. probability symbols if they don’t happen to occur in the data. Only affects mixture models defined over discrete distributions. Default is 0.
verbose : bool, optional
Whether or not to print out improvement information over iterations. Default is False.

from_summaries()¶
Fit the model to the collected sufficient statistics.
Fit the parameters of the model to the sufficient statistics gathered during the summarize calls. This should return an exact update.
Parameters: inertia : double, optional
The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.
pseudocount : double, optional
A pseudocount to add to the emission of each distribution. This effectively smoothes the states to prevent 0. probability symbols if they don’t happen to occur in the data. If discrete data, will smooth both the prior probabilities of each component and the emissions of each component. Otherwise, will only smooth the prior probabilities of each component. Default is 0.
Returns: None

log_probability
()¶ Calculate the log probability of a point under the distribution.
The probability of a point is the sum of the probabilities of each distribution multiplied by the weights. Thus, the log probability is the log of the sum, over components, of the exponentiated component log probability plus log prior.
This is the Python interface.
Parameters: X : numpy.ndarray, shape=(n, d) or (n, m, d)
The samples to calculate the log probability of. Each row is a sample and each column is a dimension. If emissions are HMMs then shape is (n, m, d) where m is variable length for each observation, and X becomes an array of n (m, d)-shaped arrays.
Returns: log_probability : double
The log probability of the point under the distribution.
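As an illustration of the calculation described above, the following sketch evaluates the log probability of a point under a two-component mixture of univariate normals using a log-sum-exp. All names here are hypothetical and the components are assumed to be normal distributions parameterized by mean and standard deviation; this is not the library's own code.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a univariate normal with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    if top == float("-inf"):
        return top
    return top + math.log(sum(math.exp(v - top) for v in values))

def mixture_log_probability(x, components, weights):
    """log sum_m w_m * p_m(x), computed entirely in log space."""
    terms = [math.log(w) + normal_logpdf(x, mu, sigma)
             for (mu, sigma), w in zip(components, weights)]
    return logsumexp(terms)

components = [(0.0, 1.0), (5.0, 2.0)]   # (mean, standard deviation) pairs
weights = [0.3, 0.7]                    # prior probability of each component
logp = mixture_log_probability(1.0, components, weights)

# The same quantity computed directly in probability space, for comparison.
direct = math.log(sum(w * math.exp(normal_logpdf(1.0, mu, sigma))
                      for (mu, sigma), w in zip(components, weights)))
print(logp, direct)
```

Working in log space matters mostly for points far from every component, where the direct product of weight and density would underflow.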

predict
()¶ Predict the most likely component which generated each sample.
Calculate the posterior P(M|D) for each sample and return the index of the component most likely to fit it. This corresponds to a simple argmax over the responsibility matrix.
This is a sklearn wrapper for the maximum_a_posteriori method.
Parameters: X : array-like, shape (n_samples, n_dimensions)
The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
Returns: y : array-like, shape (n_samples,)
The predicted component which fits the sample the best.

predict_log_proba
()¶ Calculate the posterior log P(M|D) for data.
Calculate the log probability of each item having been generated from each component in the model. This returns normalized log probabilities such that the exponentiated values in each row sum to 1.
This is a sklearn wrapper for the original posterior function.
Parameters: X : array-like, shape (n_samples, n_dimensions)
The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
Returns: y : array-like, shape (n_samples, n_components)
The normalized log probability log P(M|D) for each sample. This is the log probability that the sample was generated from each component.

predict_proba
()¶ Calculate the posterior P(M|D) for data.
Calculate the probability of each item having been generated from each component in the model. This returns normalized probabilities such that each row should sum to 1.
Since calculating the log probability is much faster, this is just a wrapper which exponentiates the log probability matrix.
Parameters: X : array-like, shape (n_samples, n_dimensions)
The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
Returns: probability : array-like, shape (n_samples, n_components)
The normalized probability P(M|D) for each sample. This is the probability that the sample was generated from each component.

probability
()¶ Return the probability of the given symbol under this distribution.
Parameters: symbol : object
The symbol to calculate the probability of
Returns: probability : double
The probability of that point under the distribution.

sample
()¶ Generate a sample from the model.
First, randomly select a component weighted by the prior probability. Then, use the sample method from that component to generate a sample.
Parameters: n : int, optional
The number of samples to generate. Defaults to 1.
Returns: sample : array-like or object
A randomly generated sample from the model of the type modelled by the emissions. An integer if using most distributions, or an array if using multivariate ones, or a string for most discrete distributions. If n=1 return an object, if n>1 return an array of the samples.
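The two-step procedure (pick a component by its prior, then sample from it) can be sketched as follows. `sample_mixture` is a hypothetical helper that assumes normal components given as (mean, standard deviation) pairs; the `seed` argument is added here only to make the illustration reproducible.

```python
import random

def sample_mixture(components, weights, n=1, seed=None):
    """Draw n samples from a mixture of univariate normal components."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Step 1: choose a component according to the prior weights.
        mu, sigma = rng.choices(components, weights=weights, k=1)[0]
        # Step 2: draw from the chosen component.
        samples.append(rng.gauss(mu, sigma))
    # Mirror the documented behaviour: a single object for n=1, a list otherwise.
    return samples[0] if n == 1 else samples

components = [(-10.0, 1.0), (10.0, 1.0)]
weights = [0.2, 0.8]

single = sample_mixture(components, weights, seed=0)
many = sample_mixture(components, weights, n=1000, seed=0)
fraction_from_second = sum(1 for s in many if s > 0) / len(many)
print(single, fraction_from_second)
```

Because the two components are far apart, the fraction of positive samples should land close to the second component's weight of 0.8.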

summarize
()¶ Summarize a batch of data and store sufficient statistics.
This will run the expectation step of EM and store sufficient statistics in the appropriate distribution objects. The summarization can be thought of as a chunk of the E step, and the from_summaries method as the M step.
Parameters: X : array-like, shape (n_samples, n_dimensions)
This is the data to train on. Each row is a sample, and each column is a dimension to train on.
weights : array-like, shape (n_samples,), optional
The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
Returns: logp : double
The log probability of the data given the current model. This is used to speed up EM.

thaw
()¶ Thaw the distribution, re-allowing updates to occur.

Bayes Classifiers and Naive Bayes¶
Bayes classifiers are simple probabilistic classification models based on Bayes’ theorem. See the above tutorial for a full primer on how they work, and what the distinction between a naive Bayes classifier and a Bayes classifier is. Essentially, each class is modeled by a probability distribution and classifications are made according to which distribution fits the data best. They are a supervised version of general mixture models, in that the `predict`, `predict_proba`, and `predict_log_proba` methods return the same values for the same underlying distributions, but instead of using expectation-maximization to fit to new data they can use the provided labels directly.
Initialization¶
Bayes classifiers and naive Bayes can both be initialized in one of two ways, depending on whether you know the parameters of the model beforehand: (1) passing in a list of pre-initialized distributions to the model, or (2) using the `from_samples` class method to initialize the model directly from data. For naive Bayes models on multivariate data, the pre-initialized distributions must be a list of `IndependentComponentsDistribution` objects since each dimension is modeled independently from the others. For Bayes classifiers on multivariate data a list of any type of multivariate distribution can be provided. For univariate data the two models produce identical results, and can be passed a list of univariate distributions. For example:
from pomegranate import *
d1 = IndependentComponentsDistribution([NormalDistribution(5, 2), NormalDistribution(6, 1), NormalDistribution(9, 1)])
d2 = IndependentComponentsDistribution([NormalDistribution(2, 1), NormalDistribution(8, 1), NormalDistribution(5, 1)])
d3 = IndependentComponentsDistribution([NormalDistribution(3, 1), NormalDistribution(5, 3), NormalDistribution(4, 1)])
model = NaiveBayes([d1, d2, d3])
would create a three-class naive Bayes classifier that models data with three dimensions. Alternatively, we can initialize a Bayes classifier in the following manner:
from pomegranate import *
# NormalDistribution takes a standard deviation, so the diagonal below holds the squared values (variances).
d1 = MultivariateGaussianDistribution([5, 6, 9], [[4, 0, 0], [0, 1, 0], [0, 0, 1]])
d2 = MultivariateGaussianDistribution([2, 8, 5], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
d3 = MultivariateGaussianDistribution([3, 5, 4], [[1, 0, 0], [0, 9, 0], [0, 0, 1]])
model = BayesClassifier([d1, d2, d3])
The two examples above functionally create the same model, as the Bayes classifier uses multivariate Gaussian distributions with the same means and a diagonal covariance matrix containing only the variances (the squares of the standard deviations passed to NormalDistribution). However, if we were to fit these models to data later on, the Bayes classifier would learn a full covariance matrix while the naive Bayes would only learn the diagonal.
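The equivalence between independent normal components and a multivariate Gaussian with diagonal covariance can be checked numerically. The sketch below uses plain Python with hypothetical helper names, and assumes each normal is parameterized by a mean and a standard deviation, so the matching covariance diagonal holds the squared standard deviations.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a univariate normal (sigma is a standard deviation)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def independent_logpdf(x, mus, sigmas):
    """Naive Bayes style: sum the per-dimension log densities."""
    return sum(normal_logpdf(xi, m, s) for xi, m, s in zip(x, mus, sigmas))

def diagonal_gaussian_logpdf(x, mus, variances):
    """Multivariate Gaussian log density for a diagonal covariance matrix."""
    d = len(x)
    log_det = sum(math.log(v) for v in variances)
    quad = sum((xi - m) ** 2 / v for xi, m, v in zip(x, mus, variances))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + quad)

x = [4.0, 6.5, 8.0]
mus = [5.0, 6.0, 9.0]
sigmas = [2.0, 1.0, 1.0]
variances = [s ** 2 for s in sigmas]   # diagonal of the covariance matrix

independent = independent_logpdf(x, mus, sigmas)
joint = diagonal_gaussian_logpdf(x, mus, variances)
print(independent, joint)
```

The two values agree up to floating-point error, which is why the two classifiers start out identical and only diverge once the Bayes classifier learns off-diagonal covariance terms.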
If we instead wish to initialize our model directly from data, we use the `from_samples` class method.
from pomegranate import *
import numpy
X = numpy.load('data.npy')
y = numpy.load('labels.npy')
model = NaiveBayes.from_samples(NormalDistribution, X, y)
This would create a naive Bayes model directly from the data with normal distributions modeling each of the dimensions, and a number of components equal to the number of classes in `y`. Alternatively, if we wanted to create a model with different distributions for each dimension we can do the following:
model = NaiveBayes.from_samples([NormalDistribution, ExponentialDistribution], X, y)
This assumes that your data is two dimensional and that you want to model the first dimension as a normal distribution and the second dimension as an exponential distribution.
We can do pretty much the same thing with Bayes classifiers, except passing in a more complex model.
model = BayesClassifier.from_samples(MultivariateGaussianDistribution, X, y)
One can use much more complex models than just a multivariate Gaussian with a full covariance matrix when using a Bayes classifier. Specifically, you can also have your distributions be general mixture models, hidden Markov models, and Bayesian networks. For example:
model = BayesClassifier.from_samples(BayesianNetwork, X, y)
This currently requires that the data be discrete valued, and the structure learning task may take too long if not configured appropriately. However, it is possible. Currently, one cannot simply pass in GeneralMixtureModel or HiddenMarkovModel, despite them having a `from_samples` method, because there is a great deal of flexibility in terms of the structure or emission distributions. The easiest way to set up one of these more complex models is to build each of the components separately and then feed them into the Bayes classifier using the first initialization method.
d1 = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, n_components=5, X=X[y==0])
d2 = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, n_components=5, X=X[y==1])
model = BayesClassifier([d1, d2])
Prediction¶
Bayes classifiers and naive Bayes support the same three prediction methods that the other models support: `predict`, `predict_proba`, and `predict_log_proba`. These methods return the most likely class given the data (argmax_m P(M|D)), the probability of each class given the data (P(M|D)), and the log probability of each class given the data (log P(M|D)). It is best to always pass in a 2D matrix even for univariate data, where it would have a shape of (n, 1).
The `predict` method takes in samples and returns the most likely class given the data.
from pomegranate import *
import numpy as np
model = NaiveBayes([NormalDistribution(5, 2), UniformDistribution(0, 10), ExponentialDistribution(1.0)])
model.predict(np.array([[0], [1], [2], [3], [4]]))
[2, 2, 2, 0, 0]
Calling `predict_proba` on five samples for a naive Bayes with univariate components would look like the following.
from pomegranate import *
import numpy as np
model = NaiveBayes([NormalDistribution(5, 2), UniformDistribution(0, 10), ExponentialDistribution(1)])
model.predict_proba(np.array([[0], [1], [2], [3], [4]]))
[[ 0.00790443 0.09019051 0.90190506]
 [ 0.05455011 0.20207126 0.74337863]
 [ 0.21579499 0.33322883 0.45097618]
 [ 0.44681566 0.36931382 0.18387052]
 [ 0.59804205 0.33973357 0.06222437]]
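The posterior values above follow directly from Bayes' rule. The sketch below recomputes them in plain Python for the same three components, assuming uniform class priors and hand-written density functions (all helper names are hypothetical, and the work is done in probability space rather than log space for readability).

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def uniform_pdf(x, low, high):
    return 1.0 / (high - low) if low <= x <= high else 0.0

def exponential_pdf(x, rate):
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def posterior(x, priors=(1 / 3, 1 / 3, 1 / 3)):
    """P(M|D) for the three classes: prior times likelihood, then normalize."""
    likelihoods = [normal_pdf(x, 5, 2), uniform_pdf(x, 0, 10), exponential_pdf(x, 1.0)]
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# predict is the argmax of each posterior row
predictions = []
for x in range(5):
    row = posterior(x)
    predictions.append(max(range(len(row)), key=lambda k: row[k]))
    print(x, [round(p, 8) for p in row])

print(predictions)  # [2, 2, 2, 0, 0]
```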
Multivariate models work the same way.
from pomegranate import *
import numpy as np
d1 = MultivariateGaussianDistribution([5, 5], [[1, 0], [0, 1]])
d2 = IndependentComponentsDistribution([NormalDistribution(5, 2), NormalDistribution(5, 2)])
model = BayesClassifier([d1, d2])
model.predict_proba(np.array([[0, 4],
                              [1, 3],
                              [2, 2],
                              [3, 1],
                              [4, 0]]))
array([[ 0.00023312, 0.99976688],
       [ 0.00220745, 0.99779255],
       [ 0.00466169, 0.99533831],
       [ 0.00220745, 0.99779255],
       [ 0.00023312, 0.99976688]])
`predict_log_proba` works the same way, returning the log probabilities instead of the probabilities.
Fitting¶
Both naive Bayes and Bayes classifiers also have a `fit` method that updates the parameters of the model based on new data. The major difference between these methods and the others presented is that these are supervised methods and so need to be passed labels in addition to data. This change also propagates to the `summarize` method, where labels are provided as well.
from pomegranate import *
import numpy as np
d1 = MultivariateGaussianDistribution([5, 5], [[1, 0], [0, 1]])
d2 = IndependentComponentsDistribution([NormalDistribution(5, 2), NormalDistribution(5, 2)])
model = BayesClassifier([d1, d2])
X = np.array([[6.0, 5.0],
              [3.5, 4.0],
              [7.5, 1.5],
              [7.0, 7.0]])
y = np.array([0, 0, 1, 1])
model.fit(X, y)
As we can see, there are four samples, with the first two samples labeled as class 0 and the last two samples labeled as class 1. Keep in mind that the training samples must match the input requirements for the models used: with a univariate distribution each sample must contain one item, and with a bivariate distribution, two. For hidden Markov models, the sample can be a list of observations of any length. An example using hidden Markov models would be the following.
d1 = HiddenMarkovModel...
d2 = HiddenMarkovModel...
d3 = HiddenMarkovModel...
model = BayesClassifier([d1, d2, d3])
# dtype=object keeps the variable-length sequences as separate lists
X = np.array([list('HHHHHTHTHTTTTH'),
              list('HHTHHTTHHHHHTH'),
              list('TH'),
              list('HHHHT')], dtype=object)
y = np.array([2, 2, 1, 0])
model.fit(X, y)
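To show what a supervised fit amounts to, here is a minimal sketch for univariate data: the samples are grouped by label, each class gets its own maximum likelihood normal distribution, and the class priors come from the label counts. `fit_gaussian_classes` is a hypothetical helper in plain Python, not the library's implementation, and it ignores sample weights, inertia, and pseudocounts.

```python
import math
from collections import defaultdict

def fit_gaussian_classes(X, y):
    """Return {label: (prior, mean, std)} using maximum likelihood per class."""
    groups = defaultdict(list)
    for x, label in zip(X, y):
        groups[label].append(x)

    params = {}
    for label, values in groups.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        prior = len(values) / len(X)          # class frequency in the labels
        params[label] = (prior, mean, math.sqrt(variance))
    return params

X = [0.0, 2.0, 1.0, 5.0, 7.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_classes(X, y)
for label in sorted(params):
    print(label, params[label])
```

No iteration is needed because the labels already say which component each sample belongs to, which is exactly what expectation-maximization has to estimate in the unsupervised case.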
API Reference¶

class pomegranate.NaiveBayes.NaiveBayes¶
A naive Bayes model, a supervised alternative to GMM.
A naive Bayes classifier treats each dimension independently of the others. It is a simpler version of the Bayes classifier, which can use any distribution with any covariance structure, including Bayesian networks and hidden Markov models.
Parameters: models : list
A list of initialized distributions.
weights : list or numpy.ndarray or None, default None
The prior probabilities of the components. If None is passed in then defaults to the uniformly distributed priors.
Examples
>>> from pomegranate import *
>>> X = [0, 2, 0, 1, 0, 5, 6, 5, 7, 6]
>>> y = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
>>> clf = NaiveBayes.from_samples(NormalDistribution, X, y)
>>> clf.predict_proba([6])
array([[0.01973451, 0.98026549]])

>>> from pomegranate import *
>>> clf = NaiveBayes([NormalDistribution(1, 2), NormalDistribution(0, 1)])
>>> clf.predict_log_proba([[0], [1], [2], [-1]])
array([[-1.1836569 , -0.36550972],
       [-0.79437677, -0.60122959],
       [-0.26751248, -1.4493653 ],
       [-1.09861229, -0.40546511]])
Attributes
models (list) The model objects, either initialized by the user or fit to data.
weights (numpy.ndarray) The prior probability of each component of the model.
clear_summaries
()¶ Remove the stored sufficient statistics.
Parameters: None
Returns: None

copy
()¶ Return a deep copy of this distribution object.
This object will not be tied to any other distribution or connected in any form.
Parameters: None
Returns: distribution : Distribution
A copy of the distribution with the same parameters.

fit
()¶ Fit the naive Bayes model to the data by passing data to its components.
Parameters: X : numpy.ndarray or list
The dataset to operate on. For most models this is a numpy array with columns corresponding to features and rows corresponding to samples. For Markov chains and HMMs this will be a list of variable length sequences.
y : numpy.ndarray or list or None, optional
Data labels for supervised training algorithms. Default is None.
weights : array-like or None, shape (n_samples,), optional
The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
n_jobs : int
The number of jobs to use to parallelize, either the number of threads or the number of processes to use. Default is 1.
inertia : double, optional
Inertia used for training the distributions.
pseudocount : double, optional
A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent zero-probability symbols if they don’t happen to occur in the data. Default is 0.
Returns: self : object
Returns the fitted model

freeze
()¶ Freeze the distribution, preventing updates from occurring.

from_samples
()¶ Create a mixture model directly from the given dataset.
First, k-means will be run using the given initializations, in order to define initial clusters for the points. These clusters are used to initialize the distributions used. Then, EM is run to refine the parameters of these distributions.
A homogeneous mixture can be defined by passing in a single distribution callable as the first parameter and specifying the number of components, while a heterogeneous mixture can be defined by passing in a list of callables of the appropriate type.
Parameters: distributions : array-like, shape (n_components,) or callable
The components of the model. If array, corresponds to the initial distributions of the components. If callable, must also pass in the number of components and k-means++ will be used to initialize them.
n_components : int
If a callable is passed into distributions then this is the number of components to initialize using the k-means++ algorithm.
X : array-like, shape (n_samples, n_dimensions)
This is the data to train on. Each row is a sample, and each column is a dimension to train on.
weights : array-like, shape (n_samples,), optional
The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
pseudocount : double, optional, positive
A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent zero-probability symbols if they don’t happen to occur in the data. Only affects mixture models defined over discrete distributions. Default is 0.
Returns: model : NaiveBayes
The fit naive Bayes model.

from_summaries
()¶ Fit the model to the collected sufficient statistics.
Fit the parameters of the model to the sufficient statistics gathered during the summarize calls. This should return an exact update.
Parameters: inertia : double, optional
The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.
pseudocount : double, optional
A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent zero-probability symbols if they don’t happen to occur in the data. If the data are discrete, this will smooth both the prior probabilities of each component and the emissions of each component. Otherwise, it will only smooth the prior probabilities of each component. Default is 0.
Returns: None

log_probability
()¶ Calculate the log probability of a point under the distribution.
The probability of a point is the sum of the probabilities of each distribution multiplied by the weights. Thus, the log probability is the log of the sum, over components, of the exponentiated component log probability plus log prior.
This is the Python interface.
Parameters: X : numpy.ndarray, shape=(n, d) or (n, m, d)
The samples to calculate the log probability of. Each row is a sample and each column is a dimension. If emissions are HMMs then shape is (n, m, d) where m is variable length for each observation, and X becomes an array of n (m, d)-shaped arrays.
Returns: log_probability : double
The log probability of the point under the distribution.

predict
()¶ Predict the most likely component which generated each sample.
Calculate the posterior P(M|D) for each sample and return the index of the component most likely to fit it. This corresponds to a simple argmax over the responsibility matrix.
This is a sklearn wrapper for the maximum_a_posteriori method.
Parameters: X : array-like, shape (n_samples, n_dimensions)
The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
Returns: y : array-like, shape (n_samples,)
The predicted component which fits the sample the best.

predict_log_proba
()¶ Calculate the posterior log P(M|D) for data.
Calculate the log probability of each item having been generated from each component in the model. This returns normalized log probabilities such that the exponentiated values in each row sum to 1.
This is a sklearn wrapper for the original posterior function.
Parameters: X : array-like, shape (n_samples, n_dimensions)
The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
Returns: y : array-like, shape (n_samples, n_components)
The normalized log probability log P(M|D) for each sample. This is the log probability that the sample was generated from each component.

predict_proba
()¶ Calculate the posterior P(M|D) for data.
Calculate the probability of each item having been generated from each component in the model. This returns normalized probabilities such that each row should sum to 1.
Since calculating the log probability is much faster, this is just a wrapper which exponentiates the log probability matrix.
Parameters: X : array-like, shape (n_samples, n_dimensions)
The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
Returns: probability : array-like, shape (n_samples, n_components)
The normalized probability P(M|D) for each sample. This is the probability that the sample was generated from each component.

probability
()¶ Return the probability of the given symbol under this distribution.
Parameters: symbol : object
The symbol to calculate the probability of
Returns: probability : double
The probability of that point under the distribution.

sample
()¶ Generate a sample from the model.
First, randomly select a component weighted by the prior probability. Then, use the sample method from that component to generate a sample.
Parameters: n : int, optional
The number of samples to generate. Defaults to 1.
Returns: sample : array-like or object
A randomly generated sample from the model of the type modelled by the emissions. An integer if using most distributions, or an array if using multivariate ones, or a string for most discrete distributions. If n=1 return an object, if n>1 return an array of the samples.

summarize
()¶ Summarize data into stored sufficient statistics for out-of-core training.
Parameters: X : array-like, shape (n_samples, variable)
Array of the samples, which can be either fixed size or variable depending on the underlying components.
y : array-like, shape (n_samples,)
Array of the known labels as integers
weights : array-like, shape (n_samples,), optional
Array of the weight of each sample, a positive float
n_jobs : int
The number of jobs to use to parallelize, either the number of threads or the number of processes to use. Default is 1.
Returns: None
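The out-of-core pattern behind `summarize` and `from_summaries` can be illustrated with a single normal distribution. The class `RunningNormal` below is a hypothetical stand-in written in plain Python; it borrows the method names from this API but is only a sketch of the idea, namely that sufficient statistics accumulated over batches give the same update as fitting on all of the data at once.

```python
import math

class RunningNormal:
    """Accumulates sufficient statistics (count, sum, sum of squares) for a normal."""

    def __init__(self):
        self.clear_summaries()

    def clear_summaries(self):
        self.n = 0.0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def summarize(self, batch):
        # Only the running totals are kept; the batch itself can be discarded.
        for x in batch:
            self.n += 1.0
            self.sum_x += x
            self.sum_x2 += x * x

    def from_summaries(self):
        # Maximum likelihood mean and standard deviation from the totals.
        mean = self.sum_x / self.n
        variance = max(self.sum_x2 / self.n - mean ** 2, 0.0)
        self.clear_summaries()
        return mean, math.sqrt(variance)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

batched = RunningNormal()
batched.summarize(data[:2])     # first chunk
batched.summarize(data[2:])     # second chunk
batched_params = batched.from_summaries()

full = RunningNormal()
full.summarize(data)            # everything at once
full_params = full.from_summaries()

print(batched_params, full_params)
```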

thaw
()¶ Thaw the distribution, re-allowing updates to occur.


class pomegranate.BayesClassifier.BayesClassifier¶
A Bayes classifier, a supervised alternative to GMM.
Parameters: models : list or constructor
Must either be a list of initialized distribution/model objects, or the constructor for a distribution object:
 Initialized : NaiveBayes([NormalDistribution(1, 2), NormalDistribution(0, 1)])
 Constructor : NaiveBayes(NormalDistribution)
weights : list or numpy.ndarray or None, default None
The prior probabilities of the components. If None is passed in then defaults to the uniformly distributed priors.
Examples
>>> from pomegranate import *
>>> clf = NaiveBayes(NormalDistribution)
>>> X = [0, 2, 0, 1, 0, 5, 6, 5, 7, 6]
>>> y = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
>>> clf.fit(X, y)
>>> clf.predict_proba([6])
array([[ 0.01973451, 0.98026549]])

>>> from pomegranate import *
>>> clf = NaiveBayes([NormalDistribution(1, 2), NormalDistribution(0, 1)])
>>> clf.predict_log_proba([[0], [1], [2], [-1]])
array([[-1.1836569 , -0.36550972],
       [-0.79437677, -0.60122959],
       [-0.26751248, -1.4493653 ],
       [-1.09861229, -0.40546511]])
Attributes
models (list) The model objects, either initialized by the user or fit to data.
weights (numpy.ndarray) The prior probability of each component of the model.
clear_summaries
()¶ Remove the stored sufficient statistics.
Parameters: None
Returns: None

copy
()¶ Return a deep copy of this distribution object.
This object will not be tied to any other distribution or connected in any form.
Parameters: None
Returns: distribution : Distribution
A copy of the distribution with the same parameters.

fit
()¶ Fit the Bayes classifier to the data by passing data to its components.
Parameters: X : numpy.ndarray or list
The dataset to operate on. For most models this is a numpy array with columns corresponding to features and rows corresponding to samples. For Markov chains and HMMs this will be a list of variable length sequences.
y : numpy.ndarray or list or None, optional
Data labels for supervised training algorithms. Default is None.
weights : array-like or None, shape (n_samples,), optional
The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
n_jobs : int
The number of jobs to use to parallelize, either the number of threads or the number of processes to use. Default is 1.
inertia : double, optional
Inertia used for training the distributions.
pseudocount : double, optional
A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent zero-probability symbols if they don’t happen to occur in the data. Default is 0.
Returns: self : object
Returns the fitted model

freeze
()¶ Freeze the distribution, preventing updates from occurring.

from_samples
()¶ Create a mixture model directly from the given dataset.
First, k-means will be run using the given initializations, in order to define initial clusters for the points. These clusters are used to initialize the distributions used. Then, EM is run to refine the parameters of these distributions.
A homogeneous mixture can be defined by passing in a single distribution callable as the first parameter and specifying the number of components, while a heterogeneous mixture can be defined by passing in a list of callables of the appropriate type.
Parameters: distributions : array-like, shape (n_components,) or callable
The components of the model. If array, corresponds to the initial distributions of the components. If callable, must also pass in the number of components and k-means++ will be used to initialize them.
n_components : int
If a callable is passed into distributions then this is the number of components to initialize using the k-means++ algorithm.
X : array-like, shape (n_samples, n_dimensions)
This is the data to train on. Each row is a sample, and each column is a dimension to train on.
weights : array-like, shape (n_samples,), optional
The initial weights of each sample in the matrix. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
pseudocount : double, optional, positive
A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent zero-probability symbols if they don’t happen to occur in the data. Only affects mixture models defined over discrete distributions. Default is 0.
Returns: model : BayesClassifier
The fit Bayes classifier model.

from_summaries
()¶ Fit the model to the collected sufficient statistics.
Fit the parameters of the model to the sufficient statistics gathered during the summarize calls. This should return an exact update.
Parameters: inertia : double, optional
The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1-inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.
pseudocount : double, optional
A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent zero-probability symbols if they don’t happen to occur in the data. If the data are discrete, this will smooth both the prior probabilities of each component and the emissions of each component. Otherwise, it will only smooth the prior probabilities of each component. Default is 0.
Returns: None

log_probability
()¶ Calculate the log probability of a point under the distribution.
The probability of a point is the sum of the probabilities of each distribution multiplied by the weights. Thus, the log probability is the log of the sum, over components, of the exponentiated component log probability plus log prior.
This is the Python interface.
Parameters: X : numpy.ndarray, shape=(n, d) or (n, m, d)
The samples to calculate the log probability of. Each row is a sample and each column is a dimension. If emissions are HMMs then shape is (n, m, d) where m is variable length for each observation, and X becomes an array of n (m, d)-shaped arrays.
Returns: log_probability : double
The log probability of the point under the distribution.

predict
()¶ Predict the most likely component which generated each sample.
Calculate the posterior P(M|D) for each sample and return the index of the component most likely to fit it. This corresponds to a simple argmax over the responsibility matrix.
This is a sklearn wrapper for the maximum_a_posteriori method.
Parameters: X : array-like, shape (n_samples, n_dimensions)
The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
Returns: y : array-like, shape (n_samples,)
The predicted component which fits the sample the best.

predict_log_proba
()¶ Calculate the posterior log P(M|D) for data.
Calculate the log probability of each item having been generated from each component in the model. This returns normalized log probabilities such that the exponentiated values in each row sum to 1.
This is a sklearn wrapper for the original posterior function.
Parameters: X : array-like, shape (n_samples, n_dimensions)
The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
Returns: y : array-like, shape (n_samples, n_components)
The normalized log probability log P(M|D) for each sample. This is the log probability that the sample was generated from each component.

predict_proba
()¶ Calculate the posterior P(M|D) for data.
Calculate the probability of each item having been generated from each component in the model. This returns normalized probabilities such that each row should sum to 1.
Since calculating the log probability is much faster, this is just a wrapper which exponentiates the log probability matrix.
Parameters: X : array-like, shape (n_samples, n_dimensions)
The samples to do the prediction on. Each sample is a row and each column corresponds to a dimension in that sample. For univariate distributions, a single array may be passed in.
Returns: probability : array-like, shape (n_samples, n_components)
The normalized probability P(M|D) for each sample. This is the probability that the sample was generated from each component.

probability
()¶ Return the probability of the given symbol under this distribution.
Parameters: symbol : object
The symbol to calculate the probability of
Returns: probability : double
The probability of that point under the distribution.

sample
()¶ Generate a sample from the model.
First, randomly select a component weighted by the prior probability. Then, use the sample method from that component to generate a sample.
Parameters: n : int, optional
The number of samples to generate. Defaults to 1.
Returns: sample : array-like or object
A randomly generated sample from the model of the type modelled by the emissions. An integer if using most distributions, or an array if using multivariate ones, or a string for most discrete distributions. If n=1 return an object, if n>1 return an array of the samples.

summarize
()¶ Summarize data into stored sufficient statistics for out-of-core training.
Parameters: X : array-like, shape (n_samples, variable)
Array of the samples, which can be either fixed size or variable depending on the underlying components.
y : array-like, shape (n_samples,)
Array of the known labels as integers
weights : array-like, shape (n_samples,), optional
Array of the weight of each sample, a positive float
n_jobs : int
The number of jobs to use to parallelize, either the number of threads or the number of processes to use. Default is 1.
Returns: None

thaw
()¶ Thaw the distribution, re-allowing updates to occur.
Markov Chains¶
Markov chains are a form of structured model over sequences. They represent the probability of each symbol in the sequence as conditional on the previous k symbols. For example, a 3rd order Markov chain would have each symbol depend on the last three symbols. A 0th order Markov chain is a naive predictor where each symbol is independent of all other symbols. Currently pomegranate only supports discrete emission Markov chains, where each symbol is a discrete symbol rather than a continuous number (like ‘A’ ‘B’ ‘C’ instead of 17.32 or 19.65).
Initialization¶
Markov chains can almost be represented by a single conditional probability table (CPT), except that the probability of the first k elements (for a kth order Markov chain) cannot be appropriately represented except by using special characters. Because of this, pomegranate takes in a series of k+1 distributions: the first k cover the first k elements of a sequence, and the last is the full conditional table used for every element after that. For example, for a second order Markov chain:
from pomegranate import *
d1 = DiscreteDistribution({'A': 0.25, 'B': 0.75})
d2 = ConditionalProbabilityTable([['A', 'A', 0.1],
['A', 'B', 0.9],
['B', 'A', 0.6],
['B', 'B', 0.4]], [d1])
d3 = ConditionalProbabilityTable([['A', 'A', 'A', 0.4],
['A', 'A', 'B', 0.6],
['A', 'B', 'A', 0.8],
['A', 'B', 'B', 0.2],
['B', 'A', 'A', 0.9],
['B', 'A', 'B', 0.1],
['B', 'B', 'A', 0.2],
['B', 'B', 'B', 0.8]], [d1, d2])
model = MarkovChain([d1, d2, d3])
Probability¶
The probability of a sequence under the Markov chain is just the probability of the first character under the first distribution, times the probability of the second character under the second distribution, and so forth. Every character from the (k+1)th onward is evaluated under the (k+1)th distribution. We can calculate the probability or log probability in the same manner as any of the other models. Given the model shown before:
>>> model.log_probability(['A', 'B', 'B', 'B'])
-3.324236340526027
>>> model.log_probability(['A', 'A', 'A', 'A'])
-5.521460917862246
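The calculation described above can be reproduced by hand. The following is a minimal sketch in plain Python that does not use pomegranate; the dictionaries and the `chain_log_probability` helper are illustrative names that simply restate the tables of the second order example.

```python
import math

# Tables of the second order example, restated as plain dictionaries.
p_first = {'A': 0.25, 'B': 0.75}                        # P(X_1)
p_second = {('A', 'A'): 0.1, ('A', 'B'): 0.9,
            ('B', 'A'): 0.6, ('B', 'B'): 0.4}           # P(X_2 | X_1)
p_rest = {('A', 'A', 'A'): 0.4, ('A', 'A', 'B'): 0.6,
          ('A', 'B', 'A'): 0.8, ('A', 'B', 'B'): 0.2,
          ('B', 'A', 'A'): 0.9, ('B', 'A', 'B'): 0.1,
          ('B', 'B', 'A'): 0.2, ('B', 'B', 'B'): 0.8}   # P(X_i | X_{i-2}, X_{i-1})


def chain_log_probability(sequence):
    """Sum the log probability of each symbol under the matching table."""
    logp = 0.0
    for i, symbol in enumerate(sequence):
        if i == 0:
            p = p_first[symbol]
        elif i == 1:
            p = p_second[(sequence[0], symbol)]
        else:
            # Every symbol from the third onward uses the last table.
            p = p_rest[(sequence[i - 2], sequence[i - 1], symbol)]
        logp += math.log(p)
    return logp


print(chain_log_probability(['A', 'B', 'B', 'B']))  # about -3.3242
print(chain_log_probability(['A', 'A', 'A', 'A']))  # about -5.5215
```

The first sequence multiplies out to 0.25 * 0.9 * 0.2 * 0.8 = 0.036, whose natural log matches the value reported by the model.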
Fitting¶
Markov chains are not very complicated to train. For each sequence the appropriate symbols are sent to the appropriate distributions, and maximum likelihood estimates are used to update the parameters of the distributions. There are no latent factors to train, so no expectation maximization or other iterative algorithm is needed.
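To make the counting concrete, here is a minimal sketch of maximum likelihood fitting for a first order chain in plain Python. It is independent of pomegranate's implementation; `fit_first_order` and the toy sequences are made up for illustration.

```python
from collections import Counter, defaultdict


def fit_first_order(sequences):
    """Return (initial, transitions) maximum likelihood estimates."""
    first_counts = Counter()
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        if not seq:
            continue
        first_counts[seq[0]] += 1
        # Count every (previous symbol, current symbol) pair.
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[prev][cur] += 1

    total = sum(first_counts.values())
    initial = {s: c / total for s, c in first_counts.items()}
    transitions = {
        prev: {cur: c / sum(counts.values()) for cur, c in counts.items()}
        for prev, counts in pair_counts.items()
    }
    return initial, transitions


initial, transitions = fit_first_order(["ABBA", "AAB", "BAB"])
print(initial)      # P(first symbol)
print(transitions)  # P(current | previous)
```

Each estimate is a count divided by a total, so a single pass over the data is enough.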
API Reference¶

class
pomegranate.MarkovChain.
MarkovChain
¶ A Markov Chain.
Implemented as a series of conditional distributions, the Markov chain models P(X_i | X_{i-1}, ..., X_{i-k}) for a kth order Markov chain. The conditional dependencies are directly on the emissions, and not on a hidden state as in a hidden Markov model.
Parameters: distributions : list, shape (k+1)
A list of the conditional distributions which make up the Markov chain. Begins with P(X_i), then P(X_i | X_{i-1}). For a kth order Markov chain you must put in k+1 distributions.
Examples
>>> from pomegranate import *
>>> d1 = DiscreteDistribution({'A': 0.25, 'B': 0.75})
>>> d2 = ConditionalProbabilityTable([['A', 'A', 0.33],
...                                   ['B', 'A', 0.67],
...                                   ['A', 'B', 0.82],
...                                   ['B', 'B', 0.18]], [d1])
>>> mc = MarkovChain([d1, d2])
>>> mc.log_probability(list('ABBAABABABAABABA'))
-8.9119890701808213
Attributes
distributions (list, shape (k+1)) The distributions which make up the chain. 
fit
()¶ Fit the model to new data using MLE.
The underlying distributions are fed in their appropriate points and weights and are updated.
Parameters: sequences : array-like, shape (n_samples, variable)
This is the data to train on. Each row is a sample which contains a sequence of variable length
weights : array-like, shape (n_samples,), optional
The initial weights of each sample. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
inertia : double, optional
The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1 - inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.
Returns: None

from_json
()¶ Read in a serialized model and return the appropriate classifier.
Parameters: s : str
A JSON formatted string containing the file.
Returns: model : object
A properly initialized and baked model.

from_samples
()¶ Learn the Markov chain from data.
Takes in the memory of the chain (k) and learns the initial distribution and probability tables associated with the proper parameters.
Parameters: X : array-like, list or numpy.array
The data to fit the structure to, as a list of sequences of variable length. Since the data will be of variable length, there is no set form.
weights : array-like, shape (n_nodes), optional
The weight of each sample as a positive double. Default is None.
k : int, optional
The number of samples back to condition on in the model. Default is 1.
Returns: model : MarkovChain
The learned markov chain model.

from_summaries
()¶ Fit the model to the collected sufficient statistics.
Fit the parameters of the model to the sufficient statistics gathered during the summarize calls. This should return an exact update.
Parameters: inertia : double, optional
The weight of the previous parameters of the model. The new parameters will roughly be old_param*inertia + new_param*(1 - inertia), so an inertia of 0 means ignore the old parameters, whereas an inertia of 1 means ignore the new parameters. Default is 0.0.
Returns: None
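The split between summarize and from_summaries, and the inertia formula quoted above, can be illustrated with a toy counting distribution. This is only a sketch under stated assumptions: `CountingDistribution` is a hypothetical stand-in written for this example, not a pomegranate class, and it covers a single discrete distribution rather than a full chain.

```python
from collections import Counter


class CountingDistribution:
    """Hypothetical discrete distribution with out-of-core style updates."""

    def __init__(self, probabilities):
        self.probabilities = dict(probabilities)
        self.counts = Counter()

    def summarize(self, batch):
        # Accumulate sufficient statistics (symbol counts) for one batch.
        self.counts.update(batch)

    def from_summaries(self, inertia=0.0):
        total = sum(self.counts.values())
        if total == 0:
            return  # nothing was summarized, keep the current parameters
        symbols = set(self.probabilities) | set(self.counts)
        for s in symbols:
            new = self.counts[s] / total
            old = self.probabilities.get(s, 0.0)
            # old_param * inertia + new_param * (1 - inertia)
            self.probabilities[s] = old * inertia + new * (1 - inertia)
        self.counts.clear()


d = CountingDistribution({'A': 0.5, 'B': 0.5})
d.summarize(['A', 'A', 'B'])   # first batch
d.summarize(['A'])             # second batch
d.from_summaries(inertia=0.5)
print(d.probabilities)
```

Because only counts are stored between calls, the batches never need to be in memory at the same time, and the result is the same as fitting on the concatenated data.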

log_probability
()¶ Calculate the log probability of the sequence under the model.
The first k elements are scored as slices of increasing size under the corresponding first k components of the model. Once size k is reached, every subsequent slice is evaluated under the final component.
Parameters: sequence : array-like
An array of observations
Returns: logp : double
The log probability of the sequence under the model.

sample
()¶ Create a random sample from the model.
Parameters: length : int or Distribution
Give either the length of the sample you want to generate, or a distribution object which will be randomly sampled for the length. Continuous distributions will have their sample rounded to the nearest integer, minimum 1.
Returns: sequence : array-like, shape = (length,)
A sequence randomly generated from the markov chain.

summarize
()¶ Summarize a batch of data and store sufficient statistics.
This will summarize the sequences into sufficient statistics stored in each distribution.
Parameters: sequences : array-like, shape (n_samples, variable)
This is the data to train on. Each row is a sample which contains a sequence of variable length
weights : array-like, shape (n_samples,), optional
The initial weights of each sample. If nothing is passed in then each sample is assumed to be the same weight. Default is None.
Returns: None

to_json
()¶ Serialize the model to a JSON.
Parameters: separators : tuple, optional
The two separators to pass to the json.dumps function for formatting. Default is (',', ' : ').
indent : int, optional
The indentation to use at each level. Passed to json.dumps for formatting. Default is 4.
Returns: json : str
A properly formatted JSON object.

Bayesian Networks¶
Bayesian networks are probabilistic models that are especially good at inference given incomplete data. Much like a hidden Markov model, they consist of a directed graphical model (though Bayesian networks must also be acyclic) and a set of probability distributions. The edges encode dependency statements between the variables, where the lack of an edge between any pair of variables indicates a conditional independence. Each node encodes a probability distribution, where root nodes encode univariate probability distributions and inner/leaf nodes encode conditional probability distributions. Bayesian networks are exceptionally flexible when doing inference, as any subset of variables can be observed, and inference done over all other variables, without needing to define these groups in advance. In fact, the set of observed variables can change from one sample to the next without needing to modify the underlying algorithm at all.
Currently, pomegranate only supports discrete Bayesian networks, meaning that the values must be categories, i.e. ‘apples’ and ‘oranges’, or 1 and 2, where 1 and 2 refer to categories, not numbers, and so 2 is not explicitly ‘bigger’ than 1.
Initialization¶
Bayesian networks can be initialized in two ways, depending on whether the underlying graphical structure is known or not: (1) the graphical structure can be built one node at a time with pre-initialized distributions set for each node, or (2) both the graphical structure and distributions can be learned directly from data. This mirrors the other models that are implemented in pomegranate. However, in those models expectation maximization is typically used to fit the parameters of the distribution, and so initialization (such as through k-means) is typically fast whereas fitting is slow. For Bayesian networks, the opposite is the case. Fitting can be done quickly by just summing counts through the data, while initialization is hard as it requires an exponential-time search through all possible DAGs to identify the optimal graph. More is discussed in the tutorials above and in the fitting section below.
Let’s take a look at initializing a Bayesian network in the first manner by quickly implementing the Monty Hall problem. The Monty Hall problem arose from the gameshow Let’s Make a Deal, where a guest had to choose which one of three doors had a prize behind it. The twist was that after the guest chose, the host, originally Monty Hall, would then open one of the doors the guest did not pick and ask if the guest wanted to switch which door they had picked. Initial inspection may lead you to believe that if there are only two doors left, there is a 50-50 chance of you picking the right one, and so there is no advantage one way or the other. However, it has been shown both through simulations and analytically that there is in fact a 2/3 (about 67%) chance of getting the prize if the guest switches their door, regardless of the door they initially went with.
Our network will have three nodes, one for the guest, one for the prize, and one for the door Monty chooses to open. The door the guest initially chooses and the door the prize is behind are uniform random processes across the three doors, but the door which Monty opens is dependent on both the door the guest chooses (it cannot be the door the guest chooses), and the door the prize is behind (it cannot be the door with the prize behind it).
from pomegranate import *
guest = DiscreteDistribution({'A': 1./3, 'B': 1./3, 'C': 1./3})
prize = DiscreteDistribution({'A': 1./3, 'B': 1./3, 'C': 1./3})
monty = ConditionalProbabilityTable(
[['A', 'A', 'A', 0.0],
['A', 'A', 'B', 0.5],
['A', 'A', 'C', 0.5],
['A', 'B', 'A', 0.0],
['A', 'B', 'B', 0.0],
['A', 'B', 'C', 1.0],
['A', 'C', 'A', 0.0],
['A', 'C', 'B', 1.0],
['A', 'C', 'C', 0.0],
['B', 'A', 'A', 0.0],
['B', 'A', 'B', 0.0],
['B', 'A', 'C', 1.0],
['B', 'B', 'A', 0.5],
['B', 'B', 'B', 0.0],
['B', 'B', 'C', 0.5],
['B', 'C', 'A', 1.0],
['B', 'C', 'B', 0.0],
['B', 'C', 'C', 0.0],
['C', 'A', 'A', 0.0],
['C', 'A', 'B', 1.0],
['C', 'A', 'C', 0.0],
['C', 'B', 'A', 1.0],
['C', 'B', 'B', 0.0],
['C', 'B', 'C', 0.0],
['C', 'C', 'A', 0.5],
['C', 'C', 'B', 0.5],
['C', 'C', 'C', 0.0]], [guest, prize])
s1 = Node(guest, name="guest")
s2 = Node(prize, name="prize")
s3 = Node(monty, name="monty")
model = BayesianNetwork("Monty Hall Problem")
model.add_states(s1, s2, s3)
model.add_edge(s1, s3)
model.add_edge(s2, s3)
model.bake()
Note
The objects ‘state’ and ‘node’ are really the same thing and can be used interchangeably. The only difference is the name, as the hidden Markov model literature frequently uses ‘state’ whereas the Bayesian network literature frequently uses ‘node’.
The conditional distribution must be explicitly spelled out in this example, followed by a list of the parents in the same order as their columns appear in the table that is provided (e.g. the columns in the table correspond to guest, prize, monty, probability).
However, one can also initialize a Bayesian network based completely on data. As mentioned before, the exact version of this algorithm takes exponential time with the number of variables and typically can’t be done on more than ~25 variables. This is because there are a super-exponential number of directed acyclic graphs that one could define over a set of variables, but fortunately one can use dynamic programming in order to reduce this complexity down to “simply exponential.” The implementation of the exact algorithm actually goes further than the original dynamic programming algorithm by implementing an A* search to somewhat reduce computational time but drastically reduce required memory, sometimes by an order of magnitude.
from pomegranate import *
import numpy
X = numpy.load('data.npy')
model = BayesianNetwork.from_samples(X, algorithm='exact')
The exact algorithm is not the default, though. The default is a novel greedy algorithm that greedily chooses a topological ordering of the variables, but optimally identifies the best parents for each variable given this ordering. It is significantly faster and more memory efficient than the exact algorithm and produces far better estimates than using a Chow-Liu tree. This is set to the default to avoid locking up the computers of users that unintentionally tell their computers to do a near-impossible task.
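To get a feel for why exhaustive structure search is infeasible, the number of labeled directed acyclic graphs on n variables can be computed with Robinson's recurrence. The sketch below is a side illustration and not part of pomegranate; the printed 2 ** n column is the number of variable subsets, shown only as a rough point of comparison for an exponential quantity.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))


for n in range(1, 8):
    print(n, count_dags(n), 2 ** n)
```

Already at five variables there are 29281 possible graphs, which is why heuristics such as the greedy default are attractive.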
Probability¶
You can calculate the probability of a sample under a Bayesian network as the product of the probability of each variable given its parents, if it has any. This can be expressed as \(P = \prod\limits_{i=1}^{d} P(D_{i} \mid Pa_{i})\) for a sample with \(d\) dimensions. For example, in the Monty Hall problem, the probability of a show is the probability of the guest choosing the respective door, times the probability of the prize being behind a given door, times the probability of Monty opening a given door given the previous two values. Using the manually initialized network above:
>>> print(model.probability([['A', 'A', 'A'],
...                          ['A', 'A', 'B'],
...                          ['C', 'C', 'B']]))
[ 0. 0.05555556 0.05555556]
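The same numbers can be checked by multiplying the factors by hand. The following plain Python sketch does not use pomegranate; `monty_given` and `joint` are illustrative helpers that encode the same rule as the conditional probability table (Monty opens, uniformly at random, a door that is neither the guest's choice nor the prize door).

```python
from itertools import product

doors = ['A', 'B', 'C']
guest = {d: 1 / 3 for d in doors}
prize = {d: 1 / 3 for d in doors}


def monty_given(g, p):
    """P(monty | guest=g, prize=p) as a dict over doors."""
    allowed = [d for d in doors if d != g and d != p]
    return {d: (1 / len(allowed) if d in allowed else 0.0) for d in doors}


def joint(g, p, m):
    """Product of each variable's probability given its parents."""
    return guest[g] * prize[p] * monty_given(g, p)[m]


print(joint('A', 'A', 'A'))  # 0.0, Monty never opens the guest's door
print(joint('A', 'A', 'B'))  # 1/3 * 1/3 * 1/2, about 0.0556
print(joint('C', 'C', 'B'))  # also about 0.0556

# Sanity check: the joint distribution sums to 1 over all 27 outcomes.
print(sum(joint(g, p, m) for g, p, m in product(doors, repeat=3)))
```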
Prediction¶
Bayesian networks are frequently used to infer/impute the value of missing variables given the observed values. In other models, typically there is either a single or fixed set of missing variables, such as latent factors, that need to be imputed, and so returning a fixed vector or matrix as the predictions makes sense. However, in the case of Bayesian networks, we can make no such assumptions, and so data passed in for prediction should be formatted as a matrix with None in place of the missing variables that need to be inferred. The return is thus a filled-in matrix where the Nones have been replaced with the imputed values. For example:
>>> print(model.predict([['A', 'B', None],
...                      ['A', 'C', None],
...                      ['C', 'B', None]]))
[['A' 'B' 'C']
['A' 'C' 'B']
['C' 'B' 'A']]
In this example, the final column is the one that is always missing, but a more complex example is as follows:
>>> print(model.predict([['A', 'B', None],
...                      ['A', None, 'C'],
...                      [None, 'B', 'A']]))
[['A' 'B' 'C']
['A' 'B' 'C']
['C' 'B' 'A']]
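For a network this small, the imputed values can be verified by brute-force enumeration instead of belief propagation. The sketch below is an assumption-laden illustration in plain Python, not pomegranate's algorithm: `joint`, `posterior`, and `impute` are hypothetical helpers, and enumeration only scales to a handful of variables.

```python
from itertools import product

doors = ['A', 'B', 'C']


def joint(g, p, m):
    """P(guest=g, prize=p, monty=m) for the Monty Hall network."""
    allowed = [d for d in doors if d != g and d != p]
    return (1 / 3) * (1 / 3) * (1 / len(allowed) if m in allowed else 0.0)


def posterior(evidence):
    """Per-variable posterior given a list with None for unknown values.

    Assumes the evidence has non-zero probability under the model.
    """
    marginals = [dict.fromkeys(doors, 0.0) for _ in evidence]
    for assignment in product(doors, repeat=len(evidence)):
        if any(e is not None and e != a for e, a in zip(evidence, assignment)):
            continue  # inconsistent with the observed values
        weight = joint(*assignment)
        for marginal, value in zip(marginals, assignment):
            marginal[value] += weight
    total = sum(marginals[0].values())
    return [{k: v / total for k, v in m.items()} for m in marginals]


def impute(evidence):
    """Replace each variable by its most probable value."""
    return [max(m, key=m.get) for m in posterior(evidence)]


print(posterior(['A', None, 'C'])[1])  # prize: 'B' has probability 2/3
print(impute(['A', None, 'C']))
print(impute([None, 'B', 'A']))
```

The posterior for the second row shows the familiar result: once the guest picks A and Monty opens C, the prize is behind B with probability 2/3.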
Fitting¶
Fitting a Bayesian network to data is a fairly simple process. Essentially, for each variable, you need to consider only that column of data and the columns corresponding to that variable's parents. If it is a univariate distribution, then the maximum likelihood estimate is just the count of each symbol divided by the number of samples in the data. If it is a conditional distribution, the estimate is the probability of each symbol in the variable of interest given each combination of symbols in the parents. For example, consider a binary dataset with two variables, X and Y, where X is a parent of Y. First, we would go through the dataset and calculate P(X=0) and P(X=1). Then, we would calculate P(Y=0|X=0), P(Y=1|X=0), P(Y=0|X=1), and P(Y=1|X=1). Those values encode all of the parameters of the Bayesian network.
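The X and Y example above amounts to a few lines of counting. This is a minimal sketch in plain Python on a made-up dataset; the variable names are illustrative and the code is independent of pomegranate.

```python
from collections import Counter

# Each row is one sample (x, y); the values are invented for illustration.
data = [(0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 0), (1, 1), (0, 1)]

x_counts = Counter(x for x, _ in data)
xy_counts = Counter(data)

# P(X=x): count of each symbol divided by the number of samples.
p_x = {x: c / len(data) for x, c in x_counts.items()}

# P(Y=y | X=x): joint count divided by the count of the parent value.
# Keys are (y, x) pairs.
p_y_given_x = {(y, x): xy_counts[(x, y)] / x_counts[x]
               for x in x_counts for y in (0, 1)}

print(p_x)
print(p_y_given_x)
```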
API Reference¶

class
pomegranate.BayesianNetwork.
BayesianNetwork
¶ A Bayesian Network Model.
A Bayesian network is a directed graph where nodes represent variables, edges represent conditional dependencies of the children on their parents, and the lack of an edge represents a conditional independence.
Parameters: name : str, optional
The name of the model. Default is None
Attributes
states (list, shape (n_states,)) A list of all the state objects in the model
graph (networkx.DiGraph) The underlying graph object.
add_edge
()¶ Add a transition from state a to state b, which indicates that b is dependent on a in ways specified by the distribution.

add_node
()¶ Add a node to the graph.

add_nodes
()¶ Add multiple states to the graph.

add_state
()¶ Another name for add_node.

add_states
()¶ Another name for add_nodes.

add_transition
()¶ Another name for add_edge. Transitions and edges are the same.

bake
()¶ Finalize the topology of the model.
Assign a numerical index to every state and create the underlying arrays corresponding to the states and edges between the states. This method must be called before any of the probability-calculating methods. This includes converting conditional probability tables into joint probability tables and creating a list of both marginal and table nodes.
Parameters: None Returns: None

clear_summaries
()¶ Clear the summary statistics stored in the object.

copy
()¶ Return a deep copy of this distribution object.
This object will not be tied to any other distribution or connected in any form.
Parameters: None
Returns: distribution : Distribution
A copy of the distribution with the same parameters.

dense_transition_matrix
()¶ Returns the dense transition matrix. Useful if the transitions of somewhat small models need to be analyzed.

edge_count
()¶ Returns the number of edges present in the model.

fit
()¶ Fit the model to data using MLE estimates.
Fit the model to the data by updating each of the components of the model, which are univariate or multivariate distributions. This uses a simple MLE estimate to update the distributions according to their summarize or fit methods.
This is a wrapper for the summarize and from_summaries methods.
Parameters: items : array-like, shape (n_samples, n_nodes)
The data to train on, where each row is a sample and each column corresponds to the associated variable.
weights : array-like, shape (n_nodes), optional
The weight of each sample as a positive double. Default is None.
inertia : double, optional
The inertia for updating the distributions, passed along to the distribution method. Default is 0.0.
pseudocount : double, optional
A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent 0. probability symbols if they don’t happen to occur in the data. Only affects hidden Markov models defined over discrete distributions. Default is 0.
Returns: None

freeze
()¶ Freeze the distribution, preventing updates from occurring.

from_json
()¶ Read in a serialized Bayesian Network and return the appropriate object.
Parameters: s : str
A JSON formatted string containing the file.
Returns: model : object
A properly initialized and baked model.

from_samples
()¶ Learn the structure of the network from data.
Find the structure of the network from data using a Bayesian structure learning score. This currently enumerates the exponential number of structures and finds the best according to the score. This allows weights on the different samples as well. The score that is optimized is the minimum description length (MDL).
Parameters: X : array-like, shape (n_samples, n_nodes)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
weights : array-like, shape (n_nodes), optional
The weight of each sample as a positive double. Default is None.
algorithm : str, one of ‘chow-liu’, ‘greedy’, ‘exact’, ‘exact-dp’, optional
The algorithm to use for learning the Bayesian network. Default is ‘greedy’, which greedily attempts to find the best structure and frequently can identify the optimal structure. ‘exact’ uses DP/A* to find the optimal Bayesian network, and ‘exact-dp’ tries to find the shortest path on the entire order lattice, which is more memory and computationally expensive. ‘exact’ and ‘exact-dp’ should give identical results, with ‘exact-dp’ remaining an option mostly for debugging reasons. ‘chow-liu’ will return the optimal tree-like structure for the Bayesian network, which is a very fast approximation but not always the best network.
max_parents : int, optional
The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, no max on parents. Default is -1.
root : int, optional
For algorithms which require a single root (‘chow-liu’), this is the root from which all edges point away. User may specify which column to use as the root. Default is the first column.
constraint_graph : networkx.DiGraph or None, optional
A directed graph showing valid parent sets for each variable. Each node is a set of variables, and edges represent which variables can be valid parents of those variables. The naive structure learning task is just all variables in a single node with a self edge, meaning that you know nothing about
pseudocount : double, optional
A pseudocount to add to the emission of each distribution. This effectively smooths the states to prevent 0. probability symbols if they don’t happen to occur in the data. Default is 0.
state_names : array-like, shape (n_nodes), optional
A list of meaningful names to be applied to nodes
name : str, optional
The name of the model. Default is None.
n_jobs : int, optional
The number of threads to use when learning the structure of the network. If a constraint graph is provided, this will parallelize the tasks as directed by the constraint graph. If one is not provided it will parallelize the building of the parent graphs. Both cases will provide large speed gains.
Returns: model : BayesianNetwork
The learned BayesianNetwork.

from_structure
()¶ Return a Bayesian network from a predefined structure.
Pass in the structure of the network as a tuple of tuples and get a fit network in return. The tuple should contain n tuples, with one for each node in the graph. Each inner tuple should be of the parents for that node. For example, a three node graph where both node 0 and 1 have node 2 as a parent would be specified as ((2,), (2,), ()).
Parameters: X : array-like, shape (n_samples, n_nodes)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
structure : tuple of tuples
The parents for each node in the graph. If a node has no parents, then do not specify any parents.
weights : array-like, shape (n_nodes), optional
The weight of each sample as a positive double. Default is None.
pseudocount : double, optional
A pseudocount to add to the emission of each distribution. This effectively smoothes the states to prevent 0. probability symbols if they don’t happen to occur in the data. Default is 0.
name : str, optional
The name of the model. Default is None.
state_names : array-like, shape (n_nodes), optional
A list of meaningful names to be applied to nodes
Returns: model : BayesianNetwork
A Bayesian network with the specified structure.
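The tuple-of-tuples encoding described above can be read as "entry i lists the parents of node i". As a small illustration, the hypothetical helper below (not part of pomegranate) expands a structure into explicit (parent, child) edges.

```python
def structure_to_edges(structure):
    """Convert a tuple of parent tuples into a list of (parent, child) edges."""
    return [(parent, child)
            for child, parents in enumerate(structure)
            for parent in parents]


# Nodes 0 and 1 both have node 2 as their parent.
print(structure_to_edges(((2,), (2,), ())))   # [(2, 0), (2, 1)]

# The Monty Hall layout: guest (0) and prize (1) are parents of monty (2).
print(structure_to_edges(((), (), (0, 1))))   # [(0, 2), (1, 2)]
```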

from_summaries
()¶ Use MLE on the stored sufficient statistics to train the model.
This uses MLE estimates on the stored sufficient statistics to train the model.
Parameters: inertia : double, optional
The inertia for updating the distributions, passed along to the distribution method. Default is 0.0.
pseudocount : double, optional
A pseudocount to add to the emission of each distribution. This effectively smoothes the states to prevent 0. probability symbols if they don’t happen to occur in the data. Default is 0.
Returns: None

log_probability
()¶ Return the log probability of samples under the Bayesian network.
The log probability is just the sum of the log probabilities under each of the components. For example, the log probability of a sample under the graph A -> B is log P(A) + log P(B|A). This will return a vector of log probabilities, one for each sample.
Parameters: X : array-like, shape (n_samples, n_dim)
The sample is a vector of points where each dimension represents the same variable as added to the graph originally. It doesn’t matter what the connections between these variables are, just that they are all ordered the same.
Returns: logp : double
The log probability of that sample.

marginal
()¶ Return the marginal probabilities of each variable in the graph.
This is equivalent to a pass of belief propagation on a graph where no data has been given. This will calculate the probability of each variable being in each possible emission when nothing is known.
Parameters: None
Returns: marginals : array-like, shape (n_nodes)
An array of univariate distribution objects showing the marginal probabilities of that variable.

node_count
()¶ Returns the number of nodes/states in the model

plot
()¶ Draw this model’s graph using NetworkX and matplotlib.
Note that this relies on networkx’s built-in graphing capabilities (and not Graphviz) and thus can’t draw self-loops.
See networkx.draw_networkx() for the keywords you can pass in.
Parameters: **kwargs : any
The arguments to pass into networkx.draw_networkx()
Returns: None

predict
()¶ Predict missing values of a data matrix using MLE.
Impute the missing values of a data matrix using the maximally likely predictions according to the forward-backward algorithm. Run each sample through the algorithm (predict_proba) and replace missing values with the maximally likely predicted emission.
Parameters: items : array-like, shape (n_samples, n_nodes)
Data matrix to impute. Missing values must be either None (if lists) or np.nan (if numpy.ndarray). Will fill in these values with the maximally likely ones.
max_iterations : int, optional
Number of iterations to run loopy belief propagation for. Default is 100.
Returns: items : numpy.ndarray, shape (n_samples, n_nodes)
This is the data matrix with the missing values imputed.

predict_proba
()¶ Returns the probabilities of each variable in the graph given evidence.
This calculates the marginal probability distributions for each state given the evidence provided through loopy belief propagation. Loopy belief propagation is an approximate algorithm which is exact for certain graph structures.
Parameters: data : dict or array-like, shape <= n_nodes, optional
The evidence supplied to the graph. This can either be a dictionary with keys being state names and values being the observed values (either the emissions or a distribution over the emissions) or an array with the values being ordered according to the nodes' incorporation in the graph (the order fed into .add_states/add_nodes) and None for variables which are unknown. If nothing is fed in then calculate the marginal of the graph. Default is {}.
max_iterations : int, optional
The number of iterations with which to do loopy belief propagation. Usually requires only 1. Default is 100.
check_input : bool, optional
Check to make sure that the observed symbol is a valid symbol for that distribution to produce. Default is True.
Returns: probabilities : array-like, shape (n_nodes)
An array of univariate distribution objects showing the probabilities of each variable.

probability
()¶ Return the probability of the given symbol under this distribution.
Parameters: symbol : object
The symbol to calculate the probability of
Returns: probability : double
The probability of that point under the distribution.

sample
()¶ Return a random item sampled from this distribution.
Parameters: n : int or None, optional
The number of samples to return. Default is None, which is to generate a single sample.
Returns: sample : double or object
Returns a sample from the distribution of a type in the support of the distribution.

state_count
()¶ Returns the number of states present in the model.

summarize
()¶ Summarize a batch of data and store the sufficient statistics.
This will partition the dataset into columns which belong to their appropriate distribution. If the distribution has parents, then multiple columns are sent to the distribution. This relies mostly on the summarize function of the underlying distribution.
Parameters: items : array-like, shape (n_samples, n_nodes)
The data to train on, where each row is a sample and each column corresponds to the associated variable.
weights : array-like, shape (n_nodes), optional
The weight of each sample as a positive double. Default is None.
Returns: None

thaw
()¶ Thaw the distribution, reallowing updates to occur.

to_json
()¶ Serialize the model to a JSON.
Parameters: separators : tuple, optional
The two separators to pass to the json.dumps function for formatting.
indent : int, optional
The indentation to use at each level. Passed to json.dumps for formatting.
Returns: json : str
A properly formatted JSON object.


class
pomegranate.BayesianNetwork.
ParentGraph
¶ Generate a parent graph for a single variable over its parents.
This will generate the parent graph for a single variable given the data. A parent graph is the dynamically generated best parent set and respective score for each combination of parent variables. For example, if we are generating a parent graph for x1 over x2, x3, and x4, we may calculate that having x2 as a parent is better than x2,x3 and so store the value of x2 in the node for x2,x3.
Parameters: X : numpy.ndarray, shape=(n, d)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
weights : numpy.ndarray, shape=(n,)
The weight of each sample as a positive double. Default is None.
key_count : numpy.ndarray, shape=(d,)
The number of unique keys in each column.
pseudocount : double
A pseudocount to add to each possibility.
max_parents : int
The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, no max on parents. Default is -1.
parent_set : tuple, default ()
The variables which are possible parents for this variable. If nothing is passed in then it defaults to all other variables, as one would expect in the naive case. This allows for cases where we want to build a parent graph over only a subset of the variables.
Returns: structure : tuple, shape=(d,)
The parents for each variable in this SCC

pomegranate.BayesianNetwork.
discrete_exact_a_star
()¶ Find the optimal graph over a set of variables with no other knowledge.
This is the naive dynamic programming structure learning task where the optimal graph is identified from a set of variables using an order graph and parent graphs. This can be used either when no constraint graph is provided or for a SCC which is made up of a node containing a self-loop. It uses DP/A* in order to find the optimal graph without considering all possible topological sorts. A greedy version of the algorithm can be used that massively reduces both the computational and memory cost while frequently producing the optimal graph.
Parameters: X : numpy.ndarray, shape=(n, d)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
weights : numpy.ndarray, shape=(n,)
The weight of each sample as a positive double. Default is None.
key_count : numpy.ndarray, shape=(d,)
The number of unique keys in each column.
pseudocount : double
A pseudocount to add to each possibility.
max_parents : int
The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, no max on parents. Default is -1.
n_jobs : int
The number of threads to use when learning the structure of the network. This parallelizes the creation of the parent graphs.
Returns: structure : tuple, shape=(d,)
The parents for each variable in this SCC

pomegranate.BayesianNetwork.
discrete_exact_component
()¶ Find the optimal graph over a multi-node component of the constraint graph.
The general algorithm in this case is to begin with each variable and add all possible single children for that entry recursively until completion. This will result in a far sparser order graph than before. In addition, one can eliminate entries from the parent graphs that contain invalid parents, as they are a waste of computational time.
Parameters: X : numpy.ndarray, shape=(n, d)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
weights : numpy.ndarray, shape=(n,)
The weight of each sample as a positive double. Default is None.
key_count : numpy.ndarray, shape=(d,)
The number of unique keys in each column.
pseudocount : double
A pseudocount to add to each possibility.
max_parents : int
The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, no max on parents. Default is -1.
n_jobs : int
The number of threads to use when learning the structure of the network. This parallelizes the creation of the parent graphs.
Returns: structure : tuple, shape=(d,)
The parents for each variable in this SCC

pomegranate.BayesianNetwork.
discrete_exact_dp
()¶ Find the optimal graph over a set of variables with no other knowledge.
This is the naive dynamic programming structure learning task where the optimal graph is identified from a set of variables using an order graph and parent graphs. This can be used either when no constraint graph is provided or for an SCC which is made up of a node containing a self-loop. This is a reference implementation that uses the naive shortest path algorithm over the entire order graph. The ‘exact’ option uses the A* path in order to avoid considering the full order graph.
Parameters: X : numpy.ndarray, shape=(n, d)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
weights : numpy.ndarray, shape=(n,)
The weight of each sample as a positive double. Default is None.
key_count : numpy.ndarray, shape=(d,)
The number of unique keys in each column.
pseudocount : double
A pseudocount to add to each possibility.
max_parents : int
The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, no max on parents. Default is -1.
n_jobs : int
The number of threads to use when learning the structure of the network. This parallelizes the creation of the parent graphs.
Returns: structure : tuple, shape=(d,)
The parents for each variable in this SCC
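The order-graph search described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not pomegranate's implementation: `exact_dp`, `best_parents`, and `toy_score` are hypothetical names, the score is a made-up cost table where lower is better, and sample weights, pseudocounts, and `max_parents` are ignored.

```python
from itertools import combinations


def best_parents(v, candidates, score):
    """Return (cost, parents) minimising score(v, parents) over subsets of candidates."""
    best = (score(v, ()), ())
    for k in range(1, len(candidates) + 1):
        for parents in combinations(sorted(candidates), k):
            cost = score(v, parents)
            if cost < best[0]:
                best = (cost, parents)
    return best


def exact_dp(variables, score):
    """Shortest path through the order graph, whose nodes are subsets of variables."""
    variables = tuple(variables)
    # Maps a set of already-placed variables to (total cost, {variable: parents}).
    table = {frozenset(): (0.0, {})}
    for size in range(len(variables)):
        for subset in combinations(variables, size):
            placed = frozenset(subset)
            cost, structure = table[placed]
            for v in variables:
                if v in placed:
                    continue
                # Edge cost: best parent set for v drawn from the placed variables.
                edge_cost, parents = best_parents(v, placed, score)
                successor = placed | {v}
                total = cost + edge_cost
                if successor not in table or total < table[successor][0]:
                    table[successor] = (total, {**structure, v: parents})
    total, structure = table[frozenset(variables)]
    return total, tuple(structure[v] for v in variables)


def toy_score(v, parents):
    """Hypothetical cost table; any combination not listed costs 10.0."""
    costs = {
        (0, ()): 5.0,
        (1, ()): 6.0, (1, (0,)): 2.0,
        (2, ()): 7.0, (2, (1,)): 3.0, (2, (0, 1)): 3.5,
    }
    return costs.get((v, tuple(parents)), 10.0)


total, structure = exact_dp([0, 1, 2], toy_score)
print(total, structure)  # 10.0 ((), (0,), (1,))
```

The returned tuple has one entry per variable listing its parents, mirroring the `structure` return value documented above. The table grows with the number of subsets, which is why this naive search is only practical for a small number of variables.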

pomegranate.BayesianNetwork.
discrete_exact_slap
()¶ Find the optimal graph in a node with a Self Loop And Parents (SLAP).
Instead of just performing exact BNSL over the set of all parents and removing the offending edges there are efficiencies that can be gained by considering the structure. In particular, parents not coming from the main node do not need to be considered in the order graph but simply added to each entry after creation of the order graph. This is because those variables occur earlier in the topological ordering but it doesn’t matter how they occur otherwise. Parent graphs must be defined over all variables however.
Parameters: X : numpy.ndarray, shape=(n, d)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
weights : numpy.ndarray, shape=(n,)
The weight of each sample as a positive double. Default is None.
key_count : numpy.ndarray, shape=(d,)
The number of unique keys in each column.
pseudocount : double
A pseudocount to add to each possibility.
max_parents : int
The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, no max on parents. Default is -1.
n_jobs : int
The number of threads to use when learning the structure of the network. This parallelizes the creation of the parent graphs.
Returns: structure : tuple, shape=(d,)
The parents for each variable in this SCC

pomegranate.BayesianNetwork.
discrete_exact_with_constraints
()¶ This returns the optimal Bayesian network given a set of constraints.
This function controls the process of learning the Bayesian network by taking in a constraint graph, identifying the strongly connected components (SCCs) and solving each one using the appropriate algorithm. This is mostly an internal function.
Parameters: X : numpy.ndarray, shape=(n, d)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
weights : numpy.ndarray, shape=(n,)
The weight of each sample as a positive double. Default is None.
key_count : numpy.ndarray, shape=(d,)
The number of unique keys in each column.
pseudocount : double
A pseudocount to add to each possibility.
max_parents : int
The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, no max on parents. Default is -1.
constraint_graph : networkx.DiGraph
A directed graph showing valid parent sets for each variable. Each node is a set of variables, and edges represent which variables can be valid parents of those variables. The naive structure learning task is just all variables in a single node with a self edge, meaning that you know nothing about
n_jobs : int
The number of threads to use when learning the structure of the network. This parallelizes both the creation of the parent graphs for each variable and the solving of the SCCs.
Returns: structure : tuple, shape=(d,)
The parents for each variable in the network.
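As a rough illustration of the first step, identifying the strongly connected components of a constraint graph, the sketch below uses plain Python so that it has no dependencies. The library itself takes a `networkx.DiGraph`, and networkx provides `strongly_connected_components` for this purpose. The function name, the edge-list representation, and the example graph here are assumptions made for the illustration only.

```python
def strongly_connected_components(nodes, edges):
    """Group nodes that can reach each other; fine for small constraint graphs."""
    successors = {n: set() for n in nodes}
    for source, target in edges:
        successors[source].add(target)

    def reachable(start):
        # Depth-first search collecting every node reachable from start.
        seen, stack = {start}, [start]
        while stack:
            for nxt in successors[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {n: reachable(n) for n in nodes}
    components, assigned = [], set()
    for n in nodes:
        if n in assigned:
            continue
        # Two nodes share a component when each can reach the other.
        component = [m for m in nodes if m in reach[n] and n in reach[m]]
        components.append(component)
        assigned.update(component)
    return components


# Each node of the constraint graph is a tuple of variable indices.
a, b, c = (0, 1), (2, 3), (4,)
edges = [(a, a), (a, b), (b, c), (c, b)]
components = strongly_connected_components([a, b, c], edges)
print(components)  # [[(0, 1)], [(2, 3), (4,)]]
```

Here node `a` forms a component on its own (with a self edge), while `b` and `c` point at each other and therefore have to be solved together.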

pomegranate.BayesianNetwork.
discrete_exact_with_constraints_task
()¶ This is a wrapper for the function to be parallelized by joblib.
This function takes in a single task as an id and a set of parents and children and calls the appropriate function. This is mostly a wrapper for joblib to parallelize.
Parameters: X : numpy.ndarray, shape=(n, d)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
weights : numpy.ndarray, shape=(n,)
The weight of each sample as a positive double. Default is None.
key_count : numpy.ndarray, shape=(d,)
The number of unique keys in each column.
pseudocount : double
A pseudocount to add to each possibility.
max_parents : int
The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, no max on parents. Default is -1.
task : tuple
A 3-tuple containing the id, the set of parents and the set of children to learn a component of the Bayesian network over. The cases represent an SCC of the following:
0 - Self loop and no parents
1 - Self loop and parents
2 - Parents and no self loop
3 - Multiple nodes
n_jobs : int
The number of threads to use when learning the structure of the network. This parallelizes the creation of the parent graphs for each task or the finding of best parents in case 2.
Returns: structure : tuple, shape=(d,)
The parents for each variable in this SCC
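One plausible way to assemble such a task tuple from a component of the constraint graph is sketched below. How pomegranate builds its tasks internally is not described in this section, so `make_task`, the edge-set representation, and the choice to list only parents from outside the component are all assumptions for illustration.

```python
def make_task(component, edges):
    """Return (case id, parent variables, child variables) for one SCC.

    component: list of constraint-graph nodes, each a tuple of variable indices.
    edges: set of (source node, target node) pairs.
    """
    members = set(component)
    children = tuple(sorted(v for node in component for v in node))
    # Nodes outside the component with an edge into it.
    external = [s for s, t in edges if t in members and s not in members]
    parents = tuple(sorted({v for node in external for v in node}))

    if len(component) > 1:
        case = 3  # multiple nodes
    else:
        self_loop = (component[0], component[0]) in edges
        if self_loop and not external:
            case = 0  # self loop and no parents
        elif self_loop:
            case = 1  # self loop and parents
        elif external:
            case = 2  # parents and no self loop
        else:
            # Not one of the four listed cases; this sketch simply rejects it.
            raise ValueError("node has neither a self loop nor parents")
    return case, parents, children


a, b, c, d, e = (0, 1), (2, 3), (4,), (5,), (6,)
edges = {(a, a), (a, b), (b, b), (a, c), (c, d), (d, c), (a, e)}

print(make_task([a], edges))     # (0, (), (0, 1))
print(make_task([b], edges))     # (1, (0, 1), (2, 3))
print(make_task([e], edges))     # (2, (0, 1), (6,))
print(make_task([c, d], edges))  # (3, (0, 1), (4, 5))
```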

pomegranate.BayesianNetwork.
discrete_greedy
()¶ Find the optimal graph over a set of variables with no other knowledge.
This is the naive dynamic programming structure learning task where the optimal graph is identified from a set of variables using an order graph and parent graphs. This can be used either when no constraint graph is provided or for an SCC which is made up of a node containing a self-loop. It uses DP/A* in order to find the optimal graph without considering all possible topological sorts. A greedy version of the algorithm can be used that massively reduces both the computational and memory cost while frequently producing the optimal graph.
Parameters: X : numpy.ndarray, shape=(n, d)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
weights : numpy.ndarray, shape=(n,)
The weight of each sample as a positive double. Default is None.
key_count : numpy.ndarray, shape=(d,)
The number of unique keys in each column.
pseudocount : double
A pseudocount to add to each possibility.
max_parents : int
The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, no max on parents. Default is -1.
greedy : bool, default is True
Whether to use a heuristic in order to massively reduce computation time and memory use, but without the guarantee of finding the best network.
n_jobs : int
The number of threads to use when learning the structure of the network. This parallelizes the creation of the parent graphs.
Returns: structure : tuple, shape=(d,)
The parents for each variable in this SCC
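To illustrate the trade-off of a greedy search, the sketch below walks the order graph by always placing the variable that is currently cheapest, then compares the result with an exhaustive search over all orderings. The exact heuristic pomegranate uses may differ; `greedy_order`, `exhaustive_cost`, and the cost table (lower is better) are hypothetical.

```python
from itertools import combinations, permutations


def best_parents(v, candidates, score):
    """Return (cost, parents) minimising score(v, parents) over subsets of candidates."""
    best = (score(v, ()), ())
    for k in range(1, len(candidates) + 1):
        for parents in combinations(sorted(candidates), k):
            cost = score(v, parents)
            if cost < best[0]:
                best = (cost, parents)
    return best


def greedy_order(variables, score):
    """Follow a single path through the order graph, cheapest step first."""
    placed, structure, total = [], {}, 0.0
    remaining = list(variables)
    while remaining:
        cost, parents, v = min(
            best_parents(u, placed, score) + (u,) for u in remaining
        )
        structure[v] = parents
        total += cost
        placed.append(v)
        remaining.remove(v)
    return total, tuple(structure[v] for v in variables)


def exhaustive_cost(variables, score):
    """Best total cost over every topological ordering (for comparison only)."""
    return min(
        sum(best_parents(v, order[:i], score)[0] for i, v in enumerate(order))
        for order in permutations(variables)
    )


# Hypothetical costs chosen so that the greedy choice is not optimal.
COSTS = {(0, ()): 4.0, (1, ()): 5.0, (0, (1,)): 1.0, (1, (0,)): 3.0}


def toy_score(v, parents):
    return COSTS[(v, tuple(parents))]


greedy_total, greedy_structure = greedy_order([0, 1], toy_score)
optimal_total = exhaustive_cost([0, 1], toy_score)
print(greedy_total, greedy_structure, optimal_total)  # 7.0 ((), (0,)) 6.0
```

In this toy case the greedy walk places variable 0 first because it is cheapest on its own, and so misses the cheaper network in which 1 is the parent of 0. The greedy walk evaluates only one path through the order graph, which is where the savings in time and memory come from.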

pomegranate.BayesianNetwork.
generate_parent_graph
()¶ Generate a parent graph for a single variable over its parents.
This will generate the parent graph for a single variable given the data. A parent graph is the dynamically generated best parent set and respective score for each combination of parent variables. For example, if we are generating a parent graph for x1 over x2, x3, and x4, we may calculate that having x2 as a parent is better than x2,x3 and so store the value of x2 in the node for x2,x3.
Parameters: X : numpy.ndarray, shape=(n, d)
The data to fit the structure to, where each row is a sample and each column corresponds to the associated variable.
weights : numpy.ndarray, shape=(n,)
The weight of each sample as a positive double. Default is None.
key_count : numpy.ndarray, shape=(d,)
The number of unique keys in each column.
pseudocount : double
A pseudocount to add to each possibility.
max_parents : int
The maximum number of parents a node can have. If used, this means using the k-learn procedure. Can drastically speed up algorithms. If -1, no max on parents. Default is -1.
parent_set : tuple, default ()
The variables which are possible parents for this variable. If nothing is passed in then it defaults to all other variables, as one would expect in the naive case. This allows for cases where we want to build a parent graph over only a subset of the variables.
Returns: structure : tuple, shape=(d,)
The parents for each variable in this SCC
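The x1 over x2, x3, x4 example above can be made concrete with a small sketch. This is an illustration under assumptions rather than the library's code: `toy_parent_graph` is a hypothetical name, the costs are invented (lower is better), and `max_parents` is ignored.

```python
from itertools import combinations


def toy_parent_graph(variable, candidates, score):
    """Map every subset of candidates to the best (cost, parent set) found within it."""
    candidates = tuple(sorted(candidates))
    graph = {}
    # Visit subsets from smallest to largest so smaller entries already exist.
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            best = (score(variable, subset), subset)
            for dropped in subset:
                smaller = tuple(c for c in subset if c != dropped)
                # Inherit the entry of a smaller subset when it scores better.
                if graph[smaller][0] < best[0]:
                    best = graph[smaller]
            graph[subset] = best
    return graph


# Hypothetical costs for variable 1 given each parent set drawn from {2, 3, 4}.
COSTS = {
    (): 9.0,
    (2,): 4.0, (3,): 6.0, (4,): 8.0,
    (2, 3): 5.0, (2, 4): 3.0, (3, 4): 7.0,
    (2, 3, 4): 3.5,
}


def toy_score(variable, parents):
    return COSTS[tuple(parents)]


graph = toy_parent_graph(1, (2, 3, 4), toy_score)
print(graph[(2, 3)])     # (4.0, (2,))
print(graph[(2, 3, 4)])  # (3.0, (2, 4))
```

The entry for `(2, 3)` stores the parent set `(2,)` because using x2 alone is cheaper than using x2 and x3 together, which is the situation described in the text.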
Factor Graphs¶
API Reference¶

class
pomegranate.FactorGraph.
FactorGraph
¶ A Factor Graph model.
A bipartite graph where conditional probability tables are on one side, and marginals for each of the variables involved are on the other side.
Parameters: name : str, optional
The name of the model. Default is None.

bake
()¶ Finalize the topology of the model.
Assign a numerical index to every state and create the underlying arrays corresponding to the states and edges between the states. This method must be called before any of the probability-calculating methods. This is the same as the HMM bake, except that at the end it sets current state information.
Parameters: None Returns: None

marginal
()¶ Return the marginal probabilities of each variable in the graph.
This is equivalent to a pass of belief propagation on a graph where no data has been given. This will calculate the probability of each variable being in each possible emission when nothing is known.
Parameters: None
Returns: marginals : array-like, shape (n_nodes)
An array of univariate distribution objects showing the marginal probabilities of that variable.
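For intuition about what the marginals are, the sketch below computes them for a two-variable factor graph by brute-force enumeration of the joint distribution. It does not use pomegranate and does not run belief propagation; on a tree-structured graph such as this one, belief propagation yields the same values. The function name, the dictionary-based factor representation, and the probabilities are assumptions for illustration.

```python
from itertools import product


def enumerate_marginals(domains, factors):
    """Marginal of every variable, obtained by summing over all joint assignments.

    domains: {variable name: list of values}
    factors: list of (scope, table), where table maps value tuples to weights.
    """
    names = list(domains)
    totals = {n: {value: 0.0 for value in domains[n]} for n in names}
    normaliser = 0.0
    for assignment in product(*(domains[n] for n in names)):
        world = dict(zip(names, assignment))
        weight = 1.0
        for scope, table in factors:
            weight *= table[tuple(world[n] for n in scope)]
        normaliser += weight
        for n in names:
            totals[n][world[n]] += weight
    return {
        n: {value: w / normaliser for value, w in totals[n].items()}
        for n in names
    }


domains = {"A": [0, 1], "B": [0, 1]}
factors = [
    (("A",), {(0,): 0.75, (1,): 0.25}),                  # P(A)
    (("A", "B"), {(0, 0): 0.5, (0, 1): 0.5,
                  (1, 0): 0.25, (1, 1): 0.75}),          # P(B | A)
]

marginals = enumerate_marginals(domains, factors)
print(marginals["B"])  # {0: 0.4375, 1: 0.5625}
```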

plot
()¶ Draw this model’s graph using NetworkX and matplotlib.
Note that this relies on networkx’s built-in graphing capabilities (and not Graphviz) and thus can’t draw self-loops.
See networkx.draw_networkx() for the keywords you can pass in.
Parameters: **kwargs : any
The arguments to pass into networkx.draw_networkx()
Returns: None

predict_proba
()¶ Returns the probabilities of each variable in the graph given evidence.
This calculates the marginal probability distributions for each state given the evidence provided through loopy belief propagation. Loopy belief propagation is an approximate algorithm which is exact for certain graph structures.
Parameters: data : dict or array-like, shape <= n_nodes, optional
The evidence supplied to the graph. This can either be a dictionary with keys being state names and values being the observed values (either the emissions or a distribution over the emissions) or an array with the values being ordered according to the nodes' incorporation in the graph (the order fed into .add_states/add_nodes) and None for variables which are unknown. If nothing is fed in, the marginal of the graph is calculated.
max_iterations : int, optional
The number of iterations with which to do loopy belief propagation. Usually requires only 1.
check_input : bool, optional
Check to make sure that the observed symbol is a valid symbol for that distribution to produce.
Returns: probabilities : array-like, shape (n_nodes)
An array of univariate distribution objects showing the probabilities of each variable.
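The following is a minimal, dependency-free sketch of sum-product message passing with evidence on a small factor graph. It is meant only to illustrate the idea of loopy belief propagation and is not pomegranate's implementation: `loopy_bp`, the dictionary-based factors, the way evidence is clamped with an indicator factor, and the default iteration count are all assumptions. It also assumes the evidence is consistent with the model, otherwise a normalisation step would divide by zero.

```python
from itertools import product
from math import prod


def loopy_bp(domains, factors, evidence=None, max_iterations=5):
    """Return {variable: {value: probability}} after sum-product message passing."""
    factors = list(factors)
    # Clamp each observed variable by adding an indicator factor for it.
    for var, observed in (evidence or {}).items():
        table = {(value,): float(value == observed) for value in domains[var]}
        factors.append(((var,), table))

    # All messages start out uniform.
    to_factor = {(v, i): {x: 1.0 for x in domains[v]}
                 for i, (scope, _) in enumerate(factors) for v in scope}
    to_var = {(i, v): {x: 1.0 for x in domains[v]}
              for i, (scope, _) in enumerate(factors) for v in scope}

    for _ in range(max_iterations):
        # Factor -> variable: sum out every other variable in the factor's scope.
        for i, (scope, table) in enumerate(factors):
            for v in scope:
                message = {x: 0.0 for x in domains[v]}
                for values in product(*(domains[u] for u in scope)):
                    world = dict(zip(scope, values))
                    weight = table[values]
                    for u in scope:
                        if u != v:
                            weight *= to_factor[(u, i)][world[u]]
                    message[world[v]] += weight
                to_var[(i, v)] = message
        # Variable -> factor: product of messages from all other factors.
        for (v, i) in to_factor:
            message = {x: prod(to_var[(j, u)][x]
                               for (j, u) in to_var if u == v and j != i)
                       for x in domains[v]}
            total = sum(message.values())
            to_factor[(v, i)] = {x: w / total for x, w in message.items()}

    beliefs = {}
    for v in domains:
        raw = {x: prod(to_var[(j, u)][x] for (j, u) in to_var if u == v)
               for x in domains[v]}
        total = sum(raw.values())
        beliefs[v] = {x: w / total for x, w in raw.items()}
    return beliefs


domains = {"A": [0, 1], "B": [0, 1]}
factors = [
    (("A",), {(0,): 0.75, (1,): 0.25}),                  # P(A)
    (("A", "B"), {(0, 0): 0.5, (0, 1): 0.5,
                  (1, 0): 0.25, (1, 1): 0.75}),          # P(B | A)
]

beliefs = loopy_bp(domains, factors, evidence={"B": 1})
print(beliefs["A"])  # approximately {0: 0.667, 1: 0.333}
```

Because this example graph is a tree, the beliefs match the exact posterior P(A | B = 1). On graphs with cycles the same updates give an approximation, and more iterations may be needed before the messages settle.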
