lda: Topic modeling with latent Dirichlet Allocation¶
lda implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. lda is fast and can be installed without a compiler on Linux and macOS.
The interface follows conventions found in scikit-learn. The following
demonstrates how to inspect a model of a subset of the Reuters news dataset.
(The input below, X, is a document-term matrix.)
>>> import numpy as np
>>> import lda
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X) # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_ # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: british churchill sale million major letters west
Topic 1: church government political country state people party
Topic 2: elvis king fans presley life concert young
Topic 3: yeltsin russian russia president kremlin moscow michael
Topic 4: pope vatican paul john surgery hospital pontiff
Topic 5: family funeral police miami versace cunanan city
Topic 6: simpson former years court president wife south
Topic 7: order mother successor election nuns church nirmala
Topic 8: charles prince diana royal king queen parker
Topic 9: film french france against bardot paris poster
Topic 10: germany german war nazi letter christian book
Topic 11: east peace prize award timor quebec belo
Topic 12: n't life show told very love television
Topic 13: years year time last church world people
Topic 14: mother teresa heart calcutta charity nun hospital
Topic 15: city salonika capital buddhist cultural vietnam byzantine
Topic 16: music tour opera singer israel people film
Topic 17: church catholic bernardin cardinal bishop wright death
Topic 18: harriman clinton u.s ambassador paris president churchill
Topic 19: city museum art exhibition century million churches
NOTE: This package is in maintenance mode. Critical bugs will be fixed. No new features will be added.
Getting started¶
The following demonstrates how to inspect a model of a subset of the Reuters
news dataset. The input below, X, is a document-term matrix (sparse matrices are accepted).
>>> import numpy as np
>>> import lda
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X) # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_ # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: british churchill sale million major letters west
Topic 1: church government political country state people party
Topic 2: elvis king fans presley life concert young
Topic 3: yeltsin russian russia president kremlin moscow michael
Topic 4: pope vatican paul john surgery hospital pontiff
Topic 5: family funeral police miami versace cunanan city
Topic 6: simpson former years court president wife south
Topic 7: order mother successor election nuns church nirmala
Topic 8: charles prince diana royal king queen parker
Topic 9: film french france against bardot paris poster
Topic 10: germany german war nazi letter christian book
Topic 11: east peace prize award timor quebec belo
Topic 12: n't life show told very love television
Topic 13: years year time last church world people
Topic 14: mother teresa heart calcutta charity nun hospital
Topic 15: city salonika capital buddhist cultural vietnam byzantine
Topic 16: music tour opera singer israel people film
Topic 17: church catholic bernardin cardinal bishop wright death
Topic 18: harriman clinton u.s ambassador paris president churchill
Topic 19: city museum art exhibition century million churches
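Note that the slice `[:-n_top_words:-1]` used above keeps `n_top_words - 1` entries, which is why each topic line lists seven words although `n_top_words` is 8. The following is a minimal sketch of a helper that returns exactly `n` words. The name `top_words` is hypothetical (it is not part of lda), and the toy vocabulary and probabilities are invented stand-ins for `vocab` and one row of `model.topic_word_`.

```python
import numpy as np


def top_words(topic_dist, vocab, n):
    """Return the n highest-probability words of one topic, best first."""
    # argsort is ascending, so reverse it and keep the first n indices
    order = np.argsort(topic_dist)[::-1][:n]
    return [vocab[i] for i in order]


# toy stand-ins for vocab and one row of model.topic_word_
toy_vocab = ("church", "pope", "film", "music")
toy_topic = np.array([0.1, 0.6, 0.05, 0.25])
print(top_words(toy_topic, toy_vocab, 2))
```

With a fitted model, the same helper could be applied to each row of `model.topic_word_` in place of the slicing expression.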
The document-topic distributions are available in model.doc_topic_.
>>> doc_topic = model.doc_topic_
>>> for i in range(10):
...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 8)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 13)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 14)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 8)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 14)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 14)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 14)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 14)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 14)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 8)
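Each row of `doc_topic_` is a distribution over topics, so `argmax` reports only the single most probable topic. A rough sketch for listing several leading topics per document follows. The array is a made-up stand-in for `model.doc_topic_`, and `leading_topics` is a hypothetical helper, not part of lda.

```python
import numpy as np


def leading_topics(doc_topic, k):
    """Return, for each document, the indices of its k most probable topics."""
    # sort each row in descending order of probability and keep k columns
    return np.argsort(-doc_topic, axis=1)[:, :k]


toy_doc_topic = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
])
# rows of a document-topic matrix sum to one
assert np.allclose(toy_doc_topic.sum(axis=1), 1.0)
print(leading_topics(toy_doc_topic, 2))
```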
Document-topic distributions may be inferred for out-of-sample texts using the transform method:
>>> X = lda.datasets.load_reuters()
>>> titles = lda.datasets.load_reuters_titles()
>>> X_train = X[10:]
>>> X_test = X[:10]
>>> titles_test = titles[:10]
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X_train)
>>> doc_topic_test = model.transform(X_test)
>>> for title, topics in zip(titles_test, doc_topic_test):
...     print("{} (top topic: {})".format(title, topics.argmax()))
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 7)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 11)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 4)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 7)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 4)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 4)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 4)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 4)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 4)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 11)
(Note that the topic numbers have changed due to LDA not being an identifiable model. The phenomenon is known as label switching in the literature.)
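Because topic indices are arbitrary, topics from two fitted models can only be compared after matching them up. One simple approach, sketched here under the assumption that both models share a vocabulary and have the same number of topics, pairs each topic in one model with its most similar unused topic in the other, using cosine similarity between rows of `topic_word_`. The function name `match_topics` and the toy matrices are hypothetical, and a greedy match is not guaranteed to be the best overall assignment.

```python
import numpy as np


def match_topics(topic_word_a, topic_word_b):
    """Greedily pair topics of model A with topics of model B.

    Returns a dict mapping topic index in A to topic index in B.
    Assumes both matrices have the same shape.
    """
    a = topic_word_a / np.linalg.norm(topic_word_a, axis=1, keepdims=True)
    b = topic_word_b / np.linalg.norm(topic_word_b, axis=1, keepdims=True)
    sim = a @ b.T  # cosine similarity between every pair of topics
    mapping = {}
    for _ in range(sim.shape[0]):
        # take the most similar pair that is still unmatched
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        mapping[int(i)] = int(j)
        sim[i, :] = -np.inf
        sim[:, j] = -np.inf
    return mapping


first = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
second = first[::-1]  # same topics, labels swapped
print(match_topics(first, second))
```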
Convergence may be monitored by accessing the loglikelihoods_ attribute on a fitted model. The attribute is bound to a list which records the sequence of log likelihoods associated with the model at different iterations (thinned by the refresh parameter).
(The following code assumes matplotlib is installed.)
>>> import matplotlib.pyplot as plt
>>> # skipping the first few entries makes the graph more readable
>>> plt.plot(model.loglikelihoods_[5:])
Judging convergence from the plot, the model should be fit with a slightly greater number of iterations.
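Since `loglikelihoods_` is thinned by `refresh`, the position of an entry in the list is not its iteration number. The sketch below assumes one value is recorded every `refresh` iterations starting at iteration 0 (an assumption about the bookkeeping, not something stated above) and uses a made-up trace. Both `iteration_axis` and `has_plateaued` are hypothetical helpers, and the plateau test is only a heuristic, since log likelihoods from a sampler fluctuate.

```python
def iteration_axis(n_values, refresh):
    """Iteration numbers for a trace recorded every `refresh` iterations."""
    return [i * refresh for i in range(n_values)]


def has_plateaued(loglikelihoods, window=3, rel_tol=1e-3):
    """True if the last `window` changes are all small relative to the value."""
    if len(loglikelihoods) <= window:
        return False
    recent = loglikelihoods[-(window + 1):]
    return all(
        abs(b - a) <= rel_tol * abs(a)
        for a, b in zip(recent, recent[1:])
    )


# invented values standing in for model.loglikelihoods_
trace = [-1050000.0, -820000.0, -760000.0, -741000.0,
         -740900.0, -740850.0, -740840.0]
print(iteration_axis(len(trace), refresh=10))
print(has_plateaued(trace))
```

The list returned by `iteration_axis` could be passed as the x values to `plt.plot` so that the horizontal axis shows sampling iterations.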
Installing lda¶
lda requires Python (>= 3.6) and NumPy (>= 1.13.0). If these requirements are satisfied, lda should install successfully on Linux and macOS with:
pip install lda
If you encounter problems, consult the platform-specific instructions below.
Mac OS X¶
lda and its dependencies are all available as wheel packages for Mac OS X:
pip install lda
Linux¶
lda and its dependencies are all available as wheel packages for most distributions of Linux:
pip install lda
Windows¶
lda must be built from source on Windows. There are no wheels at this time.
Installation from source¶
Installing from source requires you to have installed the Python development headers and a working C/C++ compiler. Under Debian-based operating systems, which include Ubuntu, you can install all these requirements by issuing:
sudo apt-get install build-essential python3-dev python3-setuptools \
python3-numpy
Before attempting a command such as python setup.py install, you will need to run Cython to generate the relevant C files:
make cython
API Reference¶
This page contains auto-generated API reference documentation [1].
lda¶
Submodules¶
lda._setup_hooks¶
Module Contents¶
sdist_pre_hook(cmdobj) | Ensure Cython has compiled all pyx files to c.
lda._setup_hooks.sdist_pre_hook(cmdobj)¶
Ensure Cython has compiled all pyx files to c.
lda.datasets¶
Module Contents¶
load_reuters()
load_reuters_vocab()
load_reuters_titles()
lda.datasets._test_dir¶
lda.datasets.load_reuters()¶
lda.datasets.load_reuters_vocab()¶
lda.datasets.load_reuters_titles()¶
lda.lda¶
Latent Dirichlet allocation using collapsed Gibbs sampling
Module Contents¶
LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10) | Latent Dirichlet allocation using collapsed Gibbs sampling
lda.lda.logger¶
lda.lda.PY2¶
lda.lda.range¶
class lda.lda.LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10)¶
Latent Dirichlet allocation using collapsed Gibbs sampling
Parameters:
- n_topics : int
  Number of topics
- n_iter : int, default 2000
  Number of sampling iterations
- alpha : float, default 0.1
  Dirichlet parameter for distribution over topics
- eta : float, default 0.01
  Dirichlet parameter for distribution over words
- random_state : int or RandomState, optional
  The generator used for the initial topics.
References
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022.
Griffiths, Thomas L., and Mark Steyvers. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences 101 (2004): 5228–5235. doi:10.1073/pnas.0307752101.
Wallach, Hanna, David Mimno, and Andrew McCallum. “Rethinking LDA: Why Priors Matter.” In Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, 1973–1981, 2009.
Wallach, Hanna M., Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. “Evaluation Methods for Topic Models.” In Proceedings of the 26th Annual International Conference on Machine Learning, 1105–1112. ICML ’09. New York, NY, USA: ACM. https://doi.org/10.1145/1553374.1553515.
Buntine, Wray. “Estimating Likelihoods for Topic Models.” In Advances in Machine Learning, First Asian Conference on Machine Learning (2009): 51–64. doi:10.1007/978-3-642-05224-8_6.
Examples
>>> import numpy
>>> X = numpy.array([[1,1], [2, 1], [3, 1], [4, 1], [5, 8], [6, 1]])
>>> import lda
>>> model = lda.LDA(n_topics=2, random_state=0, n_iter=100)
>>> model.fit(X) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
LDA(alpha=...
>>> model.components_
array([[ 0.85714286,  0.14285714],
       [ 0.45      ,  0.55      ]])
>>> model.loglikelihood() #doctest: +ELLIPSIS
-40.395...
Attributes:
- components_ : array, shape = [n_topics, n_features]
  Point estimate of the topic-word distributions (Phi in literature)
- topic_word_ :
  Alias for components_
- nzw_ : array, shape = [n_topics, n_features]
  Matrix of counts recording topic-word assignments in final iteration.
- ndz_ : array, shape = [n_samples, n_topics]
  Matrix of counts recording document-topic assignments in final iteration.
- doc_topic_ : array, shape = [n_samples, n_topics]
  Point estimate of the document-topic distributions (Theta in literature)
- nz_ : array, shape = [n_topics]
  Array of topic assignment counts in final iteration.
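The count matrices and the point estimates are closely related. A common way to obtain smoothed point estimates in collapsed Gibbs sampling is to add the Dirichlet parameter to each count and normalize each row. The sketch below assumes that convention; the exact computation used by lda is not spelled out here. The counts are invented and `point_estimates` is a hypothetical helper.

```python
import numpy as np


def point_estimates(nzw, ndz, alpha, eta):
    """Smoothed topic-word (Phi) and document-topic (Theta) estimates."""
    topic_word = nzw + eta
    topic_word = topic_word / topic_word.sum(axis=1, keepdims=True)
    doc_topic = ndz + alpha
    doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)
    return topic_word, doc_topic


# toy counts: 2 topics, 3 vocabulary terms, 2 documents
nzw = np.array([[4, 1, 0], [0, 2, 3]])
ndz = np.array([[5, 1], [0, 4]])
phi, theta = point_estimates(nzw, ndz, alpha=0.1, eta=0.01)
# phi is (n_topics, n_features); theta is (n_samples, n_topics)
print(phi.shape, theta.shape)
```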
fit(self, X, y=None)¶
Fit the model with X.
Parameters:
- X : array-like, shape (n_samples, n_features)
  Training data, where n_samples is the number of samples and n_features is the number of features. Sparse matrix allowed.
Returns:
- self : object
  Returns the instance itself.
fit_transform(self, X, y=None)¶
Apply dimensionality reduction on X
Parameters:
- X : array-like, shape (n_samples, n_features)
  New data, where n_samples is the number of samples and n_features is the number of features. Sparse matrix allowed.
Returns:
- doc_topic : array-like, shape (n_samples, n_topics)
  Point estimate of the document-topic distributions
transform(self, X, max_iter=20, tol=1e-16)¶
Transform the data X according to previously fitted model
Parameters:
- X : array-like, shape (n_samples, n_features)
  New data, where n_samples is the number of samples and n_features is the number of features.
- max_iter : int, optional
  Maximum number of iterations in iterated-pseudocount estimation.
- tol : double, optional
  Tolerance value used in stopping condition.
Returns:
- doc_topic : array-like, shape (n_samples, n_topics)
  Point estimate of the document-topic distributions
_transform_single(self, doc, max_iter, tol)¶
Transform a single document according to the previously fit model
Parameters:
- doc : 1D numpy array of integers
  Each element represents a word in the document
- max_iter : int
  Maximum number of iterations in iterated-pseudocount estimation.
- tol : double
  Tolerance value used in stopping condition.
Returns:
- doc_topic : 1D numpy array of length n_topics
  Point estimate of the topic distributions for document
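Iterated pseudo-count estimation (described in the Wallach et al. evaluation paper listed in the references) can be sketched as follows. This is an illustrative re-implementation assuming a symmetric `alpha` and a non-empty document, not the package's own code. The name `infer_doc_topic` is hypothetical and the topic-word matrix is a toy.

```python
import numpy as np


def infer_doc_topic(doc, topic_word, alpha, max_iter=20, tol=1e-16):
    """Estimate the topic distribution of one document.

    doc is a 1D integer array with one vocabulary index per word token.
    """
    n_topics = topic_word.shape[0]
    # pzs[i, k]: current probability that token i belongs to topic k
    pzs = np.zeros((len(doc), n_topics))
    for _ in range(max_iter + 1):
        # pseudo-counts contributed by all other tokens, plus the prior
        pseudo = pzs.sum(axis=0) - pzs + alpha
        pzs_new = topic_word[:, doc].T * pseudo
        pzs_new /= pzs_new.sum(axis=1, keepdims=True)
        delta = np.abs(pzs_new - pzs).sum()
        pzs = pzs_new
        if delta < tol:
            break
    return pzs.sum(axis=0) / pzs.sum()


toy_topic_word = np.array([[0.9, 0.1], [0.1, 0.9]])
theta = infer_doc_topic(np.array([0, 0, 0]), toy_topic_word, alpha=0.1)
print(theta.argmax())
```

In the toy case every token is word 0, which topic 0 strongly prefers, so the estimate concentrates on topic 0.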
_fit(self, X)¶
Fit the model to the data X
Parameters:
- X : array-like, shape (n_samples, n_features)
  Training vector, where n_samples is the number of samples and n_features is the number of features. Sparse matrix allowed.
_initialize(self, X)¶
loglikelihood(self)¶
Calculate complete log likelihood, log p(w,z)
Formula used is log p(w,z) = log p(w|z) + log p(z)
_sample_topics(self, rands)¶
Samples all topic assignments. Called once per iteration.
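For symmetric Dirichlet priors, both terms of the log likelihood formula log p(w,z) = log p(w|z) + log p(z) have closed forms in terms of the count matrices (see Griffiths and Steyvers in the references). The following sketch evaluates them with `math.lgamma`. It is an illustration under that assumption rather than the package's implementation, and `complete_loglikelihood` is a hypothetical name.

```python
import math

import numpy as np

lgamma = np.vectorize(math.lgamma)


def complete_loglikelihood(nzw, ndz, alpha, eta):
    """log p(w, z) = log p(w | z) + log p(z) for symmetric alpha and eta."""
    nzw = np.asarray(nzw, dtype=float)
    ndz = np.asarray(ndz, dtype=float)
    n_topics, vocab_size = nzw.shape
    n_docs = ndz.shape[0]

    # log p(w | z): one Dirichlet-multinomial term per topic
    ll = n_topics * (math.lgamma(vocab_size * eta)
                     - vocab_size * math.lgamma(eta))
    ll += lgamma(nzw + eta).sum()
    ll -= lgamma(nzw.sum(axis=1) + vocab_size * eta).sum()

    # log p(z): one Dirichlet-multinomial term per document
    ll += n_docs * (math.lgamma(n_topics * alpha)
                    - n_topics * math.lgamma(alpha))
    ll += lgamma(ndz + alpha).sum()
    ll -= lgamma(ndz.sum(axis=1) + n_topics * alpha).sum()
    return float(ll)


# one topic, two vocabulary terms, one document holding a single word:
# with eta = 1 the word has probability 1/2, so the result equals log(0.5)
print(complete_loglikelihood([[1, 0]], [[1]], alpha=1.0, eta=1.0))
```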
lda.utils¶
Module Contents¶
check_random_state(seed) |
matrix_to_lists(doc_word) | Convert a (sparse) matrix of counts into arrays of word and doc indices
lists_to_matrix(WS, DS) | Convert array of word (or topic) and document indices to doc-term array
dtm2ldac(dtm, offset=0) | Convert a document-term matrix into an LDA-C formatted file
ldac2dtm(stream, offset=0) | Convert an LDA-C formatted file to a document-term array
lda.utils.PY2¶
lda.utils.zip¶
lda.utils.logger¶
lda.utils.check_random_state(seed)¶
lda.utils.matrix_to_lists(doc_word)¶
Convert a (sparse) matrix of counts into arrays of word and doc indices
Parameters:
- doc_word : array or sparse matrix (D, V)
  document-term matrix of counts
Returns:
- (WS, DS) : tuple of two arrays
  WS[k] contains the kth word in the corpus. DS[k] contains the document index for the kth word.
lda.utils.lists_to_matrix(WS, DS)¶
Convert array of word (or topic) and document indices to doc-term array
Parameters:
- (WS, DS) : tuple of two arrays
  WS[k] contains the kth word in the corpus. DS[k] contains the document index for the kth word.
Returns:
- doc_word : array (D, V)
  document-term array of counts
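To make the (WS, DS) representation concrete, here is a small NumPy sketch of the round trip for a dense matrix. It illustrates the data layout described above and is not the package's implementation. The names `to_lists` and `to_matrix` are hypothetical, and the ordering of tokens within a document is an assumption.

```python
import numpy as np


def to_lists(doc_word):
    """Expand a dense count matrix into per-token word and document indices."""
    doc_idx, word_idx = np.nonzero(doc_word)
    counts = doc_word[doc_idx, word_idx]
    # repeat each (document, word) pair once per occurrence
    return np.repeat(word_idx, counts), np.repeat(doc_idx, counts)


def to_matrix(WS, DS):
    """Rebuild the document-term count matrix from token-level indices."""
    doc_word = np.zeros((DS.max() + 1, WS.max() + 1), dtype=int)
    # np.add.at accumulates correctly when an index pair occurs repeatedly
    np.add.at(doc_word, (DS, WS), 1)
    return doc_word


dtm = np.array([[2, 0, 1], [0, 1, 1]])
WS, DS = to_lists(dtm)
print(WS, DS)
```

In this sketch the shape is inferred from the largest indices, so trailing empty documents or unused vocabulary terms would not survive the round trip.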
lda.utils.dtm2ldac(dtm, offset=0)¶
Convert a document-term matrix into an LDA-C formatted file
Parameters:
- dtm : array of shape N,V
Returns:
- doclines : iterable of LDA-C lines suitable for writing to file
Notes
If a format similar to SVMLight is desired, offset of 1 may be used.
lda.utils.ldac2dtm(stream, offset=0)¶
Convert an LDA-C formatted file to a document-term array
Parameters:
- stream : file object
  File yielding unicode strings in LDA-C format.
Returns:
- dtm : array of shape N,V
Notes
If a format similar to SVMLight is the source, an offset of 1 may be used.
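In the LDA-C format, each document is one line: the number of distinct terms in the document, followed by `term_index:count` pairs. The sketch below shows the conversion in both directions using plain Python lists. It is a simplified illustration, not the package's code: `to_ldac_lines` and `from_ldac_lines` are hypothetical names, and the `n_terms` argument is an assumption made to keep the parser short.

```python
def to_ldac_lines(dtm, offset=0):
    """Yield one LDA-C line per document of a dense count matrix."""
    for row in dtm:
        pairs = [(j + offset, int(n)) for j, n in enumerate(row) if n > 0]
        fields = [str(len(pairs))] + ["{}:{}".format(j, n) for j, n in pairs]
        yield " ".join(fields)


def from_ldac_lines(lines, n_terms, offset=0):
    """Parse LDA-C lines back into a list-of-lists count matrix."""
    dtm = []
    for line in lines:
        row = [0] * n_terms
        # the first field is the number of distinct terms; the rest are pairs
        for field in line.split()[1:]:
            term, count = field.split(":")
            row[int(term) - offset] = int(count)
        dtm.append(row)
    return dtm


toy_dtm = [[2, 0, 1], [0, 1, 0]]
lines = list(to_ldac_lines(toy_dtm))
print(lines)
```

Passing `offset=1` in both directions shifts the term indices to start at one, which is the SVMLight-like variant mentioned in the notes.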
Package Contents¶
Classes¶
LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10) | Latent Dirichlet allocation using collapsed Gibbs sampling
class lda.LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10)¶
Latent Dirichlet allocation using collapsed Gibbs sampling
lda.__version__¶
[1] Created with sphinx-autoapi
Contributing¶
Style Guidelines¶
Before contributing a patch, please read the Python “Style Commandments” written by the OpenStack developers: http://docs.openstack.org/developer/hacking/
Building in Develop Mode¶
To build in develop mode on OS X, first install Cython and pbr. Then run:
git clone https://github.com/lda-project/lda.git
cd lda
make cython
python setup.py develop
What’s New¶
v2.0.0 (17. August 2020)¶
- Drop support for Python 2.7
- Wheels for Python 3.8
v1.1.0 (9. September 2018)¶
- Wheels for Python 3.7
- Minimum required NumPy version is 1.13.0.
- Major speed increase in data loading. Thanks @luoshao23.
- Bugfix in Cython searchsorted function. Thanks @luoshao23.
v1.0.5 (18. June 2017)¶
- Wheels for Python 3.6
v1.0.4 (13. July 2016)¶
- Linux wheels (manylinux1)
v1.0.3 (5. Nov 2015)¶
- Python 3.5 wheels
- Release GIL during sampling
- Many minor fixes