Limix’s documentation
Install
Pip
pip install limix
Conda
conda install -c conda-forge limix
I/O module
BIMBAM
>>> import limix
>>>
>>> url = "http://rest.s3for.me/limix/example/phenotype.gemma"
>>> filepath = limix.sh.download(url, verbose=False)
>>> print(limix.io.bimbam.read_phenotype(filepath, verbose=False))
trait 0 1 2
sample
0 1.20000 -0.30000 -1.50000
1 nan 1.50000 0.30000
2 2.70000 1.10000 nan
3 -0.20000 -0.70000 0.80000
4 3.30000 2.40000 2.10000
>>> limix.sh.remove(filepath)
BGEN reader
>>> url = "http://rest.s3for.me/bgen-reader/haplotypes.bgen"
>>> filepath = limix.sh.download(url, verbose=False)
>>>
>>> data = limix.io.bgen.read(filepath, verbose=False)
>>> print(data.keys())
dict_keys(['variants', 'samples', 'genotype'])
>>> print(data["variants"].head(4))
id rsid chrom pos nalleles allele_ids vaddr
0 SNP1 RS1 1 1 2 A,G 102
1 SNP2 RS2 1 2 2 A,G 159
2 SNP3 RS3 1 3 2 A,G 216
3 SNP4 RS4 1 4 2 A,G 273
>>> print(data["samples"])
0 sample_0
1 sample_1
2 sample_2
3 sample_3
Name: id, dtype: object
>>> geno = data["genotype"][0].compute()
>>> print(geno.keys())
dict_keys(['probs', 'phased', 'ploidy', 'missing'])
>>> print(geno["probs"])
[[1. 0. 1. 0.]
[0. 1. 1. 0.]
[1. 0. 0. 1.]
[0. 1. 0. 1.]]
>>> limix.sh.remove(filepath)
>>> limix.sh.remove(filepath + ".metadata")
CSV reader
>>> url = "http://rest.s3for.me/limix/expr.csv"
>>> filepath = limix.sh.download(url, verbose=False)
>>> data = limix.io.csv.read(filepath, verbose=False)
>>> print(data.head())
HG00111 HG00112 HG00116 HG00121 HG00133 HG00135 HG00142 HG00143 \
gene1 -3.75235 -0.42113 -0.53629 -0.90768 -0.25189 -0.60300 -0.31069 0.28849
gene2 -0.35145 1.28258 -2.83505 0.32953 -0.50711 -0.81912 -1.37971 0.26906
gene3 -1.31997 1.08197 0.28400 -0.41318 0.14609 -0.14714 0.30255 0.69654
gene4 -0.75163 0.37668 -0.23564 1.06111 0.58524 0.60962 -2.02384 -1.29969
gene5 0.06464 0.64204 -0.81127 -1.42806 -0.89599 -0.01391 0.34385 -0.48492
HG00151 HG00152 HG00154 HG00159 HG00160 HG00171 HG00173 HG00179 \
gene1 -1.72944 -1.69063 0.23706 -0.70690 0.31554 0.41398 1.53933 1.64413
gene2 -0.98741 -1.38205 -1.49273 1.43818 0.60203 0.72495 1.33730 -0.54032
gene3 0.92997 -0.97183 0.73781 0.13841 -0.27796 -0.30850 -1.37364 0.02908
gene4 0.74860 0.35024 0.46494 0.26519 -1.04980 -0.10405 0.24636 0.39698
gene5 -1.56242 -0.69343 -0.67140 -0.97220 0.51523 0.84428 0.37633 0.20097
HG00189 HG00190 HG00232 HG00233 HG00239 HG00245 HG00253 HG00257 \
gene1 0.65387 0.68840 -0.31933 -0.05787 0.18941 -0.96866 0.95566 0.09096
gene2 0.23202 0.62984 0.36502 -0.54166 -0.17954 1.56841 0.41138 0.85125
gene3 0.38357 1.53011 1.03765 1.66079 0.47459 -0.61217 -0.75705 -1.23464
gene4 0.22995 -0.74410 0.13555 1.11415 -0.27491 -0.63577 0.65697 -0.78026
gene5 0.22179 -0.88524 -0.90036 -0.30308 -0.47617 -0.65303 0.31526 0.21940
HG00263 HG00274 HG00281 HG00284 HG00309 HG00318 HG00319 HG00330 \
gene1 -0.84906 -0.38422 -0.20839 1.76988 -0.08126 -0.11885 0.06195 -0.39166
gene2 -0.68103 1.49606 0.03634 0.30436 -0.81669 -0.41680 -0.68541 1.05676
gene3 -0.39850 -0.04262 0.58753 -1.79663 0.06896 0.26583 -0.09755 1.30798
gene4 -0.03511 1.17201 -0.60239 -0.04585 1.52125 -0.01333 -1.03177 -1.04258
gene5 -1.44331 1.77727 0.03294 -0.90860 2.01101 1.30201 1.47212 -1.02498
HG00331 HG00332 HG00343 HG00344 HG00351 HG00357 HG00369 ... NA19819 \
gene1 1.19368 -2.60319 -0.00044 0.26917 0.35906 0.32333 -0.58944 ... 1.14834
gene2 -1.35361 -1.17153 -0.75354 1.08028 0.30128 1.19826 2.25417 ... 0.70334
gene3 0.84178 0.28865 -1.17596 -0.16204 0.24943 1.08654 -0.82590 ... -2.01485
gene4 0.63385 -1.60319 0.49055 0.53223 -0.00883 -1.39597 -0.24307 ... 0.09122
gene5 -0.48581 -1.59411 -1.25390 -0.78523 2.15585 1.15443 0.16866 ... 0.64129
NA19834 NA19908 NA19920 NA20127 NA20278 NA20281 NA20287 NA20291 \
gene1 -0.33572 -0.48395 0.28840 -0.19132 1.46027 -0.49558 -0.05913 -0.67027
gene2 -1.12835 -1.56760 -1.53615 1.30383 -0.83380 0.08954 1.35263 -0.76398
gene3 -0.65052 1.78800 0.54407 -0.19086 -0.54339 0.00340 0.93919 2.35539
gene4 -1.28366 0.49670 0.51998 0.01628 0.65825 1.42326 0.75003 -0.46688
gene5 -0.38025 -0.99212 0.86387 -1.30035 0.36494 -0.74369 0.97632 -1.41972
NA20314 NA20334 NA20339 NA20341 NA20344 NA20348 NA20357 NA20359 \
gene1 1.14834 0.03650 -0.39326 0.28873 -0.82685 -0.33570 0.35864 -0.88629
gene2 -0.73860 0.97905 -0.03794 0.52208 0.15237 0.05513 -0.43515 1.78638
gene3 -0.24083 -0.03304 -0.02394 0.63280 -0.14199 0.21436 0.01104 0.19409
gene4 1.02439 0.44723 -1.26319 0.04781 0.42133 -2.11834 0.59976 -0.45331
gene5 -0.68538 0.98479 0.58614 1.32101 -1.28100 -0.24468 -1.50173 -0.23772
NA20412 NA20414 NA20505 NA20507 NA20508 NA20517 NA20518 NA20521 \
gene1 -0.27868 -0.88236 0.85370 0.12215 -0.22804 0.11440 1.15530 -0.64616
gene2 -0.13907 -1.79076 -1.06388 1.84062 0.81188 0.74989 1.13613 -0.67296
gene3 0.20491 -1.09471 -0.41158 -0.19403 0.62979 1.45114 -0.68828 -0.21306
gene4 2.07913 0.64254 1.19309 1.36727 1.36291 -0.20257 -1.91187 -0.80394
gene5 0.55915 -0.70109 1.10276 0.19700 -1.00590 -0.19778 0.70381 1.32831
NA20525 NA20527 NA20534 NA20537 NA20581 NA20582 NA20753 NA20754 \
gene1 -1.00890 0.33830 -1.18606 -2.50144 1.11857 -1.35514 -0.45410 -1.40787
gene2 -0.52930 -0.08037 1.49958 0.48022 1.90088 1.19142 -1.07944 1.06962
gene3 1.28033 -1.03548 0.30717 -0.60877 0.14828 -0.02566 0.68297 -1.41462
gene4 0.84287 2.31155 -0.45076 0.27237 0.25196 0.13814 -1.58961 -0.61954
gene5 -0.13948 -0.59769 1.28226 0.56941 -3.31790 0.63016 0.17360 2.28942
NA20768 NA20771 NA20772 NA20774 NA20775 NA20804
gene1 0.74605 -1.92301 0.52952 0.59285 -0.25449 -0.42643
gene2 0.21579 0.84464 0.72602 0.17902 -1.18471 1.22427
gene3 0.02000 -0.00026 -0.43102 1.03429 -2.04323 -0.61358
gene4 0.20742 -1.52664 -0.02818 0.29817 0.01488 0.26651
gene5 -0.82307 -1.15892 -1.14967 -0.13977 0.84840 0.99755
[5 rows x 274 columns]
>>> limix.sh.remove(filepath)
HDF5 reader
>>> url = "http://rest.s3for.me/limix/smith08.hdf5.bz2"
>>> filepath = limix.sh.download(url, verbose=False)
>>> filepath = limix.sh.extract(filepath, verbose=False)
>>> data = limix.io.hdf5.read_limix(filepath)
>>> print(data)
{'phenotype': <xarray.DataArray 'phenotype' (sample: 109, outcome: 10986)>
array([[-0.037339, -0.078165, 0.042936, ..., 0.095596, -0.132385, -0.274954],
[-0.301376, 0.066055, 0.338624, ..., -0.142661, -0.238349, 0.732752],
[ 0.002661, 0.121835, -0.137064, ..., -0.144404, 0.257615, 0.015046],
...,
[-0.287339, 0.351835, 0.072936, ..., 0.097339, -0.038349, 0.162752],
[-0.577339, 0.011835, -0.007064, ..., 0.135596, 0.107615, 0.245046],
[-0.277339, 0.061835, 0.132936, ..., 0.015596, -0.142385, -0.124954]])
Coordinates:
* sample (sample) int64 0 1 2 3 4 5 6 7 ... 102 103 104 105 106 107 108
environment (outcome) float64 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
gene_ID (outcome) object 'YOL161C' 'YJR107W' ... 'YLR118C' 'YBR242W'
gene_chrom (outcome) object '15' '10' '16' '7' '4' ... '3' '10' '12' '2'
gene_end (outcome) int64 11548 628319 32803 ... 315049 384726 705381
gene_start (outcome) int64 11910 627333 30482 ... 315552 385409 704665
gene_strand (outcome) object 'C' 'W' 'W' 'W' 'W' ... 'W' 'W' 'C' 'C' 'W'
phenotype_ID (outcome) object 'YOL161C:0' 'YJR107W:0' ... 'YBR242W:1'
Dimensions without coordinates: outcome, 'genotype': <xarray.DataArray 'genotype' (sample: 109, candidate: 2956)>
array([[1., 1., 1., ..., 0., 0., 0.],
[1., 0., 1., ..., 1., 1., 1.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 1., 1.],
[0., 0., 0., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]])
Coordinates:
* sample (sample) int64 0 1 2 3 4 5 6 7 ... 101 102 103 104 105 106 107 108
chrom (candidate) int64 1 1 1 1 1 1 1 1 1 ... 16 16 16 16 16 16 16 16 16
pos (candidate) int64 483 484 3220 3223 ... 927506 932310 932535 932538
pos_cum (candidate) int64 483 484 3220 3223 ... 12055570 12055795 12055798
Dimensions without coordinates: candidate}
>>> limix.sh.remove(filepath)
NumPy reader
>>> url = "http://rest.s3for.me/limix/example.npy"
>>> filepath = limix.sh.download(url, verbose=False)
>>> K = limix.io.npy.read(filepath, verbose=False)
>>> print(K)
[[0.67003303 0.09512837 0.09346511 0.09252165 0.09679249]
[0.09512837 0.66972454 0.09344451 0.09109398 0.09347495]
[0.09346511 0.09344451 0.67305621 0.08987969 0.09689215]
[0.09252165 0.09109398 0.08987969 0.67209248 0.09378162]
[0.09679249 0.09347495 0.09689215 0.09378162 0.66773896]]
>>> limix.sh.remove(filepath)
PLINK reader
>>> from os.path import join
>>> from pandas_plink import get_data_folder
>>>
>>> (bim, fam, bed) = limix.io.plink.read(join(get_data_folder(), "data"),
... verbose=False)
>>> print(bim.head())
chrom snp cm pos a0 a1 i
candidate
rs10399749 1 rs10399749 0.00000 45162 G C 0
rs2949420 1 rs2949420 0.00000 45257 C T 1
rs2949421 1 rs2949421 0.00000 45413 0 0 2
rs2691310 1 rs2691310 0.00000 46844 A T 3
rs4030303 1 rs4030303 0.00000 72434 0 G 4
>>> print(fam.head())
fid iid father mother gender trait i
sample
Sample_1 Sample_1 Sample_1 0 0 1 -9.00000 0
Sample_2 Sample_2 Sample_2 0 0 2 -9.00000 1
Sample_3 Sample_3 Sample_3 Sample_1 Sample_2 2 -9.00000 2
>>> print(bed.compute())
[[ 2. 2. 1.]
[ 2. 1. 2.]
[nan nan nan]
[nan nan 1.]
[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 1. 0.]
[ 2. 2. 2.]
[ 1. 2. 2.]
[ 2. 1. 2.]]
Quality control
Box-Cox
- limix.qc.boxcox(x)
Box-Cox transformation for normality conformance.
It applies the power transformation
\[\begin{split}f(x) = \begin{cases} \frac{x^{\lambda} - 1}{\lambda}, & \text{if } \lambda > 0; \\ \log(x), & \text{if } \lambda = 0. \end{cases}\end{split}\]
to the provided data, with the aim of making it more closely resemble a normal distribution. The λ parameter is fit by maximum likelihood estimation.
- Parameters
x (array_like) – Data to be transformed.
- Returns
boxcox – Box-Cox transformed data.
- Return type
ndarray
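The transformation itself is short to write down. Below is a minimal NumPy sketch for a fixed λ; note that `limix.qc.boxcox` fits λ by maximum likelihood, which is omitted here, so `boxcox_fixed` is a hypothetical helper for illustration only.

```python
import numpy as np

def boxcox_fixed(x, lmbda):
    """Apply the Box-Cox power transform for a given, fixed λ.

    limix.qc.boxcox fits λ by maximum likelihood; here λ is supplied
    directly to keep the sketch short. Requires x > 0.
    """
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1) / lmbda

x = np.array([0.5, 1.0, 2.0, 4.0])
print(boxcox_fixed(x, 0.0))  # equals log(x)
print(boxcox_fixed(x, 1.0))  # equals x - 1
```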
Dependent columns
- limix.qc.remove_dependent_cols(X, tol=1.49e-08)
Remove dependent columns.
Return a matrix with dependent columns removed.
- Parameters
X (array_like) – Matrix that might have dependent columns.
tol (float, optional) – Tolerance used to decide whether a column is dependent on the others.
- Returns
rank – Full column rank matrix.
- Return type
ndarray
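A rough sketch of what such a function achieves: greedily keep only the columns that increase the matrix rank. This is an illustration of the idea, not limix's actual implementation, and `drop_dependent_cols` is a hypothetical name.

```python
import numpy as np

def drop_dependent_cols(X, tol=1.49e-08):
    """Keep columns that increase the matrix rank; drop the rest.

    A sketch of the effect of limix.qc.remove_dependent_cols (the
    actual implementation may differ); `tol` is forwarded to the
    rank computation.
    """
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        cand = X[:, keep + [j]]
        if np.linalg.matrix_rank(cand, tol=tol) == len(keep) + 1:
            keep.append(j)
    return X[:, keep]

X = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 1.0], [3.0, 6.0, 2.0]])
# Column 1 is 2x column 0, so it is removed.
print(drop_dependent_cols(X).shape)  # (3, 2)
```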
- limix.qc.unique_variants(X)
Filter out variants with the same genetic profile.
- Parameters
X (array_like) – Samples-by-variants matrix of genotype values.
- Returns
genotype – Genotype array with unique variants.
- Return type
ndarray
Example
>>> from numpy.random import RandomState
>>> from numpy import kron, ones, sort
>>> from limix.qc import unique_variants
>>> random = RandomState(1)
>>>
>>> N = 4
>>> X = kron(random.randn(N, 3) < 0, ones((1, 2)))
>>>
>>> print(X)
[[0. 0. 1. 1. 1. 1.]
 [1. 1. 0. 0. 1. 1.]
 [0. 0. 1. 1. 0. 0.]
 [1. 1. 0. 0. 1. 1.]]
>>>
>>> print(unique_variants(X))
[[0. 1. 1.]
 [1. 1. 0.]
 [0. 0. 1.]
 [1. 1. 0.]]
Genotype
- limix.qc.indep_pairwise(X, window_size, step_size, threshold, verbose=True)
Determine pair-wise independent variants.
Independent variants are defined via squared Pearson correlations between pairs of variants inside a sliding window.
- Parameters
X (array_like) – Samples-by-variants matrix.
window_size (int) – Number of variants inside each window.
step_size (int) – Number of variants the sliding window skips.
threshold (float) – Squared Pearson correlation threshold for independence.
verbose (bool) – True for progress information; False otherwise.
- Returns
ok – Boolean array flagging the independent variants.
- Return type
ndarray
Example
>>> from numpy.random import RandomState
>>> from limix.qc import indep_pairwise
>>>
>>> random = RandomState(0)
>>> X = random.randn(10, 20)
>>>
>>> indep_pairwise(X, 4, 2, 0.5, verbose=False)
array([ True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])
- limix.qc.compute_maf(X)
Compute minor allele frequencies.
It assumes that X encodes 0, 1, and 2, representing the number of alleles (or dosage), or NaN to represent missing values.
- Parameters
X (array_like) – Genotype matrix.
- Returns
Minor allele frequencies.
- Return type
array_like
Examples
>>> from numpy.random import RandomState
>>> from limix.qc import compute_maf
>>>
>>> random = RandomState(0)
>>> X = random.randint(0, 3, size=(100, 10))
>>>
>>> print(compute_maf(X))
[0.49 0.49 0.445 0.495 0.5 0.45 0.48 0.48 0.47 0.435]
Impute
- limix.qc.mean_impute(X, axis=-1, inplace=False)
Impute NaN values.
It defaults to column-wise imputation.
- Parameters
- Returns
Imputed array.
- Return type
ndarray
Examples
>>> from numpy.random import RandomState
>>> from numpy import nan, array_str
>>> from limix.qc import mean_impute
>>>
>>> random = RandomState(0)
>>> X = random.randn(5, 2)
>>> X[0, 0] = nan
>>>
>>> print(array_str(mean_impute(X), precision=4))
[[ 0.9233  0.4002]
 [ 0.9787  2.2409]
 [ 1.8676 -0.9773]
 [ 0.9501 -0.1514]
 [-0.1032  0.4106]]
- limix.qc.count_missingness(X)
Count the number of missing values per column.
- Parameters
X (array_like) – Matrix.
- Returns
count – Number of missing values per column.
- Return type
ndarray
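Counting missing values per column amounts to a NaN count along the sample axis; a minimal sketch of the idea (not limix's actual implementation):

```python
import numpy as np

# Count NaNs per column, equivalent in spirit to
# limix.qc.count_missingness.
X = np.array([[1.0, np.nan], [np.nan, 2.0], [3.0, np.nan]])
count = np.isnan(X).sum(axis=0)
print(count)  # [1 2]
```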
Kinship
- limix.qc.normalise_covariance(K, out=None)
Variance rescaling of covariance matrix 𝙺.
Let n be the number of rows (or columns) of 𝙺 and let mᵢ be the average of the values in the i-th column. Gower rescaling is defined as
\[𝙺(n - 1)/(𝚝𝚛𝚊𝚌𝚎(𝙺) - ∑mᵢ).\]Notes
The reasoning of the scaling is as follows. Let 𝐠 be a vector of n independent samples and let 𝙲 be the Gower’s centering matrix. The unbiased variance estimator is
\[v = ∑ (gᵢ-ḡ)²/(n-1) = 𝚝𝚛𝚊𝚌𝚎((𝐠-ḡ𝟏)ᵀ(𝐠-ḡ𝟏))/(n-1) = 𝚝𝚛𝚊𝚌𝚎(𝙲𝐠𝐠ᵀ𝙲)/(n-1)\]Let 𝙺 be the covariance matrix of 𝐠. The expectation of the unbiased variance estimator is
\[𝐄[v] = 𝚝𝚛𝚊𝚌𝚎(𝙲𝐄[𝐠𝐠ᵀ]𝙲)/(n-1) = 𝚝𝚛𝚊𝚌𝚎(𝙲𝙺𝙲)/(n-1),\]assuming that 𝐄[gᵢ]=0. We thus divide 𝙺 by 𝐄[v] to achieve an unbiased normalisation on the random variable gᵢ.
- Parameters
K (array_like) – Covariance matrix to be normalised.
out (array_like, optional) – Result destination. Defaults to None.
Examples
>>> from numpy import dot, mean, zeros
>>> from numpy.random import RandomState
>>> from limix.qc import normalise_covariance
>>>
>>> random = RandomState(0)
>>> X = random.randn(10, 10)
>>> K = dot(X, X.T)
>>> Z = random.multivariate_normal(zeros(10), K, 500)
>>> print("%.3f" % mean(Z.var(1, ddof=1)))
9.824
>>> Kn = normalise_covariance(K)
>>> Zn = random.multivariate_normal(zeros(10), Kn, 500)
>>> print("%.3f" % mean(Zn.var(1, ddof=1)))
1.008
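The identity used in the Notes, 𝚝𝚛𝚊𝚌𝚎(𝙲𝙺𝙲) = 𝚝𝚛𝚊𝚌𝚎(𝙺) − ∑mᵢ, can be checked numerically with plain NumPy:

```python
import numpy as np

# Numerical check of the identity behind Gower rescaling:
# trace(C K C) = trace(K) - sum of column means of K,
# where C = I - (1/n) 11ᵀ is Gower's centering matrix.
rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
K = A @ A.T  # a covariance matrix

C = np.eye(n) - np.ones((n, n)) / n
lhs = np.trace(C @ K @ C)
rhs = np.trace(K) - K.mean(axis=0).sum()
print(np.isclose(lhs, rhs))  # True
```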
Normalisation
- limix.qc.mean_standardize(X, axis=-1, inplace=False)
Zero-mean and one-deviation normalisation.
Normalise in such a way that the mean and variance become zero and one, respectively. This transformation is taken over the flattened array by default, otherwise over the specified axis. Missing values represented by NaN are ignored.
- Parameters
- Returns
X – Normalized array.
- Return type
ndarray
Example
>>> import limix
>>> from numpy import arange
>>>
>>> X = arange(15).reshape((5, 3)).astype(float)
>>> print(X)
[[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]
 [ 9. 10. 11.]
 [12. 13. 14.]]
>>> X = arange(6).reshape((2, 3)).astype(float)
>>> X = limix.qc.mean_standardize(X, axis=0)
>>> print(X)
[[-1.22474487  0.          1.22474487]
 [-1.22474487  0.          1.22474487]]
- limix.qc.quantile_gaussianize(X, axis=1, inplace=False)
Normalize a sequence of values via rank and Normal c.d.f.
It defaults to column-wise normalization.
- Parameters
- Returns
Gaussian-normalized values.
- Return type
array_like
Examples
>>> from limix.qc import quantile_gaussianize
>>> from numpy import array_str
>>>
>>> qg = quantile_gaussianize([-1, 0, 2])
>>> print(qg)
[-0.67448975  0.          0.67448975]
Quantitative trait locus
Introduction
Every genetic model considered here is an instance of a generalized linear mixed model (GLMM). It consists of four main components [St16]:
A linear predictor, 𝐳 = M𝛂 + 𝚇𝐮.
The distribution of the random effects, 𝐮 ∼ 𝓝(𝟎, Σ).
The residual distribution, yᵢ | 𝐮.
The link function, 𝜇ᵢ = g(zᵢ).
The term 𝜇ᵢ represents the mean of yᵢ conditioned on 𝐮: 𝜇ᵢ = 𝐄[yᵢ | 𝐮].
The role of the link function is to scale the domain of zᵢ, which ranges from -∞ to +∞, to the residual distribution parameter 𝜇ᵢ. For example, the mean of a Bernoulli distribution is bounded within [0, 1], and therefore requires a link function to translate values of zᵢ into values of 𝜇ᵢ.
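As a small illustration, the logistic function, the inverse of the logit link commonly paired with Bernoulli residuals, performs exactly this scaling (a sketch; limix handles links internally):

```python
import math

# The logistic function g⁻¹(z) = 1/(1 + e⁻ᶻ) maps any zᵢ in (-∞, +∞)
# to a valid Bernoulli mean 𝜇ᵢ in (0, 1).
def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5.0, 0.0, 5.0):
    print(round(logistic(z), 4))
```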
The distribution of the outcome, conditioned on the random effects, has to be one from the exponential family [Ef18] having mean 𝜇ᵢ: yᵢ | 𝐮 ∼ ExpFam(𝜇ᵢ).
A notable instance of the above model is the linear mixed model (LMM). It consists of the identity link function, 𝜇ᵢ = g(zᵢ) = zᵢ, and of normally distributed residuals, yᵢ | 𝐮 ∼ 𝓝(𝜇ᵢ, 𝜎ᵢ²) [Mc11]. It is more commonly described by the equation

𝐲 = 𝙼𝛂 + 𝚇𝐮 + 𝛆,    (1)

for which 𝜀ᵢ ∼ 𝓝(0, 𝜎ᵢ²). The random variables 𝐮 and 𝛆 are independent of each other, as are 𝜀ᵢ and 𝜀ⱼ for i≠j. Defining 𝐯 = 𝚇𝐮 leads to 𝐲 = 𝙼𝛂 + 𝐯 + 𝛆.
There is another, even simpler instance of GLMM that is also used in genetic analysis: a linear model (LM) is merely an LMM without the random effects, 𝐲 = 𝙼𝛂 + 𝛆.
The above models are used to establish statistical tests for finding significant associations between genetic loci and phenotypes. For that, their parameters have to be estimated.
As an example, let us define two parameters, 𝓋₀ and 𝓋₁, that describe the overall variances of the random effects and of the residual effects, respectively.
If we assume an LMM, this example can be described by Eq. (1) for which 𝐮 ∼ 𝓝(𝟎, 𝓋₀𝙸) and 𝜀ᵢ ∼ 𝓝(0, 𝓋₁). Equivalently, we have

𝐲 ∼ 𝓝(𝙼𝛂, 𝓋₀𝚇𝚇ᵀ + 𝓋₁𝙸).
Therefore we have a model with three parameters: an array of effect sizes 𝛂 and variances 𝓋₀ and 𝓋₁. If 𝚇 contains the normalized SNP genotypes of the samples, 𝚇𝚇ᵀ is an estimation of the genetic relationship between the samples [Wa17].
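A sketch of this estimation under one common convention: standardise each variant, then average the cross-products. This is an illustration only; `limix.stats.linear_kinship` provides a proper implementation, and other scalings exist.

```python
import numpy as np

# Estimate a genetic relationship matrix from genotypes:
# standardise each variant, then form 𝚇𝚇ᵀ scaled by the number of
# variants (one common convention).
rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(50, 200)).astype(float)  # 0/1/2 dosages

X = (G - G.mean(axis=0)) / G.std(axis=0)
K = X @ X.T / X.shape[1]
print(K.shape)              # (50, 50)
print(np.allclose(K, K.T))  # True: K is symmetric
```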
Statistical test
We use the likelihood ratio test (LRT) approach [LR18] to assess the significance of the association between genetic variants and the phenotype. It is based on the ratio between the marginal likelihood of the null 𝓗₀ and alternative 𝓗₁ models, for which the simpler model 𝓗₀ is defined by placing a constraint on one or more parameters of the alternative model 𝓗₁.
The parameter inference is done via the maximum likelihood estimation (MLE) approach [ML18], for which the marginal likelihood p(𝐲 | 𝙼, 𝚇; 𝛉) is maximised over the parameter set 𝛉. Let 𝛉₀ and 𝛉₁ be the optimal parameter sets under the null and alternative models. The likelihood ratio statistic is given by

2(log p(𝐲 | 𝙼, 𝚇; 𝛉₁) − log p(𝐲 | 𝙼, 𝚇; 𝛉₀)),

which asymptotically follows a χ² distribution [Wh14]. We will make use of the LRT approach in the next sections to flag significant genetic associations.
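For a single constrained parameter, the test reduces to a χ²(1) tail probability, which can be sketched with the standard library alone; the log marginal likelihoods below are made-up numbers for illustration.

```python
import math

# Likelihood-ratio test sketch: with one constrained parameter the
# statistic 2(lml₁ - lml₀) is asymptotically χ²(1), whose survival
# function is erfc(√(x/2)). The lml values are illustrative only.
lml_null, lml_alt = -203.99, -195.10
lrt = 2.0 * (lml_alt - lml_null)
pvalue = math.erfc(math.sqrt(lrt / 2.0))
print(lrt > 0)             # True: the alternative fits at least as well
print(0.0 < pvalue < 1.0)  # True
```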
Single-trait association
We first consider that the observed phenotype is described by additive effects from covariates and genetic components. Any deviation from that is assumed to be captured by the residual distribution. Let 𝙼 be a matrix of covariates and let 𝙶 be a matrix of genetic variants that we suspect might have some effect on the phenotype. Therefore, we have the linear model

𝐲 = 𝙼𝛂 + 𝙶𝛃 + 𝛆,

and we wish to compare the hypotheses 𝓗₀: 𝛃 = 𝟎 versus the alternative 𝛃 ≠ 𝟎.
Note that the parameters of the above model are the covariate effect sizes, 𝛂, the effect sizes of a set of genetic variants, 𝛃, and the variance 𝓋₁ of the noise variable. Under the null hypothesis, we set 𝛃=𝟎 and fit the rest of the parameters. Under the alternative hypothesis, we learn all the parameters. At the end, we compare the marginal likelihoods via the likelihood ratio test.
Let us first generate a random data set having a phenotype, covariates, and a set of genetic candidates.
>>> from numpy import ones, stack
>>> from numpy.random import RandomState
>>> from pandas import DataFrame
>>>
>>> random = RandomState(2)
>>>
>>> # sample size
>>> n = 100
>>>
>>> # covariates
>>> offset = ones(n) * random.randn()
>>> age = random.randint(16, 75, n)
>>> M = DataFrame(stack([offset, age], axis=1), columns=["offset", "age"])
>>> M["sample"] = [f"sample{i}" for i in range(n)]
>>> M = M.set_index("sample")
>>> print(M.head())
offset age
sample
sample0 -0.41676 38.00000
sample1 -0.41676 59.00000
sample2 -0.41676 34.00000
sample3 -0.41676 27.00000
sample4 -0.41676 56.00000
>>> # genetic variants
>>> G = random.randn(n, 4)
>>>
>>> # sampling the phenotype
>>> alpha = random.randn(2)
>>> beta = random.randn(4)
>>> eps = random.randn(n)
>>> y = M @ alpha + G @ beta + eps
We now apply the function limix.qtl.scan() to our data set.
>>> from limix.qtl import scan
>>>
>>> r = scan(G, y, "normal", M=M, verbose=False)
>>> print(r)
Hypothesis 0
------------
𝐲 ~ 𝓝(𝙼𝜶, 3.462⋅𝙸)
M = ['offset' 'age']
𝜶 = [2.10096551 0.19582931]
se(𝜶) = [1.25826998 0.01068367]
lml = -203.98750767964498
Hypothesis 2
------------
𝐲 ~ 𝓝(𝙼𝜶 + G𝛃, s(3.462⋅𝙸))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -1.951e+02 9.915e-01 -6.198e-01
std 9.227e+00 9.342e-01 3.974e-01
min -2.031e+02 1.844e-01 -1.025e+00
25% -2.026e+02 1.959e-01 -9.275e-01
50% -1.967e+02 5.965e-01 -6.047e-01
75% -1.893e+02 1.831e+00 -2.970e-01
max -1.841e+02 2.312e+00 -2.448e-01
Likelihood-ratio test p-values
------------------------------
𝓗₀ vs 𝓗₂
----------------
mean 6.514e-02
std 8.856e-02
min 2.804e-10
25% 2.606e-07
50% 3.651e-02
75% 1.016e-01
max 1.875e-01
Suppose we also have access to the whole genotype of our samples, 𝚇, and we want to use them to account for population structure and cryptic relatedness in our data [Ho13]. Since the number of genetic variants in 𝚇 is commonly larger than the number of samples, and because we are not actually interested in their effect sizes, we will include it in our model as a random component. We now have a linear mixed model:
It is important to note that 𝐯 = 𝚇𝐮 can be equivalently described by a multivariate Normal distribution with a covariance proportional to 𝙺 = 𝚇𝚇ᵀ: 𝐯 ∼ 𝓝(𝟎, 𝓋₀𝙺).
We make use of the function limix.stats.linear_kinship() to define the covariance matrix 𝙺, and call limix.qtl.scan() to perform the analysis.
>>> from limix.stats import linear_kinship, multivariate_normal
>>> from numpy import zeros, eye
>>>
>>> # Whole genotype of each sample.
>>> X = random.randn(n, 50)
>>> # Estimate a kinship relationship between samples.
>>> K = linear_kinship(X, verbose=False) + 1e-9 * eye(n)
>>> # Update the phenotype
>>> y += multivariate_normal(random, zeros(n), K)
>>>
>>> r = scan(X, y, "normal", K, M=M, verbose=False)
>>> print(r)
Hypothesis 0
------------
𝐲 ~ 𝓝(𝙼𝜶, 1.436⋅𝙺 + 2.934⋅𝙸)
M = ['offset' 'age']
𝜶 = [1.95338293 0.19448903]
se(𝜶) = [1.25455536 0.01076470]
lml = -211.3819625136375
Hypothesis 2
------------
𝐲 ~ 𝓝(𝙼𝜶 + G𝛃, s(1.436⋅𝙺 + 2.934⋅𝙸))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -2.109e+02 1.069e+00 5.922e-02
std 7.210e-01 8.819e-01 2.474e-01
min -2.114e+02 1.919e-01 -5.204e-01
25% -2.113e+02 1.944e-01 -1.102e-01
50% -2.111e+02 8.904e-01 4.920e-02
75% -2.109e+02 1.956e+00 2.366e-01
max -2.076e+02 2.215e+00 6.433e-01
Likelihood-ratio test p-values
------------------------------
𝓗₀ vs 𝓗₂
----------------
mean 4.843e-01
std 2.705e-01
min 6.294e-03
25% 3.137e-01
50% 4.752e-01
75% 6.953e-01
max 9.929e-01
Non-normal trait association
If the residuals of the phenotype do not follow a Normal distribution, we might consider performing the analysis using a generalized linear mixed model. Let us consider Poisson-distributed residuals, yᵢ | 𝐮 ∼ Poisson(𝜆ᵢ = g(zᵢ)) with g(x) = eˣ, where the latent phenotype is described by

𝐳 = 𝙼𝛂 + 𝚇𝐮 + 𝛆,

for which 𝚇𝐮 ∼ 𝓝(𝟎, 𝓋₀𝙺) and 𝜀ᵢ ∼ 𝓝(0, 𝓋₁).
Note that the term 𝛆 in the above model is not the residual variable, as it was in Eq. (1). The term 𝛆 is used to account for so-called over-dispersion, i.e., when the residual distribution alone is not sufficient to explain the variability of yᵢ.
>>> from numpy import exp
>>>
>>> z = (y - y.mean()) / y.std()
>>> y = random.poisson(exp(z))
>>>
>>> r = scan(G, y, "poisson", K, M=M, verbose=False)
>>> print(r)
Hypothesis 0
------------
𝐳 ~ 𝓝(𝙼𝜶, 0.154⋅𝙺 + 0.000⋅𝙸) for yᵢ ~ Poisson(λᵢ=g(zᵢ)) and g(x)=eˣ
M = ['offset' 'age']
𝜶 = [5.17511934 0.04665214]
se(𝜶) = [0.85159296 0.00604330]
lml = -145.33385788740767
Hypothesis 2
------------
𝐳 ~ 𝓝(𝙼𝜶 + G𝛃, s(0.154⋅𝙺 + 0.000⋅𝙸)) for yᵢ ~ Poisson(λᵢ=g(zᵢ)) and g(x)=eˣ
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -1.440e+02 2.553e+00 -1.306e-01
std 1.343e+00 2.682e+00 9.268e-02
min -1.453e+02 4.345e-02 -2.227e-01
25% -1.450e+02 4.635e-02 -2.018e-01
50% -1.439e+02 2.456e+00 -1.344e-01
75% -1.428e+02 5.054e+00 -6.321e-02
max -1.427e+02 5.202e+00 -3.085e-02
Likelihood-ratio test p-values
------------------------------
𝓗₀ vs 𝓗₂
----------------
mean 2.830e-01
std 3.213e-01
min 2.274e-02
25% 2.519e-02
50% 2.113e-01
75% 4.692e-01
max 6.867e-01
Single-trait with interaction
The following linear mixed model is considered:

𝐲 = 𝙼𝛂 + (𝙶⊙𝙴₀)𝛃₀ + (𝙶⊙𝙴₁)𝛃₁ + 𝚇𝐮 + 𝛆.

The operator ⊙ denotes an interaction product: each column of 𝙶 is multiplied element-wise with the columns of the environment matrix. Therefore, the terms 𝙶⊙𝙴₀ and 𝙶⊙𝙴₁ can be understood as interaction terms between genetics, 𝙶, and environments, 𝙴₀ and 𝙴₁.
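One common reading of 𝙶⊙𝙴 — every column of 𝙶 multiplied element-wise with every column of 𝙴 — can be sketched as follows. This definition is an assumption for illustration; check the limix.qtl.iscan documentation for the exact one it implements.

```python
import numpy as np

# A sketch of one reading of 𝙶⊙𝙴: every column of G multiplied
# element-wise with every column of E (a row-wise Khatri-Rao
# product). This is an assumption for illustration, not limix's
# documented definition.
def interact(G, E):
    n = G.shape[0]
    return np.einsum("ng,ne->nge", G, E).reshape(n, -1)

G = np.arange(6.0).reshape(3, 2)   # 3 samples, 2 variants
E = np.ones((3, 1))                # a single, constant environment
print(np.allclose(interact(G, E), G))  # with E=1, G⊙E reduces to G
```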
We define three hypotheses from the above linear mixed model. The hypothesis 𝓗₀ is for no interaction (𝛃₀ = 𝛃₁ = 𝟎), 𝓗₁ is for interaction with environments encoded in 𝙴₀ (𝛃₁ = 𝟎), and 𝓗₂ is for interaction with environments encoded in both 𝙴₀ and 𝙴₁. We perform three statistical tests:
𝓗₀ (null) vs 𝓗₁ (alternative)
𝓗₀ (null) vs 𝓗₂ (alternative)
𝓗₁ (null) vs 𝓗₂ (alternative)
Here is an example.
>>> from numpy import concatenate, newaxis
>>> from limix.qtl import iscan
>>>
>>> # Generate interacting variables (environment)
>>> E0 = random.randn(y.shape[0], 1)
>>> E1 = random.randn(y.shape[0], 1)
>>>
>>> r = iscan(G, y, "normal", K, M, E0=E0, E1=E1, verbose=False)
>>> print(r)
Hypothesis 0
------------
𝐲 ~ 𝓝(𝙼𝜶, 0.376⋅𝙺 + 2.077⋅𝙸)
M = ['offset' 'age']
𝜶 = [3.12608063 0.06042316]
se(𝜶) = [1.01867609 0.00870181]
lml = -185.77488727691096
Hypothesis 1
------------
𝐲 ~ 𝓝(𝙼𝜶 + (𝙶⊙𝙴₀)𝛃₀, s(0.376⋅𝙺 + 2.077⋅𝙸))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -1.856e+02 1.611e+00 -2.976e-03
std 1.949e-01 1.658e+00 1.208e-01
min -1.858e+02 6.034e-02 -1.461e-01
25% -1.858e+02 6.058e-02 -4.769e-02
50% -1.856e+02 1.590e+00 -7.487e-03
75% -1.854e+02 3.137e+00 3.722e-02
max -1.854e+02 3.235e+00 1.492e-01
Hypothesis 2
------------
𝐲 ~ 𝓝(𝙼𝜶 + (𝙶⊙𝙴₀)𝛃₀ + (𝙶⊙𝙴₁)𝛃₁, s(0.376⋅𝙺 + 2.077⋅𝙸))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -1.852e+02 1.612e+00 7.001e-03
std 7.598e-01 1.659e+00 1.475e-01
min -1.857e+02 5.991e-02 -2.573e-01
25% -1.856e+02 6.096e-02 -4.135e-02
50% -1.855e+02 1.571e+00 3.611e-02
75% -1.851e+02 3.135e+00 7.660e-02
max -1.841e+02 3.241e+00 1.971e-01
Likelihood-ratio test p-values
------------------------------
𝓗₀ vs 𝓗₁ 𝓗₀ vs 𝓗₂ 𝓗₁ vs 𝓗₂
----------------------------------------
mean 6.867e-01 6.501e-01 5.244e-01
std 3.199e-01 3.350e-01 3.168e-01
min 3.963e-01 1.795e-01 9.940e-02
25% 4.185e-01 5.578e-01 3.784e-01
50% 6.755e-01 7.277e-01 5.971e-01
75% 9.436e-01 8.200e-01 7.431e-01
max 9.995e-01 9.654e-01 8.042e-01
Multi-trait association
LMMs can also be used to jointly model multiple traits. Let n, c, and p be the number of samples, covariates, and traits, respectively. The outcome variable 𝚈 is an n×p matrix distributed according to

𝚟𝚎𝚌(𝚈) ∼ 𝓝((𝙰⊗𝙼) 𝚟𝚎𝚌(𝐀), 𝙲₀⊗𝚇𝚇ᵀ + 𝙲₁⊗𝙸).    (2)
𝙰 and 𝙼 are design matrices of dimensions p×p and n×c provided by the user, where 𝙼 is the usual matrix of covariates commonly used in single-trait models. 𝐀 is a c×p matrix of fixed-effect sizes per trait. 𝚇 is an n×r matrix provided by the user and 𝙸 is an n×n identity matrix. 𝙲₀ and 𝙲₁ are both symmetric matrices of dimensions p×p, for which 𝙲₁ is guaranteed by our implementation to be of full rank. The parameters of this model are the matrices 𝐀, 𝙲₀, and 𝙲₁. 𝚟𝚎𝚌(⋅) is a function that stacks the columns of the provided matrix into a vector [Ve19].
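The 𝚟𝚎𝚌(⋅) operator and the Kronecker products above obey the standard identity 𝚟𝚎𝚌(𝙼𝐁𝙰ᵀ) = (𝙰⊗𝙼)𝚟𝚎𝚌(𝐁), which a short NumPy check confirms (toy sizes; the matrix names mirror the model but the values are arbitrary):

```python
import numpy as np

# vec(⋅) stacks the columns of a matrix into a vector; in NumPy this
# is a Fortran-order reshape. Together with the Kronecker product it
# gives vec(M B Aᵀ) = (A ⊗ M) vec(B), which is how the matrix of
# effect sizes enters the vectorised model.
M = np.arange(6.0).reshape(3, 2)   # n×c covariates (toy sizes)
A = np.arange(4.0).reshape(2, 2)   # p×p trait design
B = np.arange(4.0).reshape(2, 2)   # c×p effect sizes

vec = lambda X: X.reshape(-1, order="F")
lhs = vec(M @ B @ A.T)
rhs = np.kron(A, M) @ vec(B)
print(np.allclose(lhs, rhs))  # True
```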
Let 𝐲 = 𝚟𝚎𝚌(𝚈) and 𝛂 = 𝚟𝚎𝚌(𝐀). We can extend the model in Eq. (2) to represent three different hypotheses, 𝓗₀, 𝓗₁, and 𝓗₂, defined as before. Here is an example.
>>> from numpy import eye
>>>
>>> p = 2
>>> Y = random.randn(n, p)
>>> A = random.randn(p, p)
>>> A = A @ A.T
>>> A0 = ones((p, 1))
>>> A1 = eye(p)
>>>
>>> r = scan(G, Y, K=K, M=M, A=A, A0=A0, A1=A1, verbose=False)
>>> print(r)
Hypothesis 0
------------
𝐲 ~ 𝓝((A⊗𝙼)𝛂, C₀⊗𝙺 + C₁⊗𝙸)
traits = ['0' '1']
M = ['offset' 'age']
𝜶 = [-0.16350676 -0.00299814 -0.34521236 -0.00080406]
se(𝜶) = [11.30571652 0.09640163 5.36090270 0.04573611]
diag(C₀) = [0.01404947 0.29153072]
diag(C₁) = [0.81175806 0.85780008]
lml = -277.3341913587698
Hypothesis 1
------------
𝐲 ~ 𝓝((A⊗𝙼)𝛂 + (A₀⊗G)𝛃₀, s(C₀⊗𝙺 + C₁⊗𝙸))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -2.763e+02 -1.243e-01 -2.842e-02
std 1.329e+00 1.435e-01 1.120e-01
min -2.773e+02 -3.712e-01 -1.666e-01
25% -2.772e+02 -2.032e-01 -7.187e-02
50% -2.767e+02 -6.896e-02 -2.672e-02
75% -2.758e+02 -1.371e-03 1.673e-02
max -2.744e+02 1.386e-04 1.063e-01
Hypothesis 2
------------
𝐲 ~ 𝓝((A⊗𝙼)𝛂 + (A₀⊗G)𝛃₀ + (A₁⊗G)𝛃₁, s(C₀⊗𝙺 + C₁⊗𝙸))
lml cov. effsizes cand. effsizes
--------------------------------------------------
mean -2.761e+02 -1.245e-01 -1.209e-02
std 1.404e+00 1.439e-01 5.823e-02
min -2.772e+02 -3.745e-01 -1.151e-01
25% -2.770e+02 -2.025e-01 -3.202e-02
50% -2.766e+02 -6.702e-02 -7.898e-03
75% -2.757e+02 -1.441e-03 2.127e-02
max -2.741e+02 -7.171e-04 7.372e-02
Likelihood-ratio test p-values
------------------------------
𝓗₀ vs 𝓗₁ 𝓗₀ vs 𝓗₂ 𝓗₁ vs 𝓗₂
----------------------------------------
mean 3.973e-01 6.122e-01 8.438e-01
std 3.851e-01 3.942e-01 1.327e-01
min 1.597e-02 9.170e-02 7.251e-01
25% 1.133e-01 4.159e-01 7.319e-01
50% 3.626e-01 7.039e-01 8.370e-01
75% 6.466e-01 9.002e-01 9.488e-01
max 8.478e-01 9.493e-01 9.760e-01
References
- LR18
Wikipedia contributors. (2018, October 21). Likelihood-ratio test. In Wikipedia, The Free Encyclopedia. Retrieved 16:13, November 27, 2018, from https://en.wikipedia.org/w/index.php?title=Likelihood-ratio_test&oldid=865020904
- ML18
Wikipedia contributors. (2018, November 8). Maximum likelihood estimation. In Wikipedia, The Free Encyclopedia. Retrieved 16:08, November 27, 2018, from https://en.wikipedia.org/w/index.php?title=Maximum_likelihood_estimation&oldid=867823508
- St16
Stroup, W. W. (2016). Generalized linear mixed models: modern concepts, methods and applications. CRC press.
- Ef18
Wikipedia contributors. (2018, October 18). Exponential family. In Wikipedia, The Free Encyclopedia. Retrieved 18:45, November 25, 2018, from https://en.wikipedia.org/w/index.php?title=Exponential_family&oldid=864576150
- Mc11
McCulloch, Charles E., and Shayle R. Searle. Generalized, linear, and mixed models. John Wiley & Sons, 2004.
- Ve19
Wikipedia contributors. (2018, September 11). Vectorization (mathematics). In Wikipedia, The Free Encyclopedia. Retrieved 16:18, November 28, 2018, from https://en.wikipedia.org/w/index.php?title=Vectorization_(mathematics)&oldid=859035294
- Wa17
Wang, B., Sverdlov, S., & Thompson, E. (2017). Efficient estimation of realized kinship from single nucleotide polymorphism genotypes. Genetics, 205(3), 1063-1078.
- Wh14
White, H. (2014). Asymptotic theory for econometricians. Academic press.
- Ho13
Hoffman, G. E. (2013). Correcting for population structure and kinship using the linear mixed model: theory and extensions. PloS one, 8(10), e75707.
Variance decomposition
Single-trait decomposition
We make use of a GLMM with random effects structured in multiple variables, each one describing a different aspect of the dataset. For a normally distributed phenotype, the model has a linear predictor with fixed effects 𝙼𝛃 plus one random effect per variance component; for a non-normally distributed phenotype, the same predictor enters the residual distribution through a link function, as described in the Introduction. The parameters 𝛃 and the variances 𝓋ⱼ are fit via maximum likelihood.
Example: glucose condition
Here we use the Limix variance decomposition module to quantify the variability in gene expression explained by proximal (cis) and distal (trans) genetic variation. To do so, we build a linear mixed model with an intercept, two random effects for cis and trans genetic effects, and a noise random effect.
Let's first download the dataset.
>>> import limix
>>>
>>> url = "http://rest.s3for.me/limix/smith08.hdf5.bz2"
>>> filepath = limix.sh.download(url, verbose=False)
>>> filepath = limix.sh.extract(filepath, verbose=False)
>>> # This dataset is in the old limix format.
>>> data = limix.io.hdf5.read_limix(filepath)
>>> Y = data['phenotype']
>>> G_all = data['genotype']
The following code block shows a summary of the downloaded phenotypes and defines the lysine groups.
Note
The phenotype variable Y is of type xarray.DataArray. It has two dimensions and multiple coordinates associated with them.
>>> print(Y)
<xarray.DataArray 'phenotype' (sample: 109, outcome: 10986)>
array([[-0.037339, -0.078165, 0.042936, ..., 0.095596, -0.132385, -0.274954],
[-0.301376, 0.066055, 0.338624, ..., -0.142661, -0.238349, 0.732752],
[ 0.002661, 0.121835, -0.137064, ..., -0.144404, 0.257615, 0.015046],
...,
[-0.287339, 0.351835, 0.072936, ..., 0.097339, -0.038349, 0.162752],
[-0.577339, 0.011835, -0.007064, ..., 0.135596, 0.107615, 0.245046],
[-0.277339, 0.061835, 0.132936, ..., 0.015596, -0.142385, -0.124954]])
Coordinates:
* sample (sample) int64 0 1 2 3 4 5 6 7 ... 102 103 104 105 106 107 108
environment (outcome) float64 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
gene_ID (outcome) object 'YOL161C' 'YJR107W' ... 'YLR118C' 'YBR242W'
gene_chrom (outcome) object '15' '10' '16' '7' '4' ... '3' '10' '12' '2'
gene_end (outcome) int64 11548 628319 32803 ... 315049 384726 705381
gene_start (outcome) int64 11910 627333 30482 ... 315552 385409 704665
gene_strand (outcome) object 'C' 'W' 'W' 'W' 'W' ... 'W' 'W' 'C' 'C' 'W'
phenotype_ID (outcome) object 'YOL161C:0' 'YJR107W:0' ... 'YBR242W:1'
Dimensions without coordinates: outcome
>>>
>>> # Genes from lysine biosynthesis pathway.
>>> lysine_group = [
... "YIL094C",
... "YDL182W",
... "YDL131W",
... "YER052C",
... "YBR115C",
... "YDR158W",
... "YNR050C",
... "YJR139C",
... "YIR034C",
... "YGL202W",
... "YDR234W",
... ]
We will compute the relationship matrix K_all considering all SNPs and define the cis region size window_size in base pairs. Then we loop over two genes from the lysine pathway, delimit the corresponding cis region, define the model, and fit it.
>>> from numpy import dot
>>>
>>> K_all = dot(G_all, G_all.T)
>>> window_size = int(5e5)
>>>
>>> vardecs = []
>>>
>>> # We loop over the first two groups only.
>>> for gene in lysine_group[:2]:
... # Select the row corresponding to gene of interest on environment 0.0.
... y = Y[:, (Y["gene_ID"] == gene) & (Y["environment"] == 0.0)]
...
... # Estimated middle point of the gene.
... midpoint = (y["gene_start"].item() + y["gene_end"].item()) / 2
...
... # Window definition.
... start = midpoint - window_size // 2
... end = midpoint + window_size // 2
... geno = G_all[:, (G_all["pos"] >= start) & (G_all["pos"] <= end)]
...
... G_cis = G_all[:, geno.candidate]
... K_cis = dot(G_cis, G_cis.T)
... K_trans = K_all - K_cis
...
... # Definition of the model to fit our data from which we extract
... # the relative signal strength.
... vardec = limix.vardec.VarDec(y, "normal")
... vardec.append(K_cis, "cis")
... vardec.append(K_trans, "trans")
... vardec.append_iid("noise")
... vardec.fit(verbose=False)
... vardecs.append(vardec)
We show a summary of each decomposition.
>>> print(vardecs[0])
Variance decomposition
----------------------
𝐲 ~ 𝓝(𝙼𝜶, 0.018⋅𝙺 + 0.047⋅𝙺 + 0.066⋅𝙸)
>>> print(vardecs[1])
Variance decomposition
----------------------
𝐲 ~ 𝓝(𝙼𝜶, 0.197⋅𝙺 + 0.087⋅𝙺 + 0.149⋅𝙸)
We now plot the results.
>>> vardecs[0].plot()
>>> vardecs[1].plot()
And remove temporary files.
>>> limix.sh.remove("smith08.hdf5.bz2")
>>> limix.sh.remove("smith08.hdf5")
Heritability estimation¶
We provide heritability estimation for Normal, Bernoulli, Probit, Binomial, and Poisson phenotypes. A standard LMM is used for Normal traits:
𝐲 = 𝙼𝛂 + 𝐯 + 𝛆,
where 𝐯 ~ 𝓝(𝟎, 𝓋₀𝙺) and 𝛆 ~ 𝓝(𝟎, 𝓋₁𝙸). A GLMM is used to model the other trait types: the linear predictor 𝐳 = 𝙼𝛂 + 𝐯 + 𝛆 is mapped to the phenotype through a link function, and 𝐯 and 𝛆 are defined as before.
In both cases, the parameters are the same: 𝛂, 𝓋₀, and 𝓋₁. They are fitted via restricted maximum likelihood for the LMM and via maximum likelihood for the GLMM. The covariance matrix 𝙺 given by the user is normalised before the model is fitted as follows:
K = K / K.diagonal().mean()
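The effect of this normalisation can be checked directly. A minimal sketch with synthetic data (the matrix K here is made up for illustration):

```python
import numpy as np

# Build a toy covariance and apply the same normalisation limix performs:
# dividing by the mean diagonal makes the average sample variance implied
# by K equal to 1, so the fitted v0 is on the same scale as v1.
rng = np.random.default_rng(0)
G = rng.standard_normal((50, 100))
K = G @ G.T
K = K / K.diagonal().mean()
```

After this step K.diagonal().mean() is exactly 1, regardless of how the kinship was estimated.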
- limix.her.estimate(y, lik, K, M=None, verbose=True)[source]
Estimate the so-called narrow-sense heritability.
It supports Normal, Bernoulli, Probit, Binomial, and Poisson phenotypes.
- Parameters
y (array_like) – Array of trait values of n individuals.
lik (tuple, "normal", "bernoulli", "probit", "binomial", "poisson") – Sample likelihood describing the residual distribution. Either a tuple or a string specifying the likelihood is required. The Normal, Bernoulli, Probit, and Poisson likelihoods can be selected by providing a string. The Binomial likelihood, on the other hand, requires a tuple because of the number of trials: ("binomial", array_like). Defaults to "normal".
K (n×n array_like) – Sample covariance, often the so-called kinship matrix. It might be, for example, the estimated kinship relationship between the individuals. The provided matrix will be normalised as K = K / K.diagonal().mean().
M (n×c array_like, optional) – Covariates matrix. If an array is passed, it will be used as is; no normalisation will be performed. If None is passed, an offset will be used as the only covariate. Defaults to None.
verbose (bool, optional) – True to display progress and summary; False otherwise.
- Returns
Estimated heritability.
- Return type
float
Examples
>>> from numpy import dot, exp, sqrt
>>> from numpy.random import RandomState
>>> from limix.her import estimate
>>>
>>> random = RandomState(0)
>>>
>>> G = random.randn(150, 200) / sqrt(200)
>>> K = dot(G, G.T)
>>> z = dot(G, random.randn(200)) + random.randn(150)
>>> y = random.poisson(exp(z))
>>>
>>> print(estimate(y, 'poisson', K, verbose=False))
0.18311439918863426
Notes
It will raise a ValueError exception if non-finite values are passed. Please refer to the limix.qc.mean_impute() function for missing value imputation.
Plotting¶
Kinship¶
We provide a function that clusters the rows and columns of a given matrix and plots it as a heatmap. This is very useful to quickly visualize possible population structure in the data set.
>>> import limix
>>>
>>> K = limix.plot.load_dataset("kinship")
>>> print(K)
[[0.67003303 0.09512837 0.09346511 ... 0.08672590 0.08412999 0.08805045]
[0.09512837 0.66972454 0.09344451 ... 0.08304282 0.08735654 0.07970102]
[0.09346511 0.09344451 0.67305621 ... 0.08992310 0.08756828 0.08362728]
...
[0.08672590 0.08304282 0.08992310 ... 0.69345128 0.08846725 0.09457396]
[0.08412999 0.08735654 0.08756828 ... 0.08846725 0.69414844 0.10091500]
[0.08805045 0.07970102 0.08362728 ... 0.09457396 0.10091500 0.68948159]]
>>> limix.plot.kinship(K)
Manhattan¶
The results of GWAS can be visualized by a Manhattan plot.
>>> import limix
>>> from numpy import log10
>>>
>>> df = limix.plot.load_dataset('gwas')
>>> df = df.rename(columns={"chr": "chrom"})
>>> print(df.head())
chrom pos pv
234 10 224239 0.00887
239 10 229681 0.00848
253 10 240788 0.00721
258 10 246933 0.00568
266 10 255222 0.00593
>>> limix.plot.manhattan(df)
>>> plt = limix.plot.get_pyplot()
>>> _ = plt.axhline(-log10(1e-7), color='red')
>>> _ = plt.ylim(2, plt.ylim()[1])
Quantile-quantile plot¶
A QQ-plot can be used to assess the calibration of the obtained p-values.
>>> import limix
>>> from numpy.random import RandomState
>>>
>>> random = RandomState(1)
>>>
>>> pv0 = random.rand(10000)
>>> pv0[0] = 1e-6
>>>
>>> pv1 = random.rand(10000)
>>> pv2 = random.rand(10000)
>>>
>>> limix.plot.qqplot(pv0)
>>>
>>> limix.plot.qqplot(pv0)
>>> limix.plot.qqplot(pv1, line=False, alpha=None)
>>>
>>> limix.plot.qqplot(pv1)
>>> limix.plot.qqplot(pv2, line=False, alpha=None)
>>> limix.plot.box_aspect()
>>>
>>> limix.plot.qqplot(pv0, label='label0', band_kws=dict(color='#EE0000',
... alpha=0.2));
>>> limix.plot.qqplot(pv1, label='label1', line=False, alpha=None);
>>> limix.plot.qqplot(pv2, label='label2', line=False,
... alpha=None, pts_kws=dict(marker='*'));
>>> _ = limix.plot.get_pyplot().legend()
Command line interface¶
Introduction¶
Limix now provides several of its functionalities via the command line.
$ limix --help
Usage: limix [OPTIONS] COMMAND [ARGS]...
Options:
--version Show the version and exit.
-h, --help Show this message and exit.
Commands:
download Download file from the specified URL.
estimate-kinship Estimate a kinship matrix.
extract Extract a file.
qtl Perform genome-wide association scan.
remove Remove a file.
see Show an overview of multiple file types.
You can quickly explore common file types used in genetics, as the examples below demonstrate.
Kinship¶
Heatmap representing a plink kinship matrix.
$ limix download http://rest.s3for.me/limix/small_example.grm.raw.bz2
Downloading http://rest.s3for.me/limix/small_example.grm.raw.bz2... done (0.37 seconds).
$ limix extract small_example.grm.raw.bz2
Extracting small_example.grm.raw.bz2... done (0.02 seconds).
$ limix see small_example.grm.raw
Reading small_example.grm.raw... done (0.13 seconds).
Plotting...
Plink BED format¶
A preview of Plink files in BED format can be done via
$ limix download http://rest.s3for.me/limix/plink_example.tar.gz
Downloading http://rest.s3for.me/limix/plink_example.tar.gz... done (0.26 seconds).
$ limix extract plink_example.tar.gz
Extracting plink_example.tar.gz... done (0.02 seconds).
$ limix see plink_example
Reading `plink_example`...
Mapping files: 100%|██████████| 3/3 [00:00<00:00, 374.84it/s]
done (0.02 seconds).
---------------------------------- Variants ----------------------------------
chrom snp cm pos a0 a1 i
candidate
snp_22_18958209 22 snp_22_18958209 0.0 18958209 A G 0
snp_22_19597806 22 snp_22_19597806 0.0 19597806 T C 1
snp_22_20171368 22 snp_22_20171368 0.0 20171368 T C 2
snp_22_20179046 22 snp_22_20179046 0.0 20179046 T C 3
snp_22_20828867 22 snp_22_20828867 0.0 20828867 T C 4
... ... ... ... ... .. ... ..
indel:4D_22_49340059 22 indel:4D_22_49340059 0.0 49340059 G GAGAC 95
snp_22_49362308 22 snp_22_49362308 0.0 49362308 C T 96
snp_22_49473688 22 snp_22_49473688 0.0 49473688 T C 97
snp_22_49568955 22 snp_22_49568955 0.0 49568955 G A 98
snp_22_50837415 22 snp_22_50837415 0.0 50837415 A G 99
[100 rows x 7 columns]
--------------------- Samples ---------------------
fid iid father mother gender trait i
sample
HG00105 0 HG00105 0 0 0 -9 0
HG00107 0 HG00107 0 0 0 -9 1
HG00115 0 HG00115 0 0 0 -9 2
HG00132 0 HG00132 0 0 0 -9 3
HG00145 0 HG00145 0 0 0 -9 4
... .. ... ... ... ... ... ...
NA20815 0 NA20815 0 0 0 -9 460
NA20816 0 NA20816 0 0 0 -9 461
NA20819 0 NA20819 0 0 0 -9 462
NA20826 0 NA20826 0 0 0 -9 463
NA20828 0 NA20828 0 0 0 -9 464
[465 rows x 7 columns]
BIMBAM file formats¶
Phenotype:
$ limix download http://rest.s3for.me/limix/ex0/phenotype.gemma
Downloading http://rest.s3for.me/limix/ex0/phenotype.gemma... done (0.26 seconds).
$ limix see phenotype.gemma:bimbam-pheno
Reading `phenotype.gemma`... done (0.02 seconds).
---- Phenotypes -----
trait 0 1 2
sample
0 1.2 -0.3 -1.5
1 NaN 1.5 0.3
2 2.7 1.1 NaN
3 -0.2 -0.7 0.8
4 3.3 2.4 2.1
HDF5¶
The following command shows the hierarchy of an HDF5 file:
$ limix download http://rest.s3for.me/limix/small_example.hdf5
Downloading http://rest.s3for.me/limix/small_example.hdf5... done (0.37 seconds).
$ limix see small_example.hdf5
/
+--genotype
+--col_header
| +--chrom [|S8, (100,), None]
| +--pos [int64, (100,), None]
+--matrix [uint8, (183, 100), None]
+--row_header
+--sample_ID [|S7, (183,), None]
CSV¶
CSV files have their delimiter automatically detected and a preview can be shown as
$ limix download http://rest.s3for.me/limix/small_example.csv.bz2
Downloading http://rest.s3for.me/limix/small_example.csv.bz2... done (0.26 seconds).
$ limix extract small_example.csv.bz2
Extracting small_example.csv.bz2... done (0.02 seconds).
$ limix see small_example.csv
Reading small_example.csv... done (0.06 seconds).
0 1 2 3 4 5 6 7 ... 458 459 460 461 462 463 464 465
0 snp_22_16050408 A A A A A A A ... B B B B B B B B
1 snp_22_16050612 A A A A A A A ... B B B B B B B B
2 snp_22_16050678 A A A A A A A ... B B B B B B B B
3 snp_22_16051107 A A A A A A A ... B B B B B B B B
4 snp_22_16051249 A A A A A A A ... B B B B C C B B
[5 rows x 466 columns]
Image¶
An image can be seen via
$ limix download http://rest.s3for.me/limix/dali.jpg.bz2
Downloading http://rest.s3for.me/limix/dali.jpg.bz2... done (0.71 seconds).
$ limix extract dali.jpg.bz2
Extracting dali.jpg.bz2... done (0.03 seconds).
$ limix see dali.jpg
GWAS¶
$ limix qtl --help
Usage: limix qtl [OPTIONS] COMMAND [ARGS]...
Perform genome-wide association scan.
Options:
-h, --help Show this message and exit.
Commands:
scan Single-variant association testing via mixed models.
Statistics¶
Alleles¶
- limix.stats.allele_expectation(p, nalleles, ploidy)[source]
Allele expectation.
Compute the expectation of each allele from the given probabilities. It accepts three shapes of matrices:
- unidimensional array of probabilities;
- bidimensional samples-by-alleles probabilities array;
- three-dimensional variants-by-samples-by-alleles array.
- Parameters
p (array_like) – Allele probabilities.
nalleles (int) – Number of alleles.
ploidy (int) – Number of complete sets of chromosomes.
- Returns
expectation – Last dimension will contain the expectation of each allele.
- Return type
ndarray
Examples
>>> from texttable import Texttable
>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>>
>>> sampleid = "sample_005"
>>> rsid = "RSID_6"
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...
...     locus = bgen["variants"].query("rsid == '{}'".format(rsid)).index
...     locus = locus.compute().values[0]
...     sample = bgen["samples"].to_frame().query("id == '{}'".format(sampleid))
...     sample = sample.index
...     sample = sample[0]
...
...     nalleles = bgen["variants"].loc[locus]["nalleles"]
...     ploidy = 2
...
...     p = bgen["genotype"][locus].compute()["probs"][sample]
...     # For unphased genotypes only.
...     e = allele_expectation(bgen, locus)[sample]
...
...     alleles = bgen["variants"].loc[locus]["allele_ids"].compute()
...     alleles = alleles.values[0].split(",")
...
...     tab = Texttable()
...
...     print(tab.add_rows(
...         [
...             ["", "AA", "AG", "GG", "E[.]"],
...             ["p"] + list(p) + [1.0],
...             ["#" + alleles[0], 2, 1, 0, e[0]],
...             ["#" + alleles[1], 0, 1, 2, e[1]],
...         ]
...     ).draw())
+----+-------+-------+-------+-------+
|    | AA    | AG    | GG    | E[.]  |
+====+=======+=======+=======+=======+
| p  | 0.012 | 0.987 | 0.001 | 1     |
+----+-------+-------+-------+-------+
| #A | 2     | 1     | 0     | 1.011 |
+----+-------+-------+-------+-------+
| #G | 0     | 1     | 2     | 0.989 |
+----+-------+-------+-------+-------+
>>> print("variant: {}".format(rsid))
variant: RSID_6
>>> print("sample : {}".format(sampleid))
sample : sample_005
Note
This function supports unphased genotypes only.
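The expectation in the table above can be reproduced by hand. A small sketch using the probabilities from the RSID_6/sample_005 example (unphased diploid):

```python
# P(AA), P(AG), P(GG) for the example genotype above.
probs = [0.012, 0.987, 0.001]

# The expected count of an allele is the probability-weighted number of
# copies carried under each genotype.
e_A = 2 * probs[0] + 1 * probs[1]  # expected number of A alleles
e_G = 1 * probs[1] + 2 * probs[2]  # expected number of G alleles
```

These reproduce the E[.] column of the table: e_A ≈ 1.011 and e_G ≈ 0.989.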
- limix.stats.allele_frequency(X)[source]
Compute allele frequency from its expectation.
X is a matrix whose last dimension corresponds to the different sets of chromosomes. The first dimension represents different variants and the second dimension represents the different samples.
- Parameters
X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
- Returns
frequency – Allele frequencies encoded as a variants-by-alleles matrix.
- Return type
ndarray
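A plausible sketch of the computation (the expectation array below is made up for illustration; the real function operates on the output of allele expectation):

```python
import numpy as np

# Allele expectations for 1 variant, 4 diploid samples, 2 alleles.
E = np.array([[[2.0, 0.0],
               [1.0, 1.0],
               [0.0, 2.0],
               [1.0, 1.0]]])

# Each allele's frequency is its total expected count across samples
# divided by the total number of allele copies.
totals = E.sum(axis=1)                             # variants-by-alleles
freq = totals / totals.sum(axis=1, keepdims=True)  # variants-by-alleles
```

Here both alleles occur with frequency 0.5, as expected for this balanced toy example.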
- limix.stats.compute_dosage(X, alt=None)[source]
Compute dosage from allele expectation.
- Parameters
X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
alt (array_like) – Allele reference of each locus. The allele having the minor allele frequency for the provided X is used as the reference if None. Defaults to None.
- Returns
dosage – Dosage encoded as a variants-by-samples matrix.
- Return type
ndarray
Example
>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>> from bgen_reader import compute_dosage
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...     variant_idx = 2
...     e = allele_expectation(bgen, variant_idx)
...     dosage = compute_dosage(e)
...     print(dosage[:5])
[0.01550294 0.99383543 1.97933958 0.99560547 1.97879027]
Chi-square mixture¶
- limix.stats.Chi2Mixture(scale_min=0.1, scale_max=5.0, dof_min=0.1, dof_max=5.0, n_intervals=100, qmax=0.1, tol=0, lrt=None)[source]
Mixture of 𝜒² distributions.
Class for evaluating P values of a test statistic that follows a two-component mixture of 𝜒² distributions,
\[p(x) = (1-𝑝)𝜒²(0) + 𝑝𝑎𝜒²(𝑑).\]Here 𝑝 is the probability of being in the second component, and 𝑎 and 𝑑 are the scale parameter and the number of degrees of freedom of that component.
Warning
This class is not production-ready. Keep in mind that this interface is likely to change.
- Parameters
scale_min (float) – Minimum value used for fitting the scale parameter.
scale_max (float) – Maximum value used for fitting the scale parameter.
dof_min (float) – Minimum value used for fitting the dof parameter.
dof_max (float) – Maximum value used for fitting the dof parameter.
qmax (float) – Only the top qmax quantile is used for the fit.
n_intervals (int) – Number of intervals when performing the grid search.
tol (float) – Tolerance of being zero.
Example
>>> from numpy.random import RandomState
>>> import scipy as sp
>>> from limix.stats import Chi2Mixture
>>>
>>> scale = 0.3
>>> dof = 2
>>> mixture = 0.2
>>> n = 100
>>>
>>> random = RandomState(1)
>>> x = random.chisquare(dof, n)
>>> n0 = int((1 - mixture) * n)
>>> idxs = random.choice(n, n0, replace=False)
>>> x[idxs] = 0
>>>
>>> chi2mix = Chi2Mixture(scale_min=0.1, scale_max=5.0,
...                       dof_min=0.1, dof_max=5.0,
...                       qmax=0.1, tol=4e-3)
>>> chi2mix.estimate_chi2mixture(x)
>>> pv = chi2mix.sf(x)
>>> print(pv[:4])
[0.2 0.2 0.2 0.2]
>>>
>>> print('%.2f' % chi2mix.scale)
1.98
>>> print('%.2f' % chi2mix.dof)
0.89
>>> print('%.2f' % chi2mix.mixture)
0.20
Ground truth evaluation¶
- limix.stats.confusion_matrix(df, wsize=50000)[source]
Provide a couple of scores based on the idea of windows around genetic markers.
- Parameters
causals – Indices defining the causal markers.
pos – Within-chromosome base-pair position of each candidate marker, in ascending order.
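A sketch of the window idea these scores are built on (positions and window size are illustrative; variable names hypothetical):

```python
# A candidate marker counts as a hit when it falls within wsize/2 base
# pairs of some causal marker.
wsize = 50000
causal_pos = [100000, 500000]
candidate_pos = [101000, 300000, 480000]
hits = [any(abs(c - q) <= wsize // 2 for q in causal_pos)
        for c in candidate_pos]
```

Here the first and last candidates land inside a window around a causal marker, while the middle one does not.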
Kinship¶
- limix.stats.linear_kinship(G, out=None, verbose=True)[source]
Estimate Kinship matrix via linear kernel.
Let 𝑑 be the number of columns of G. The resulting matrix is given by:
\[𝙺 = 𝚇𝚇ᵀ/𝑑\]
where
\[𝚇ᵢⱼ = (𝙶ᵢⱼ - 𝑚ⱼ) / 𝑠ⱼ\]
is the matrix G column-wise normalised by the means 𝑚ⱼ and standard deviations 𝑠ⱼ. NaNs are ignored so as to produce a matrix K having only real numbers.
This function is specially designed to also handle large matrices G that would otherwise require a large amount of memory if they were to be loaded into memory first. For those cases, libraries like Dask come in handy.
- Parameters
G (array_like) – Samples-by-variants matrix.
out (ndarray) – A location into which the result is stored.
verbose (bool, optional) – True for showing progress; False otherwise. Defaults to True.
Examples
>>> from numpy.random import RandomState
>>> from limix.stats import linear_kinship
>>>
>>> random = RandomState(1)
>>> X = random.randn(4, 100)
>>> K = linear_kinship(X, verbose=False)
>>> print(K)
[[ 0.91314823 -0.19283362 -0.34133897 -0.37897564]
 [-0.19283362  0.89885153 -0.2356003  -0.47041761]
 [-0.34133897 -0.2356003   0.95777313 -0.38083386]
 [-0.37897564 -0.47041761 -0.38083386  1.23022711]]
Likelihood-ratio test¶
- limix.stats.lrt_pvalues(null_lml, alt_lmls, dof=1)[source]
Compute p-values from likelihood ratios.
These are likelihood ratio test p-values, obtained by comparing twice the log-likelihood difference against a 𝜒² distribution with dof degrees of freedom.
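For the one-degree-of-freedom case, the computation can be sketched with the standard library alone, since the 𝜒²(1) survival function has the closed form erfc(√(x/2)). The function name is hypothetical:

```python
from math import erfc, sqrt

def lrt_pvalue_1dof(null_lml, alt_lml):
    """P-value of a 1-dof likelihood-ratio test."""
    # Twice the log-likelihood difference is chi-square(1) under the null.
    stat = max(2.0 * (alt_lml - null_lml), 0.0)
    return erfc(sqrt(stat / 2.0))
```

For example, lrt_pvalue_1dof(-5.0, -3.0) gives roughly 0.046, i.e. borderline at the 5% level.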
Principal component analysis¶
- limix.stats.pca(X, ncomp)[source]
Principal component analysis.
- Parameters
X (array_like) – Samples-by-dimensions array.
ncomp (int) – Number of components.
- Returns
components (ndarray) – First components ordered by explained variance.
explained_variance (ndarray) – Explained variance.
explained_variance_ratio (ndarray) – Percentage of variance explained.
Examples
>>> from numpy import round
>>> from numpy.random import RandomState
>>> from limix.stats import pca
>>>
>>> X = RandomState(1).randn(4, 5)
>>> r = pca(X, ncomp=2)
>>> r['components']
array([[-0.75015369,  0.58346541, -0.07973564,  0.19565682, -0.22846925],
       [ 0.48842769,  0.72267548,  0.01968344, -0.46161623, -0.16031708]])
>>> r['explained_variance']
array([6.44655993, 0.51454938])
>>> r['explained_variance_ratio']
array([0.92049553, 0.07347181])
P-value correction¶
- limix.stats.multipletests(pvals, alpha=0.05, method='hs', is_sorted=False)[source]
Test results and p-value correction for multiple tests.
- Parameters
pvals (array_like) – Uncorrected p-values.
alpha (float) – FWER, family-wise error rate, e.g. 0.1.
method (string) – Method used for testing and adjustment of p-values. Can be either the full name or initial letters. Available methods are:
`bonferroni` : one-step correction
`sidak` : one-step correction
`holm-sidak` : step-down method using Sidak adjustments
`holm` : step-down method using Bonferroni adjustments
`simes-hochberg` : step-up method (independent)
`hommel` : closed method based on Simes tests (non-negative)
`fdr_bh` : Benjamini/Hochberg (non-negative)
`fdr_by` : Benjamini/Yekutieli (negative)
`fdr_tsbh` : two-stage FDR correction (non-negative)
`fdr_tsbky` : two-stage FDR correction (non-negative)
is_sorted (bool) – If False (default), the p-values will be sorted, but the corrected p-values are in the original order. If True, it is assumed that the p-values are already sorted in ascending order.
- Returns
reject (ndarray, boolean) – True for hypotheses that can be rejected at the given alpha.
pvals_corrected (ndarray) – P-values corrected for multiple tests.
alphacSidak (float) – Corrected alpha for Sidak method.
alphacBonf (float) – Corrected alpha for Bonferroni method.
Notes
This is a wrapper around a function from the statsmodels package.
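For intuition, the simplest of the listed methods can be spelled out by hand (p-values are illustrative):

```python
# One-step Bonferroni: scale each p-value by the number of tests m and
# reject when the raw p-value is below alpha / m.
pvals = [0.01, 0.04, 0.03, 0.20]
alpha = 0.05
m = len(pvals)
pvals_corrected = [min(p * m, 1.0) for p in pvals]
reject = [p <= alpha / m for p in pvals]
```

Only the first hypothesis survives the correction here; the step-down and FDR methods above are less conservative variations on the same idea.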
- limix.stats.empirical_pvalues(xt, x0)[source]
Function to compute empirical p-values.
Compute empirical p-values from the test statistics observed on the data and the null test statistics (from permutations, parametric bootstraps, etc).
- Parameters
xt (array_like) – Test statistics observed on data.
x0 (array_like) – Null test statistics. The minimum p-value that can be estimated is 1./len(x0).
- Returns
pvalues – Estimated empirical p-values.
- Return type
ndarray
Examples
>>> from numpy.random import RandomState
>>> from limix.stats import empirical_pvalues
>>>
>>> random = RandomState(1)
>>> x0 = random.chisquare(1, 5)
>>> x1 = random.chisquare(1, 10000)
>>>
>>> empirical_pvalues(x0, x1)
array([0.56300000, 1.00000000, 0.83900000, 0.79820000, 0.58030000])
Model¶
We also export the underlying inference methods limix uses. Its classes can be accessed via the limix module limix.model or via the package glimix_core itself. Both ways should work identically.
Linear Mixed Models¶
- class limix.model.lmm.LMM(y, X, QS=None, restricted=False)[source]¶
Fast Linear Mixed Models inference via maximum likelihood.
Examples
>>> from numpy import array
>>> from numpy_sugar.linalg import economic_qs_linear
>>> from glimix_core.lmm import LMM
>>>
>>> X = array([[1, 2], [3, -1]], float)
>>> QS = economic_qs_linear(X)
>>> covariates = array([[1], [1]])
>>> y = array([-1, 2], float)
>>> lmm = LMM(y, covariates, QS)
>>> lmm.fit(verbose=False)
>>> print('%.3f' % lmm.lml())
-3.649
One can also specify which parameters should be fitted:
>>> from numpy import array
>>> from numpy_sugar.linalg import economic_qs_linear
>>> from glimix_core.lmm import LMM
>>>
>>> X = array([[1, 2], [3, -1]], float)
>>> QS = economic_qs_linear(X)
>>> covariates = array([[1], [1]])
>>> y = array([-1, 2], float)
>>> lmm = LMM(y, covariates, QS)
>>> lmm.fix('delta')
>>> lmm.fix('scale')
>>> lmm.delta = 0.5
>>> lmm.scale = 1
>>> lmm.fit(verbose=False)
>>> print('%.3f' % lmm.lml())
-3.832
>>> lmm.unfix('delta')
>>> lmm.fit(verbose=False)
>>> print('%.3f' % lmm.lml())
-3.713
Notes
The LMM model can be equivalently written as
𝐲 ∼ 𝓝(𝚇𝜷, 𝑠((1-𝛿)𝙺 + 𝛿𝙸)),
and we thus have v₀ = s (1 - 𝛿) and v₁ = s 𝛿. Consider the economic eigendecomposition of 𝙺:
𝙺 = [𝚀₀ 𝚀₁] [𝚂₀ 𝟎] [𝚀₀ᵀ]
            [𝟎  𝟎] [𝚀₁ᵀ]
and let
𝙳 = [(1-𝛿)𝚂₀ + 𝛿𝙸₀       𝟎 ]
    [      𝟎           𝛿𝙸₁].
In order to eliminate the need of 𝚀₁, note that 𝚀𝚀ᵀ = 𝙸 implies that
𝚀₁𝚀₁ᵀ = 𝙸 - 𝚀₀𝚀₀ᵀ.
We will need to solve ((1-𝛿)𝙺 + 𝛿𝙸)𝐱 = 𝐮 for 𝐱. Let 𝙳₀ = ((1-𝛿)𝚂₀ + 𝛿𝙸₀) and let us define
𝙱 = 𝚀₀𝙳₀⁻¹𝚀₀ᵀ if 𝛿=0, and 𝙱 = 𝚀₀𝙳₀⁻¹𝚀₀ᵀ + 𝛿⁻¹(𝙸 - 𝚀₀𝚀₀ᵀ) if 𝛿>0.
We therefore have
𝐱 = 𝙱𝐮.
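The identity above can be checked numerically. A toy sketch with a rank-deficient K (sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 2))
K = G @ G.T                          # rank-2 covariance, so Q1 is nontrivial

# Economic eigendecomposition: keep only the nonzero eigenpairs.
w, V = np.linalg.eigh(K)
keep = w > 1e-9
Q0, S0 = V[:, keep], w[keep]

delta = 0.3
D0inv = 1.0 / ((1 - delta) * S0 + delta)
B = Q0 @ np.diag(D0inv) @ Q0.T + (np.eye(5) - Q0 @ Q0.T) / delta

# x = B u solves ((1-delta) K + delta I) x = u without ever forming Q1.
u = rng.standard_normal(5)
x = B @ u
```

The point of the construction is that only the nonzero eigenpairs 𝚀₀, 𝚂₀ are ever materialised.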
- property X¶
Covariates matrix.
- Returns
X – Covariates.
- Return type
ndarray
- property beta: numpy.ndarray¶
Fixed-effect sizes.
- Returns
Optimal fixed-effect sizes.
- Return type
effect-sizes
Notes
Setting the derivative of log(p(𝐲)) over effect sizes equal to zero leads to solutions 𝜷 from equation
𝚇ᵀ𝙱𝚇𝜷 = 𝚇ᵀ𝙱𝐲
- property beta_covariance¶
Estimates the covariance-matrix of the optimal beta.
- Returns
beta-covariance – 𝑠(𝚇ᵀ𝙱𝚇)⁻¹.
- Return type
ndarray
References
- property delta: float¶
Variance ratio between K and I.
- fit(verbose=True)[source]¶
Maximise the marginal likelihood.
- Parameters
verbose (bool, optional) – True for progress output; False otherwise. Defaults to True.
- fix(param: str)[source]¶
Disable parameter optimization.
- Parameters
param – Possible values are "delta", "beta", and "scale".
- get_fast_scanner() → glimix_core.lmm._lmm_scan.FastScanner[source]¶
Return FastScanner for association scan.
- Returns
Instance of a class designed to perform very fast association scan.
- Return type
fast-scanner
- lml() → float[source]¶
Log of the marginal likelihood.
- Returns
Log of the marginal likelihood.
- Return type
lml
Notes
The log of the marginal likelihood is given by
2⋅log(p(𝐲)) = -n⋅log(2π) - n⋅log(s) - log|𝙳| - (𝚀₀ᵀ𝐲)ᵀ(s𝙳₀)⁻¹(𝚀₀ᵀ𝐲) - (𝐲)ᵀ(s𝛿)⁻¹(𝐲) + (𝚀₀ᵀ𝐲)ᵀ(s𝛿)⁻¹(𝚀₀ᵀ𝐲) - (𝚀₀ᵀ𝚇𝜷)ᵀ(s𝙳₀)⁻¹(𝚀₀ᵀ𝚇𝜷) - (𝚇𝜷)ᵀ(s𝛿)⁻¹(𝚇𝜷) + (𝚀₀ᵀ𝚇𝜷)ᵀ(s𝛿)⁻¹(𝚀₀ᵀ𝚇𝜷) + 2(𝚀₀ᵀ𝐲)ᵀ(s𝙳₀)⁻¹(𝚇𝜷) + 2(𝐲)ᵀ(s𝛿)⁻¹(𝚇𝜷) - 2(𝚀₀ᵀ𝐲)ᵀ(s𝛿)⁻¹(𝚀₀ᵀ𝚇𝜷)
By using the optimal 𝜷, the log of the marginal likelihood can be rewritten as:
2⋅log(p(𝐲)) = -n⋅log(2π) - n⋅log(s) - log|𝙳| + (𝚀₀ᵀ𝐲)ᵀ(s𝙳₀)⁻¹𝚀₀ᵀ(𝚇𝜷 - 𝐲) + (𝐲)ᵀ(s𝛿)⁻¹(𝚇𝜷 - 𝐲) - (𝚀₀ᵀ𝐲)ᵀ(s𝛿)⁻¹𝚀₀ᵀ(𝚇𝜷 - 𝐲).
In the extreme case where 𝜷 is such that 𝐲 = 𝚇𝜷, the maximum is attained as s→0.
For optimal 𝜷 and s, the log of the marginal likelihood can be further simplified to
2⋅log(p(𝐲; 𝜷, s)) = -n⋅log(2π) - n⋅log s - log|𝙳| - n.
- mean()[source]¶
Mean of the prior.
Formally, 𝐦 = 𝚇𝜷.
- Returns
mean – Mean of the prior.
- Return type
ndarray
- property ncovariates: int¶
Number of covariates.
- property nsamples: int¶
Number of samples.
- property scale: float¶
Scaling factor.
- Returns
scale – Scaling factor.
- Return type
Notes
Setting the derivative of log(p(𝐲; 𝜷)), for which 𝜷 is optimal, over scale equal to zero leads to the maximum
s = n⁻¹(Qᵀ𝐲)ᵀD⁻¹ Qᵀ(𝐲-𝚇𝜷).
In the case of restricted marginal likelihood
s = (n-c)⁻¹(Qᵀ𝐲)ᵀD⁻¹ Qᵀ(𝐲-𝚇𝜷),
where c is the number of covariates.
- unfix(param: str)[source]¶
Enable parameter optimization.
- Parameters
param – Possible values are "delta", "beta", and "scale".
- property v1: float¶
Second variance.
- Returns
s𝛿.
- Return type
v1
- class limix.model.lmm.Kron2Sum(Y, A, X, G, rank=1, restricted=False)[source]¶
LMM for multi-traits fitted via maximum likelihood.
This implementation follows the work published in [CA05]. Let n, c, and p be the number of samples, covariates, and traits, respectively. The outcome variable Y is a n×p matrix distributed according to:
vec(Y) ~ N((A ⊗ X) vec(B), K = C₀ ⊗ GGᵀ + C₁ ⊗ I).
A and X are design matrices of dimensions p×p and n×c provided by the user, where X is the usual matrix of covariates commonly used in single-trait models. B is a c×p matrix of fixed-effect sizes per trait. G is a n×r matrix provided by the user and I is the n×n identity matrix. C₀ and C₁ are both symmetric matrices of dimensions p×p, for which C₁ is guaranteed by our implementation to be of full rank. The parameters of this model are the matrices B, C₀, and C₁.
For implementation purpose, we make use of the following definitions:
𝛃 = vec(B)
M = A ⊗ X
H = MᵀK⁻¹M
Yₓ = LₓY
Yₕ = YₓLₕᵀ
Mₓ = LₓX
Mₕ = (LₕA) ⊗ Mₓ
mₕ = Mₕvec(B)
where Lₓ and Lₕ are defined in glimix_core.cov.Kron2SumCov.
References
- CA05
Casale, F. P., Rakitsch, B., Lippert, C., & Stegle, O. (2015). Efficient set tests for the genetic analysis of correlated traits. Nature methods, 12(8), 755.
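The mean structure can be sanity-checked with the standard vec identity (A ⊗ X) vec(B) = vec(X B Aᵀ). A toy sketch (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, c = 3, 4, 2
A = rng.standard_normal((p, p))
X = rng.standard_normal((n, c))
B = rng.standard_normal((c, p))

# Column-major (Fortran) order matches the vec(.) operator used above.
lhs = np.kron(A, X) @ B.reshape(-1, order="F")   # (A ⊗ X) vec(B)
rhs = (X @ B @ A.T).reshape(-1, order="F")       # vec(X B Aᵀ)
```

The two sides agree, which is why the implementation never needs to build the np Kronecker product explicitly.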
- property A¶
A from the equation 𝐦 = (A ⊗ X) vec(B).
- Returns
A –
- Return type
ndarray
- property B¶
Fixed-effect sizes B from 𝐦 = (A ⊗ X) vec(B).
- Returns
fixed-effects – B from 𝐦 = (A ⊗ X) vec(B).
- Return type
ndarray
- property C0¶
C₀ from equation K = C₀ ⊗ GGᵀ + C₁ ⊗ I.
- Returns
C0 – C₀.
- Return type
ndarray
- property C1¶
C₁ from equation K = C₀ ⊗ GGᵀ + C₁ ⊗ I.
- Returns
C1 – C₁.
- Return type
ndarray
- property M¶
M = (A ⊗ X).
- Returns
M – M from M = (A ⊗ X).
- Return type
ndarray
- property X¶
X from equation M = (A ⊗ X).
- Returns
X – X from M = (A ⊗ X).
- Return type
ndarray
- property beta¶
Fixed-effect sizes 𝛃 = vec(B).
- Returns
fixed-effects – 𝛃 from 𝛃 = vec(B).
- Return type
ndarray
- property beta_covariance¶
Estimates the covariance-matrix of the optimal beta.
- Returns
beta-covariance – (MᵀK⁻¹M)⁻¹.
- Return type
ndarray
References
- fit(verbose=True)[source]¶
Maximise the marginal likelihood.
- Parameters
verbose (bool, optional) – True for progress output; False otherwise. Defaults to True.
- get_fast_scanner()[source]¶
Return FastScanner for association scan.
- Returns
Instance of a class designed to perform very fast association scan.
- Return type
FastScanner
- lml()[source]¶
Log of the marginal likelihood.
Let 𝐲 = vec(Y), M = A⊗X, and H = MᵀK⁻¹M. The restricted log of the marginal likelihood is given by [R07]:
2⋅log(p(𝐲)) = -(n⋅p - c⋅p) log(2π) + log(|MᵀM|) - log(|K|) - log(|H|) - (𝐲-𝐦)ᵀ K⁻¹ (𝐲-𝐦),
where 𝐦 = M𝛃 for 𝛃 = H⁻¹MᵀK⁻¹𝐲.
For implementation purpose, let X = (L₀ ⊗ G) and R = (L₁ ⊗ I)(L₁ ⊗ I)ᵀ. The covariance can be written as:
K = XXᵀ + R.
From the Woodbury matrix identity, we have
𝐲ᵀK⁻¹𝐲 = 𝐲ᵀR⁻¹𝐲 - 𝐲ᵀR⁻¹XZ⁻¹XᵀR⁻¹𝐲,
where Z = I + XᵀR⁻¹X. Note that R⁻¹ = (U₁S₁⁻¹U₁ᵀ) ⊗ I and
XᵀR⁻¹𝐲 = (L₀ᵀW ⊗ Gᵀ)𝐲 = vec(GᵀYWL₀),
where W = U₁S₁⁻¹U₁ᵀ. The term GᵀY can be calculated only once and it will form a r×p matrix. We similarly have
XᵀR⁻¹M = (L₀ᵀWA) ⊗ (GᵀX),
for which GᵀX is pre-computed.
The log-determinant of the covariance matrix is given by
log(|K|) = log(|Z|) - log(|R⁻¹|) = log(|Z|) - 2·n·log(|U₁S₁⁻½|).
The log of the marginal likelihood can be rewritten as:
2⋅log(p(𝐲)) = -(n⋅p - c⋅p) log(2π) + log(|MᵀM|) - log(|Z|) + 2·n·log(|U₁S₁⁻½|) - log(|MᵀR⁻¹M - MᵀR⁻¹XZ⁻¹XᵀR⁻¹M|) - 𝐲ᵀR⁻¹𝐲 + (𝐲ᵀR⁻¹X)Z⁻¹(XᵀR⁻¹𝐲) - 𝐦ᵀR⁻¹𝐦 + (𝐦ᵀR⁻¹X)Z⁻¹(XᵀR⁻¹𝐦) + 2𝐲ᵀR⁻¹𝐦 - 2(𝐲ᵀR⁻¹X)Z⁻¹(XᵀR⁻¹𝐦).
- Returns
lml – Log of the marginal likelihood.
- Return type
References
- R07
LaMotte, L. R. (2007). A direct derivation of the REML likelihood function. Statistical Papers, 48(2), 321-327.
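The Woodbury step in the derivation can be verified on small dense matrices. A sketch (the diagonal R below stands in for (L₁ ⊗ I)(L₁ ⊗ I)ᵀ; names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 6, 2
Xl = rng.standard_normal((n, r))            # low-rank factor: K = Xl Xlᵀ + R
R = np.diag(rng.uniform(0.5, 2.0, size=n))  # full-rank "noise" covariance
K = Xl @ Xl.T + R

# Woodbury: K⁻¹ = R⁻¹ - R⁻¹ Xl Z⁻¹ Xlᵀ R⁻¹ with Z = I + Xlᵀ R⁻¹ Xl.
Rinv = np.linalg.inv(R)
Z = np.eye(r) + Xl.T @ Rinv @ Xl
Kinv = Rinv - Rinv @ Xl @ np.linalg.inv(Z) @ Xl.T @ Rinv
```

Only the small r×r matrix Z is inverted, which is the reason the identity pays off when r ≪ n.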
- property ncovariates¶
Number of covariates, c.
- property nsamples¶
Number of samples, n.
- property ntraits¶
Number of traits, p.
Generalised Linear Mixed Models¶
- class limix.model.glmm.GLMMNormal(eta, tau, X, QS=None)[source]¶
LMM with heterogeneous Normal likelihoods.
We model
\[\tilde{\boldsymbol\mu} \sim \mathcal N(\mathrm X\boldsymbol\beta, v_0 \mathrm K + v_1 \mathrm I + \tilde{\Sigma}),\]
where \(\tilde{\boldsymbol\mu}\) and \(\tilde{\Sigma}\) are typically given by an EP approximation to a non-Normal likelihood (please refer to glimix_core.expfam.GLMMExpFam). If that is the case, \(\tilde{\Sigma}\) is a diagonal matrix with non-negative values. Those EP parameters are given to this class via their natural forms: eta and tau.
- Parameters
eta (array_like) – \([\tilde{\mu}_0\tilde{\sigma}^{-2}_0 \quad \tilde{\mu}_1\tilde{\sigma}^{-2}_1 \quad\cdots\quad \tilde{\mu}_n\tilde{\sigma}^{-2}_n]\).
tau (array_like) – \([\tilde{\sigma}^{-2}_0\quad\tilde{\sigma}^{-2}_1 \quad\cdots\quad \tilde{\sigma}^{-2}_n]\).
X (array_like) – Covariates.
QS (tuple) – Economic eigendecomposition of \(\mathrm K\).
- property beta¶
Fixed-effect sizes.
- Returns
\(\boldsymbol\beta\).
- Return type
- fix(var_name)[source]¶
Prevent a variable from being adjusted.
- Parameters
var_name (str) – Variable name.
- get_fast_scanner()[source]¶
Return glimix_core.lmm.FastScanner for the current delta.
- value()[source]¶
Log of the marginal likelihood.
Formally,
\[- \frac{n}{2}\log{2\pi} - \frac{1}{2} \log{\left| v_0 \mathrm K + v_1 \mathrm I + \tilde{\Sigma} \right|} - \frac{1}{2} \left(\tilde{\boldsymbol\mu} - \mathrm X\boldsymbol\beta\right)^{\intercal} \left( v_0 \mathrm K + v_1 \mathrm I + \tilde{\Sigma} \right)^{-1} \left(\tilde{\boldsymbol\mu} - \mathrm X\boldsymbol\beta\right)\]- Returns
\(\log{p(\tilde{\boldsymbol\mu})}\)
- Return type
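As an illustration of the quantity that value() computes (a plain-NumPy sketch, not the class API; all variable names and numbers below are made up), the marginal likelihood of the pseudo-observations under N(Xβ, v₀K + v₁I + Σ̃) can be evaluated directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 6, 2

# Hypothetical inputs: covariates, kinship, and EP site parameters.
X = rng.standard_normal((n, c))
G = rng.standard_normal((n, 4))
K = G @ G.T
tau = rng.uniform(0.5, 2.0, n)        # site precisions, σ̃ᵢ⁻²
eta = rng.standard_normal(n) * tau    # natural parameters, μ̃ᵢσ̃ᵢ⁻²
mu_tilde = eta / tau                  # pseudo-observations μ̃
Sigma_tilde = np.diag(1.0 / tau)      # diagonal Σ̃
v0, v1 = 0.7, 0.3                     # variance parameters
beta = rng.standard_normal(c)         # fixed-effect sizes

# log p(μ̃) = -n/2·log(2π) - 1/2·log|C| - 1/2·(μ̃ - Xβ)ᵀC⁻¹(μ̃ - Xβ),
# with C = v₀K + v₁I + Σ̃.
C = v0 * K + v1 * np.eye(n) + Sigma_tilde
r = mu_tilde - X @ beta
_, logdet = np.linalg.slogdet(C)
lml = -0.5 * (n * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(C, r))
```

The actual class additionally optimises β and the variances; the sketch only evaluates the formula at fixed parameters.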
- class limix.model.glmm.GLMMExpFam(y, lik, X, QS=None, n_int=1000, rtol=1.49e-05, atol=1.49e-08)[source]¶
Generalised Linear Gaussian Processes implementation.
It implements inference over GLMMs via the Expectation Propagation [Min01] algorithm. It currently supports the "Bernoulli", "Probit", "Binomial", and "Poisson" likelihoods. (For the heterogeneous Normal likelihood, please refer to glimix_core.glmm.GLMMNormal for closed-form inference.)
- Parameters
y (array_like) – Outcome variable.
lik (tuple) – Likelihood definition. The first item is one of the following likelihood names: "Bernoulli", "Binomial", "Normal", and "Poisson". For Binomial, the second item is an array of outcomes.
X (array_like) – Covariates.
QS (tuple) – Economic eigendecomposition.
Example
>>> from numpy import dot, sqrt, zeros
>>> from numpy.random import RandomState
>>>
>>> from numpy_sugar.linalg import economic_qs
>>>
>>> from glimix_core.glmm import GLMMExpFam
>>>
>>> random = RandomState(0)
>>> nsamples = 10
>>>
>>> X = random.randn(nsamples, 2)
>>> G = random.randn(nsamples, 100)
>>> K = dot(G, G.T)
>>> ntrials = random.randint(1, 100, nsamples)
>>> z = dot(G, random.randn(100)) / sqrt(100)
>>>
>>> successes = zeros(len(ntrials), int)
>>> for i in range(len(ntrials)):
...     successes[i] = sum(z[i] + 0.2 * random.randn(ntrials[i]) > 0)
>>>
>>> QS = economic_qs(K)
>>>
>>> glmm = GLMMExpFam(successes, ('binomial', ntrials), X, QS)
>>> print('Before: %.2f' % glmm.lml())
Before: -16.40
>>> glmm.fit(verbose=False)
>>> print('After: %.2f' % glmm.lml())
After: -13.43
- property beta¶
Fixed-effect sizes.
- Returns
\(\boldsymbol\beta\).
- Return type
- covariance()[source]¶
Covariance of the prior.
- Returns
\(v_0 \mathrm K + v_1 \mathrm I\).
- Return type
- fit(verbose=True, factr=100000.0, pgtol=1e-07)[source]¶
Maximise the marginal likelihood.
- Parameters
verbose (bool) – True for progress output; False otherwise. Defaults to True.
factr (float, optional) – The iteration stops when (f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= factr * eps, where eps is the machine precision.
pgtol (float, optional) – The iteration will stop when max{|proj g_i | i = 1, ..., n} <= pgtol, where pg_i is the i-th component of the projected gradient.
Notes
Please refer to scipy.optimize.fmin_l_bfgs_b() for further information about factr and pgtol.
- fix(var_name)[source]¶
Prevent a variable from being adjusted.
- Parameters
var_name (str) – Variable name.
- gradient()[source]¶
Gradient of the log of the marginal likelihood.
- Returns
Map between variables to their gradient values.
- Return type
Gaussian Process¶
- class limix.model.gp.GP(y, mean, cov)[source]¶
Gaussian Process inference via maximum likelihood.
- Parameters
y (array_like) – Outcome variable.
mean (function) – Mean function. (Refer to Mean functions.)
cov (function) – Covariance function. (Refer to Covariance functions.)
Example
>>> from numpy.random import RandomState
>>>
>>> from glimix_core.example import offset_mean
>>> from glimix_core.example import linear_eye_cov
>>> from glimix_core.gp import GP
>>> from glimix_core.random import GPSampler
>>>
>>> random = RandomState(94584)
>>>
>>> mean = offset_mean()
>>> cov = linear_eye_cov()
>>>
>>> y = GPSampler(mean, cov).sample(random)
>>>
>>> gp = GP(y, mean, cov)
>>> print('Before: %.4f' % gp.lml())
Before: -15.5582
>>> gp.fit(verbose=False)
>>> print('After: %.4f' % gp.lml())
After: -13.4791
>>> print(gp)
GP(...)
  lml: -13.47907874997517
  OffsetMean(): OffsetMean
    offset: 0.7755803668772308
  SumCov(covariances=...): SumCov
    LinearCov(): LinearCov
      scale: 2.061153622438558e-09
    EyeCov(dim=10): EyeCov
      scale: 0.8675680523425126
- fit(verbose=True, factr=100000.0, pgtol=1e-07)[source]¶
Maximise the marginal likelihood.
- Parameters
verbose (bool) – True for progress output; False otherwise. Defaults to True.
factr (float, optional) – The iteration stops when (f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= factr * eps, where eps is the machine precision.
pgtol (float, optional) – The iteration will stop when max{|proj g_i | i = 1, ..., n} <= pgtol, where pg_i is the i-th component of the projected gradient.
Notes
Please refer to scipy.optimize.fmin_l_bfgs_b() for further information about factr and pgtol.
Generalised Gaussian Process¶
- class limix.model.ggp.ExpFamGP(y, lik, mean, cov)[source]¶
Expectation Propagation for Generalised Gaussian Processes.
- Parameters
y (array_like) – Outcome variable.
lik_name (str) – Likelihood name.
mean (function) – Mean function. (Refer to Mean functions.)
cov (function) – Covariance function. (Refer to Covariance functions.)
Example
>>> from numpy.random import RandomState
>>>
>>> from glimix_core.example import offset_mean
>>> from glimix_core.example import linear_eye_cov
>>> from glimix_core.ggp import ExpFamGP
>>> from glimix_core.lik import BernoulliProdLik
>>> from glimix_core.link import LogitLink
>>> from glimix_core.random import GGPSampler
>>>
>>> random = RandomState(1)
>>>
>>> lik = BernoulliProdLik(LogitLink())
>>> mean = offset_mean()
>>> cov = linear_eye_cov()
>>>
>>> y = GGPSampler(lik, mean, cov).sample(random)
>>>
>>> ggp = ExpFamGP(y, 'bernoulli', mean, cov)
>>> print('Before: %.4f' % ggp.lml())
Before: -6.9802
>>> ggp.fit(verbose=False)
>>> print('After: %.2f' % ggp.lml())
After: -6.55
- fit(verbose=True, factr=100000.0, pgtol=1e-07)[source]¶
Maximise the marginal likelihood.
- Parameters
verbose (bool) – True for progress output; False otherwise. Defaults to True.
factr (float, optional) – The iteration stops when (f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= factr * eps, where eps is the machine precision.
pgtol (float, optional) – The iteration will stop when max{|proj g_i | i = 1, ..., n} <= pgtol, where pg_i is the i-th component of the projected gradient.
Notes
Please refer to scipy.optimize.fmin_l_bfgs_b() for further information about factr and pgtol.
Covariance¶
- class limix.model.cov.EyeCov(dim)[source]¶
Identity covariance function, K = s·I.
The parameter s is the scale of the matrix.
Example
>>> from glimix_core.cov import EyeCov
>>>
>>> cov = EyeCov(2)
>>> cov.scale = 2.5
>>> print(cov.value())
[[2.5 0. ]
 [0. 2.5]]
>>> g = cov.gradient()
>>> print(g['logscale'])
[[2.5 0. ]
 [0. 2.5]]
>>> cov.name = "I"
>>> print(cov)
EyeCov(dim=2): I
  scale: 2.5
- Parameters
dim (int) – Matrix dimension, d.
- property dim¶
Dimension of the matrix, d.
It corresponds to the number of rows and to the number of columns.
- gradient()[source]¶
Derivative of the covariance matrix over log(s), s⋅I.
- Returns
logscale – s⋅I, for scale s and a d×d identity matrix I.
- Return type
ndarray
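The gradient is taken over log(s) rather than s, and by the chain rule ∂(sI)/∂log s = s⋅I, since s = exp(log s). A quick finite-difference sketch (plain NumPy, independent of the class):

```python
import numpy as np

d = 3
logscale = 0.4
s = np.exp(logscale)

def value(ls):
    # Covariance K = s·I with s = exp(logscale).
    return np.exp(ls) * np.eye(d)

# Forward finite difference over log(s) matches the analytic s·I.
eps = 1e-6
numeric = (value(logscale + eps) - value(logscale)) / eps
analytic = s * np.eye(d)
assert np.allclose(numeric, analytic, atol=1e-5)
```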
- property scale¶
Scale parameter.
- class limix.model.cov.FreeFormCov(dim)[source]¶
General positive definite matrix, K = LLᵀ + ϵI.
A d×d covariance matrix K will have ((d+1)⋅d)/2 parameters defining the lower triangular elements of a Cholesky matrix L such that:
K = LLᵀ + ϵI,
for a very small positive number ϵ. That additional term is necessary to avoid singular and ill-conditioned covariance matrices.
Example
>>> from glimix_core.cov import FreeFormCov
>>>
>>> cov = FreeFormCov(2)
>>> cov.L = [[1., .0], [0.5, 2.]]
>>> print(cov.gradient()["Lu"])
[[[0. 2. 0. ]
  [1. 0.5 0. ]]
 [[1. 0.5 0. ]
  [1. 0. 8. ]]]
>>> cov.name = "K"
>>> print(cov)
FreeFormCov(dim=2): K
  L: [[1. 0. ]
      [0.5 2. ]]
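A plain-NumPy sketch of why the ϵI term matters: even when L is singular, K = LLᵀ + ϵI stays positive definite (the ϵ below is an arbitrary illustrative value, not the one used internally):

```python
import numpy as np

d = 3
L = np.zeros((d, d))      # worst case: LLᵀ is singular
eps = 1e-9                # illustrative jitter, not the library's internal value
K = L @ L.T + eps * np.eye(d)

# The jitter keeps the smallest eigenvalue strictly positive.
assert np.linalg.eigvalsh(K).min() > 0

# A d×d lower triangle has ((d+1)·d)/2 free parameters.
assert (d + 1) * d // 2 == 6
```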
- property L¶
Lower-triangular matrix L such that K = LLᵀ + ϵI.
- Returns
L – Lower-triangular matrix.
- Return type
(d, d) ndarray
- property Lu¶
Lower-triangular, flat part of L.
- eigh()[source]¶
Eigen decomposition of K.
- Returns
S (ndarray) – The eigenvalues in ascending order, each repeated according to its multiplicity.
U (ndarray) – Normalized eigenvectors.
- gradient()[source]¶
Derivative of the covariance matrix over the parameters of L.
- Returns
Lu – Derivative of K over the lower triangular part of L.
- Return type
ndarray
- listen(func)[source]¶
Listen to parameter changes.
- Parameters
func (callable) – Function to be called when a parameter changes.
- property nparams¶
Number of parameters.
- property shape¶
Array shape.
- class limix.model.cov.GivenCov(K0)[source]¶
Given covariance function, K = s⋅K₀.
The covariance matrix is the provided matrix K₀ scaled by s: K = s⋅K₀.
Example
>>> from glimix_core.cov import GivenCov
>>> from numpy import dot
>>> from numpy.random import RandomState
>>>
>>> G = RandomState(0).randn(5, 3)
>>> K0 = dot(G, G.T)
>>> cov = GivenCov(K0)
>>> cov.scale = 1.3
>>> cov.name = "K"
>>> print(cov)
GivenCov(K0=...): K
  scale: 1.3
- gradient()[source]¶
Derivative of the covariance matrix over log(s).
- Returns
logscale – s⋅K₀.
- Return type
- property scale¶
Scale parameter, s.
- class limix.model.cov.LinearCov(X)[source]¶
Linear covariance function, K = s⋅XXᵀ.
The mathematical representation is s⋅XXᵀ, for an n×r matrix X provided by the user and a scaling parameter s.
Example
>>> from glimix_core.cov import LinearCov
>>> from numpy import dot
>>> from numpy.random import RandomState
>>>
>>> X = RandomState(0).randn(2, 3)
>>> cov = LinearCov(X)
>>> cov.scale = 1.3
>>> cov.name = "K"
>>> print(cov)
LinearCov(): K
  scale: 1.3
- property X¶
Matrix X from K = s⋅XXᵀ.
- gradient()[source]¶
Derivative of the covariance matrix over log(s).
- Returns
logscale – s⋅XXᵀ.
- Return type
ndarray
- property scale¶
Scale parameter.
- class limix.model.cov.SumCov(covariances)[source]¶
Sum of covariance functions, K = K₀ + K₁ + ⋯.
Example
>>> from glimix_core.cov import LinearCov, SumCov
>>> from numpy.random import RandomState
>>>
>>> random = RandomState(0)
>>> cov_left = LinearCov(random.randn(4, 20))
>>> cov_right = LinearCov(random.randn(4, 15))
>>> cov_left.scale = 0.5
>>>
>>> cov = SumCov([cov_left, cov_right])
>>> cov_left.name = "A"
>>> cov_right.name = "B"
>>> cov.name = "A+B"
>>> print(cov)
SumCov(covariances=...): A+B
  LinearCov(): A
    scale: 0.5
  LinearCov(): B
    scale: 1.0
- class limix.model.cov.LRFreeFormCov(n, m)[source]¶
General positive semi-definite matrix of low rank, K = LLᵀ.
The covariance matrix K is given by LLᵀ, where L is a n×m matrix and n≥m. Therefore, K will have rank(K) ≤ m.
Example
>>> from glimix_core.cov import LRFreeFormCov
>>> cov = LRFreeFormCov(3, 2)
>>> print(cov.L)
[[1. 1.]
 [1. 1.]
 [1. 1.]]
>>> cov.L = [[1, 2], [0, 3], [1, 3]]
>>> print(cov.L)
[[1. 2.]
 [0. 3.]
 [1. 3.]]
>>> cov.name = "F"
>>> print(cov)
LRFreeFormCov(n=3, m=2): F
  L: [[1. 2.]
      [0. 3.]
      [1. 3.]]
- property L¶
Matrix L from K = LLᵀ.
- Returns
L – Parametric matrix.
- Return type
(n, m) ndarray
- property Lu¶
Lower-triangular, flat part of L.
- gradient()[source]¶
Derivative of the covariance matrix over the lower triangular, flat part of L.
It is equal to
∂K/∂Lᵢⱼ = AᵢⱼLᵀ + LAᵢⱼᵀ,
where Aᵢⱼ is an n×m matrix whose entries are all zero, except for the (i, j) entry, which equals one.
- Returns
Lu – Derivative of K over the lower-triangular, flat part of L.
- Return type
ndarray
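The derivative formula can be checked numerically with plain NumPy (a sketch, independent of the class; dimensions and indices are arbitrary):

```python
import numpy as np

n, m = 3, 2
rng = np.random.default_rng(1)
L = rng.standard_normal((n, m))

i, j = 1, 0
A = np.zeros((n, m))
A[i, j] = 1.0                    # the matrix Aᵢⱼ from the formula
analytic = A @ L.T + L @ A.T     # ∂K/∂Lᵢⱼ = AᵢⱼLᵀ + LAᵢⱼᵀ

# Forward finite difference of K = LLᵀ over the (i, j) entry of L.
eps = 1e-6
Lp = L.copy()
Lp[i, j] += eps
numeric = (Lp @ Lp.T - L @ L.T) / eps
assert np.allclose(analytic, numeric, atol=1e-5)
```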
- listen(func)[source]¶
Listen to parameter changes.
- Parameters
func (callable) – Function to be called when a parameter changes.
- property nparams¶
Number of parameters.
- property shape¶
Array shape.
- class limix.model.cov.Kron2SumCov(G, dim, rank)[source]¶
Implements K = C₀ ⊗ GGᵀ + C₁ ⊗ I.
C₀ and C₁ are d×d symmetric matrices. C₀ is positive semi-definite while C₁ is positive definite. G is an n×m matrix and I is an n×n identity matrix. Let M = Uₘ Sₘ Uₘᵀ be the eigendecomposition of any matrix M. The documentation and implementation of this class make use of the following definitions:
X = GGᵀ = Uₓ Sₓ Uₓᵀ
C₁ = U₁ S₁ U₁ᵀ
Cₕ = S₁⁻½ U₁ᵀ C₀ U₁ S₁⁻½
Cₕ = Uₕ Sₕ Uₕᵀ
D = (Sₕ ⊗ Sₓ + Iₕₓ)⁻¹
Lₓ = Uₓᵀ
Lₕ = Uₕᵀ S₁⁻½ U₁ᵀ
L = Lₕ ⊗ Lₓ
The above definitions allow us to write the inverse of the covariance matrix as:
K⁻¹ = LᵀDL,
where D is a diagonal matrix.
Example
>>> from numpy import array
>>> from glimix_core.cov import Kron2SumCov
>>>
>>> G = array([[-1.5, 1.0], [-1.5, 1.0], [-1.5, 1.0]])
>>> Lr = array([[3], [2]], float)
>>> Ln = array([[1, 0], [2, 1]], float)
>>>
>>> cov = Kron2SumCov(G, 2, 1)
>>> cov.C0.L = Lr
>>> cov.C1.L = Ln
>>> print(cov)
Kron2SumCov(G=..., dim=2, rank=1): Kron2SumCov
  LRFreeFormCov(n=2, m=1): C₀
    L: [[3.]
        [2.]]
  FreeFormCov(dim=2): C₁
    L: [[1. 0.]
        [2. 1.]]
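The identities K⁻¹ = LᵀDL and log|K| = -log|D| + n⋅log|C₁| can be verified with plain NumPy, following the definitions above (a sketch that does not use the class itself; all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 4, 2, 2
G = rng.standard_normal((n, m))
A0 = rng.standard_normal((d, 1))
C0 = A0 @ A0.T                                        # rank-1, positive semi-definite
A1 = np.tril(rng.standard_normal((d, d)), k=-1) + 2 * np.eye(d)
C1 = A1 @ A1.T                                        # positive definite

K = np.kron(C0, G @ G.T) + np.kron(C1, np.eye(n))     # C₀⊗GGᵀ + C₁⊗I

# Build Lₓ, Lₕ, and D from the eigendecompositions.
Sx, Ux = np.linalg.eigh(G @ G.T)                      # X = GGᵀ = Uₓ Sₓ Uₓᵀ
S1, U1 = np.linalg.eigh(C1)                           # C₁ = U₁ S₁ U₁ᵀ
W = U1 / np.sqrt(S1)                                  # U₁ S₁⁻½
Ch = W.T @ C0 @ W                                     # Cₕ = S₁⁻½ U₁ᵀ C₀ U₁ S₁⁻½
Sh, Uh = np.linalg.eigh(Ch)                           # Cₕ = Uₕ Sₕ Uₕᵀ
Lx = Ux.T
Lh = Uh.T @ W.T                                       # Lₕ = Uₕᵀ S₁⁻½ U₁ᵀ
L = np.kron(Lh, Lx)
D = 1.0 / (np.kron(Sh, Sx) + 1.0)                     # diagonal of (Sₕ⊗Sₓ + Iₕₓ)⁻¹

# K⁻¹ = LᵀDL and log|K| = -log|D| + n·log|C₁|.
assert np.allclose(np.linalg.inv(K), L.T @ (D[:, None] * L))
logdet = -np.log(D).sum() + n * np.linalg.slogdet(C1)[1]
assert np.isclose(np.linalg.slogdet(K)[1], logdet)
```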
- property C0¶
Positive semi-definite matrix C₀.
- property C1¶
Positive definite matrix C₁.
- property D¶
(Sₕ ⊗ Sₓ + Iₕₓ)⁻¹.
- property G¶
User-provided matrix G, n×m.
- Ge¶
Result of US from the SVD G = USVᵀ.
- LdKL_dot(v)[source]¶
Implements L(∂K)Lᵀv.
The array v can have one or two dimensions and the first dimension has to have size n⋅p.
Let vec(V) = v. We have
L(∂K)Lᵀ⋅v = ((Lₕ∂C₀Lₕᵀ) ⊗ (LₓGGᵀLₓᵀ))vec(V) = vec(LₓGGᵀLₓᵀVLₕ∂C₀Lₕᵀ),
when the derivative is over the parameters of C₀. Similarly,
L(∂K)Lᵀv = ((Lₕ∂C₁Lₕᵀ) ⊗ (LₓLₓᵀ))vec(V) = vec(LₓLₓᵀVLₕ∂C₁Lₕᵀ),
over the parameters of C₁.
- property Lh¶
Lₕ.
- property Lx¶
Lₓ.
- gradient()[source]¶
Gradient of K.
- Returns
C0 (ndarray) – Derivative of C₀ over its parameters.
C1 (ndarray) – Derivative of C₁ over its parameters.
- gradient_dot(v)[source]¶
Implements ∂K⋅v.
- Parameters
v (array_like) – Vector from ∂K⋅v.
- Returns
C0.Lu (ndarray) – ∂K⋅v, where the gradient is taken over the C₀ parameters.
C1.Lu (ndarray) – ∂K⋅v, where the gradient is taken over the C₁ parameters.
- listen(func)[source]¶
Listen to parameter changes.
- Parameters
func (callable) – Function to be called when a parameter changes.
- logdet()[source]¶
Implements log|K| = - log|D| + n⋅log|C₁|.
- Returns
logdet – Log-determinant of K.
- Return type
- logdet_gradient()[source]¶
Implements ∂log|K| = Tr[K⁻¹∂K].
It can be shown that:
∂log|K| = diag(D)ᵀdiag(L(∂K)Lᵀ) = diag(D)ᵀ(diag(Lₕ∂C₀Lₕᵀ)⊗diag(LₓGGᵀLₓᵀ)),
when the derivative is over the parameters of C₀. Similarly,
∂log|K| = diag(D)ᵀdiag(L(∂K)Lᵀ) = diag(D)ᵀ(diag(Lₕ∂C₁Lₕᵀ)⊗diag(I)),
over the parameters of C₁.
- Returns
C0 (ndarray) – Derivative of C₀ over its parameters.
C1 (ndarray) – Derivative of C₁ over its parameters.
- property nparams¶
Number of parameters.
Mean¶
- class limix.model.mean.OffsetMean(n)[source]¶
Offset mean function, θ⋅𝟏.
It represents a mean vector 𝐦 = θ⋅𝟏 of size n. The offset is given by the parameter θ.
Example
>>> from glimix_core.mean import OffsetMean
>>>
>>> mean = OffsetMean(3)
>>> mean.offset = 2.0
>>> print(mean.value())
[2. 2. 2.]
>>> print(mean.gradient())
{'offset': array([1., 1., 1.])}
>>> mean.name = "𝐦"
>>> print(mean)
OffsetMean(): 𝐦
  offset: 2.0
- gradient()[source]¶
Gradient of the offset function.
- Returns
offset – Vector 𝟏.
- Return type
(n,) ndarray
- property offset¶
Offset parameter.
- class limix.model.mean.LinearMean(X)[source]¶
Linear mean function, X𝜶.
It defines X𝜶, where X is an n×m matrix provided by the user and 𝜶 is a vector of size m.
Example
>>> from glimix_core.mean import LinearMean
>>>
>>> mean = LinearMean([[1.5, 0.2], [0.5, 0.4]])
>>> mean.effsizes = [1.0, -1.0]
>>> print(mean.value())
[1.3 0.1]
>>> print(mean.gradient()["effsizes"])
[[1.5 0.2]
 [0.5 0.4]]
>>> mean.name = "𝐦"
>>> print(mean)
LinearMean(m=2): 𝐦
  effsizes: [ 1. -1.]
- property X¶
An n×m matrix of covariates.
- property effsizes¶
Effect-sizes parameter, 𝜶, of size m.
- class limix.model.mean.SumMean(means)[source]¶
Sum mean function, 𝐟₀ + 𝐟₁ + ….
The mathematical representation is
𝐦 = 𝐟₀ + 𝐟₁ + …
In other words, it is a sum of mean vectors.
Example
>>> from glimix_core.mean import OffsetMean, LinearMean, SumMean
>>>
>>> X = [[5.1, 1.0],
...      [2.1, -0.2]]
>>>
>>> mean0 = LinearMean(X)
>>> mean0.effsizes = [-1.0, 0.5]
>>>
>>> mean1 = OffsetMean(2)
>>> mean1.offset = 2.0
>>>
>>> mean = SumMean([mean0, mean1])
>>>
>>> print(mean.value())
[-2.6 -0.2]
>>> g = mean.gradient()
>>> print(g["SumMean[0].effsizes"])
[[ 5.1 1. ]
 [ 2.1 -0.2]]
>>> print(g["SumMean[1].offset"])
[1. 1.]
>>> mean0.name = "A"
>>> mean1.name = "B"
>>> mean.name = "A+B"
>>> print(mean)
SumMean(means=...): A+B
  LinearMean(m=2): A
    effsizes: [-1. 0.5]
  OffsetMean(): B
    offset: 2.0
- class limix.model.mean.KronMean(A, X)[source]¶
Kronecker mean function, (A⊗X)vec(B).
Let
n be the number of samples;
p the number of traits; and
c the number of covariates.
The mathematical representation is
𝐦 = (A⊗X)vec(B)
where A is a p×p trait design matrix of fixed effects and X is an n×c sample design matrix of fixed effects. B is a c×p matrix of fixed-effect sizes.
- property A¶
Matrix A.
- property AX¶
A ⊗ X.
- property B¶
Effect-sizes parameter, B.
- property X¶
Matrix X.
- gradient()[source]¶
Gradient of the linear mean function.
- Returns
vecB – Derivative of M over vec(B).
- Return type
ndarray
- property nparams¶
Number of parameters.
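The Kronecker product never needs to be formed explicitly: by the standard vec identity, (A⊗X)vec(B) = vec(XBAᵀ). A plain-NumPy sketch, using column-major vec (order="F"); all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, c = 4, 2, 3                  # samples, traits, covariates
A = rng.standard_normal((p, p))    # trait design matrix
X = rng.standard_normal((n, c))    # sample design matrix
B = rng.standard_normal((c, p))    # fixed-effect sizes

vecB = B.reshape(-1, order="F")                   # column-stacking vec(B)
m_direct = np.kron(A, X) @ vecB                   # 𝐦 = (A⊗X)vec(B), size n·p
m_identity = (X @ B @ A.T).reshape(-1, order="F") # same vector, no Kronecker
assert np.allclose(m_direct, m_identity)
```

The identity reduces the cost from O(n·p·c·p) for the explicit product to two small matrix multiplications.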
API reference¶
I/O module¶
Read a given BGEN file.
Read a BIMBAM phenotype file.
Read a CSV file.
Read GEN files into Pandas data frames.
Read the HDF5 limix file format.
Read NumPy arrays saved in a file.
Read PLINK files into Pandas data frames.
Quality control¶
Box-Cox transformation for normality conformance.
Compute minor allele frequencies.
Count the number of missing values per column.
Determine pair-wise independent variants.
Impute
Zero-mean and one-deviation normalisation.
Variance rescaling of covariance matrix 𝙺.
Normalize a sequence of values via rank and Normal c.d.f.
Remove dependent columns.
Filter out variants with the same genetic profile.
Statistics¶
Mixture of 𝜒² distributions.
Allele expectation.
Compute allele frequency from its expectation.
Compute dosage from allele expectation.
Provide a couple of scores based on the idea of windows around genetic markers.
Function to compute empirical p-values.
Estimate Kinship matrix via linear kernel.
Compute p-values from likelihood ratios.
Test results and p-value correction for multiple tests.
Draw random samples from a multivariate normal distribution.
Principal component analysis.
Heritability estimation¶
Estimate the so-called narrow-sense heritability.
Variance decomposition¶
Variance decomposition through GLMMs.
Quantitative trait loci¶
Multi-trait association and interaction testing via linear mixed models.
Single-trait association with interaction test via generalized linear mixed models.
Plotting & Graphics¶
Change to box aspect considering the plotted points.
Consolidate multiple curves in a single one.
Show an image.
Plot heatmap of a kinship matrix.
Example datasets.
Produce a manhattan plot.
Plot a fit of a normal distribution to the data in x.
Plot the first two principal components of a design matrix.
Plot number of hits across significance levels.
Quantile-Quantile plot of observed p-values versus theoretical ones.
Show an image.
Get
Shell utilities¶
Compute sha256 from a given file.
Download file.
Extract a compressed file.
Remove file.
Comments and bugs¶
You can get the source and open issues on GitHub.