Limix’s documentation

Install

Pip

pip install limix

Conda

conda install -c conda-forge limix

I/O module

BIMBAM

>>> import limix
>>>
>>> url = "http://rest.s3for.me/limix/example/phenotype.gemma"
>>> filepath = limix.sh.download(url, verbose=False)
>>> print(limix.io.bimbam.read_phenotype(filepath, verbose=False))
trait         0        1        2
sample
0       1.20000 -0.30000 -1.50000
1           nan  1.50000  0.30000
2       2.70000  1.10000      nan
3      -0.20000 -0.70000  0.80000
4       3.30000  2.40000  2.10000
>>> limix.sh.remove(filepath)

BGEN reader

>>> url = "http://rest.s3for.me/bgen-reader/haplotypes.bgen"
>>> filepath = limix.sh.download(url, verbose=False)
>>>
>>> data = limix.io.bgen.read(filepath, verbose=False)
>>> print(data.keys())
dict_keys(['variants', 'samples', 'genotype'])
>>> print(data["variants"].head(4))
     id rsid chrom  pos  nalleles allele_ids  vaddr
0  SNP1  RS1     1    1         2        A,G    102
1  SNP2  RS2     1    2         2        A,G    159
2  SNP3  RS3     1    3         2        A,G    216
3  SNP4  RS4     1    4         2        A,G    273
>>> print(data["samples"])
0    sample_0
1    sample_1
2    sample_2
3    sample_3
Name: id, dtype: object
>>> geno = data["genotype"][0].compute()
>>> print(geno.keys())
dict_keys(['probs', 'phased', 'ploidy', 'missing'])
>>> print(geno["probs"])
[[1. 0. 1. 0.]
 [0. 1. 1. 0.]
 [1. 0. 0. 1.]
 [0. 1. 0. 1.]]
>>> limix.sh.remove(filepath)
>>> limix.sh.remove(filepath + ".metadata")

CSV reader

>>> url = "http://rest.s3for.me/limix/expr.csv"
>>> filepath = limix.sh.download(url, verbose=False)
>>> data = limix.io.csv.read(filepath, verbose=False)
>>> print(data.head())
       HG00111  HG00112  HG00116  HG00121  HG00133  HG00135  HG00142  HG00143  \
gene1 -3.75235 -0.42113 -0.53629 -0.90768 -0.25189 -0.60300 -0.31069  0.28849
gene2 -0.35145  1.28258 -2.83505  0.32953 -0.50711 -0.81912 -1.37971  0.26906
gene3 -1.31997  1.08197  0.28400 -0.41318  0.14609 -0.14714  0.30255  0.69654
gene4 -0.75163  0.37668 -0.23564  1.06111  0.58524  0.60962 -2.02384 -1.29969
gene5  0.06464  0.64204 -0.81127 -1.42806 -0.89599 -0.01391  0.34385 -0.48492

       HG00151  HG00152  HG00154  HG00159  HG00160  HG00171  HG00173  HG00179  \
gene1 -1.72944 -1.69063  0.23706 -0.70690  0.31554  0.41398  1.53933  1.64413
gene2 -0.98741 -1.38205 -1.49273  1.43818  0.60203  0.72495  1.33730 -0.54032
gene3  0.92997 -0.97183  0.73781  0.13841 -0.27796 -0.30850 -1.37364  0.02908
gene4  0.74860  0.35024  0.46494  0.26519 -1.04980 -0.10405  0.24636  0.39698
gene5 -1.56242 -0.69343 -0.67140 -0.97220  0.51523  0.84428  0.37633  0.20097

       HG00189  HG00190  HG00232  HG00233  HG00239  HG00245  HG00253  HG00257  \
gene1  0.65387  0.68840 -0.31933 -0.05787  0.18941 -0.96866  0.95566  0.09096
gene2  0.23202  0.62984  0.36502 -0.54166 -0.17954  1.56841  0.41138  0.85125
gene3  0.38357  1.53011  1.03765  1.66079  0.47459 -0.61217 -0.75705 -1.23464
gene4  0.22995 -0.74410  0.13555  1.11415 -0.27491 -0.63577  0.65697 -0.78026
gene5  0.22179 -0.88524 -0.90036 -0.30308 -0.47617 -0.65303  0.31526  0.21940

       HG00263  HG00274  HG00281  HG00284  HG00309  HG00318  HG00319  HG00330  \
gene1 -0.84906 -0.38422 -0.20839  1.76988 -0.08126 -0.11885  0.06195 -0.39166
gene2 -0.68103  1.49606  0.03634  0.30436 -0.81669 -0.41680 -0.68541  1.05676
gene3 -0.39850 -0.04262  0.58753 -1.79663  0.06896  0.26583 -0.09755  1.30798
gene4 -0.03511  1.17201 -0.60239 -0.04585  1.52125 -0.01333 -1.03177 -1.04258
gene5 -1.44331  1.77727  0.03294 -0.90860  2.01101  1.30201  1.47212 -1.02498

       HG00331  HG00332  HG00343  HG00344  HG00351  HG00357  HG00369  ...  NA19819  \
gene1  1.19368 -2.60319 -0.00044  0.26917  0.35906  0.32333 -0.58944  ...  1.14834
gene2 -1.35361 -1.17153 -0.75354  1.08028  0.30128  1.19826  2.25417  ...  0.70334
gene3  0.84178  0.28865 -1.17596 -0.16204  0.24943  1.08654 -0.82590  ... -2.01485
gene4  0.63385 -1.60319  0.49055  0.53223 -0.00883 -1.39597 -0.24307  ...  0.09122
gene5 -0.48581 -1.59411 -1.25390 -0.78523  2.15585  1.15443  0.16866  ...  0.64129

       NA19834  NA19908  NA19920  NA20127  NA20278  NA20281  NA20287  NA20291  \
gene1 -0.33572 -0.48395  0.28840 -0.19132  1.46027 -0.49558 -0.05913 -0.67027
gene2 -1.12835 -1.56760 -1.53615  1.30383 -0.83380  0.08954  1.35263 -0.76398
gene3 -0.65052  1.78800  0.54407 -0.19086 -0.54339  0.00340  0.93919  2.35539
gene4 -1.28366  0.49670  0.51998  0.01628  0.65825  1.42326  0.75003 -0.46688
gene5 -0.38025 -0.99212  0.86387 -1.30035  0.36494 -0.74369  0.97632 -1.41972

       NA20314  NA20334  NA20339  NA20341  NA20344  NA20348  NA20357  NA20359  \
gene1  1.14834  0.03650 -0.39326  0.28873 -0.82685 -0.33570  0.35864 -0.88629
gene2 -0.73860  0.97905 -0.03794  0.52208  0.15237  0.05513 -0.43515  1.78638
gene3 -0.24083 -0.03304 -0.02394  0.63280 -0.14199  0.21436  0.01104  0.19409
gene4  1.02439  0.44723 -1.26319  0.04781  0.42133 -2.11834  0.59976 -0.45331
gene5 -0.68538  0.98479  0.58614  1.32101 -1.28100 -0.24468 -1.50173 -0.23772

       NA20412  NA20414  NA20505  NA20507  NA20508  NA20517  NA20518  NA20521  \
gene1 -0.27868 -0.88236  0.85370  0.12215 -0.22804  0.11440  1.15530 -0.64616
gene2 -0.13907 -1.79076 -1.06388  1.84062  0.81188  0.74989  1.13613 -0.67296
gene3  0.20491 -1.09471 -0.41158 -0.19403  0.62979  1.45114 -0.68828 -0.21306
gene4  2.07913  0.64254  1.19309  1.36727  1.36291 -0.20257 -1.91187 -0.80394
gene5  0.55915 -0.70109  1.10276  0.19700 -1.00590 -0.19778  0.70381  1.32831

       NA20525  NA20527  NA20534  NA20537  NA20581  NA20582  NA20753  NA20754  \
gene1 -1.00890  0.33830 -1.18606 -2.50144  1.11857 -1.35514 -0.45410 -1.40787
gene2 -0.52930 -0.08037  1.49958  0.48022  1.90088  1.19142 -1.07944  1.06962
gene3  1.28033 -1.03548  0.30717 -0.60877  0.14828 -0.02566  0.68297 -1.41462
gene4  0.84287  2.31155 -0.45076  0.27237  0.25196  0.13814 -1.58961 -0.61954
gene5 -0.13948 -0.59769  1.28226  0.56941 -3.31790  0.63016  0.17360  2.28942

       NA20768  NA20771  NA20772  NA20774  NA20775  NA20804
gene1  0.74605 -1.92301  0.52952  0.59285 -0.25449 -0.42643
gene2  0.21579  0.84464  0.72602  0.17902 -1.18471  1.22427
gene3  0.02000 -0.00026 -0.43102  1.03429 -2.04323 -0.61358
gene4  0.20742 -1.52664 -0.02818  0.29817  0.01488  0.26651
gene5 -0.82307 -1.15892 -1.14967 -0.13977  0.84840  0.99755

[5 rows x 274 columns]
>>> limix.sh.remove(filepath)

HDF5 reader

>>> url = "http://rest.s3for.me/limix/smith08.hdf5.bz2"
>>> filepath = limix.sh.download(url, verbose=False)
>>> filepath = limix.sh.extract(filepath, verbose=False)
>>> data = limix.io.hdf5.read_limix(filepath)
>>> print(data)
{'phenotype': <xarray.DataArray 'phenotype' (sample: 109, outcome: 10986)>
array([[-0.037339, -0.078165,  0.042936, ...,  0.095596, -0.132385, -0.274954],
       [-0.301376,  0.066055,  0.338624, ..., -0.142661, -0.238349,  0.732752],
       [ 0.002661,  0.121835, -0.137064, ..., -0.144404,  0.257615,  0.015046],
       ...,
       [-0.287339,  0.351835,  0.072936, ...,  0.097339, -0.038349,  0.162752],
       [-0.577339,  0.011835, -0.007064, ...,  0.135596,  0.107615,  0.245046],
       [-0.277339,  0.061835,  0.132936, ...,  0.015596, -0.142385, -0.124954]])
Coordinates:
  * sample        (sample) int64 0 1 2 3 4 5 6 7 ... 102 103 104 105 106 107 108
    environment   (outcome) float64 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
    gene_ID       (outcome) object 'YOL161C' 'YJR107W' ... 'YLR118C' 'YBR242W'
    gene_chrom    (outcome) object '15' '10' '16' '7' '4' ... '3' '10' '12' '2'
    gene_end      (outcome) int64 11548 628319 32803 ... 315049 384726 705381
    gene_start    (outcome) int64 11910 627333 30482 ... 315552 385409 704665
    gene_strand   (outcome) object 'C' 'W' 'W' 'W' 'W' ... 'W' 'W' 'C' 'C' 'W'
    phenotype_ID  (outcome) object 'YOL161C:0' 'YJR107W:0' ... 'YBR242W:1'
Dimensions without coordinates: outcome, 'genotype': <xarray.DataArray 'genotype' (sample: 109, candidate: 2956)>
array([[1., 1., 1., ..., 0., 0., 0.],
       [1., 0., 1., ..., 1., 1., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 1.],
       [0., 0., 0., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])
Coordinates:
  * sample   (sample) int64 0 1 2 3 4 5 6 7 ... 101 102 103 104 105 106 107 108
    chrom    (candidate) int64 1 1 1 1 1 1 1 1 1 ... 16 16 16 16 16 16 16 16 16
    pos      (candidate) int64 483 484 3220 3223 ... 927506 932310 932535 932538
    pos_cum  (candidate) int64 483 484 3220 3223 ... 12055570 12055795 12055798
Dimensions without coordinates: candidate}
>>> limix.sh.remove(filepath)

NumPy reader

>>> url = "http://rest.s3for.me/limix/example.npy"
>>> filepath = limix.sh.download(url, verbose=False)
>>> K = limix.io.npy.read(filepath, verbose=False)
>>> print(K)
[[0.67003303 0.09512837 0.09346511 0.09252165 0.09679249]
 [0.09512837 0.66972454 0.09344451 0.09109398 0.09347495]
 [0.09346511 0.09344451 0.67305621 0.08987969 0.09689215]
 [0.09252165 0.09109398 0.08987969 0.67209248 0.09378162]
 [0.09679249 0.09347495 0.09689215 0.09378162 0.66773896]]
>>> limix.sh.remove(filepath)

Quality control

Box-Cox

limix.qc.boxcox(x)[source]

Box-Cox transformation for normality conformance.

It applies the power transformation

\[\begin{split}f(x) = \begin{cases} \frac{x^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0; \\ \log(x), & \text{if } \lambda = 0. \end{cases}\end{split}\]

to the provided data, aiming to make it closer to a normal distribution. The λ parameter is fitted by maximum likelihood estimation.

Parameters

x (array_like) – Data to be transformed.

Returns

boxcox – Box-Cox transformed data.

Return type

ndarray

Examples

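A minimal usage sketch (hypothetical data; the input is a positive, right-skewed sample):

>>> from numpy.random import RandomState
>>> from limix.qc import boxcox
>>>
>>> random = RandomState(0)
>>> x = random.chisquare(3, 100)  # positive, right-skewed data
>>> y = boxcox(x)                 # transformed towards normality
>>> print(y.shape)
(100,)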

Dependent columns

limix.qc.remove_dependent_cols(X, tol=1.49e-08)[source]

Remove dependent columns.

Return a matrix with dependent columns removed.

Parameters
  • X (array_like) – Matrix that might have dependent columns.

  • tol (float, optional) – Threshold above which columns are considered dependent.

Returns

rank – Full column rank matrix.

Return type

ndarray
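
Example

A minimal sketch, assuming a duplicated column is flagged as dependent:

>>> from numpy import hstack
>>> from numpy.random import RandomState
>>> from limix.qc import remove_dependent_cols
>>>
>>> random = RandomState(0)
>>> X = random.randn(5, 2)
>>> X = hstack([X, X[:, [0]]])  # third column duplicates the first
>>> print(remove_dependent_cols(X).shape)
(5, 2)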

limix.qc.unique_variants(X)[source]

Filters out variants with the same genetic profile.

Parameters

X (array_like) – Samples-by-variants matrix of genotype values.

Returns

genotype – Genotype array with unique variants.

Return type

ndarray

Example

>>> from numpy.random import RandomState
>>> from numpy import kron, ones, sort
>>> from limix.qc import unique_variants
>>> random = RandomState(1)
>>>
>>> N = 4
>>> X = kron(random.randn(N, 3) < 0, ones((1, 2)))
>>>
>>> print(X)  
[[0. 0. 1. 1. 1. 1.]
 [1. 1. 0. 0. 1. 1.]
 [0. 0. 1. 1. 0. 0.]
 [1. 1. 0. 0. 1. 1.]]
>>>
>>> print(unique_variants(X))  
[[0. 1. 1.]
 [1. 1. 0.]
 [0. 0. 1.]
 [1. 1. 0.]]

Genotype

limix.qc.indep_pairwise(X, window_size, step_size, threshold, verbose=True)[source]

Determine pair-wise independent variants.

Independent variants are defined via squared Pearson correlations between pairs of variants inside a sliding window.

Parameters
  • X (array_like) – Samples-by-variants matrix.

  • window_size (int) – Number of variants inside each window.

  • step_size (int) – Number of variants the sliding window skips.

  • threshold (float) – Squared Pearson correlation threshold for independence.

  • verbose (bool) – True for progress information; False otherwise.

Returns

ok – Boolean array defining independent variants

Return type

ndarray

Example

>>> from numpy.random import RandomState
>>> from limix.qc import indep_pairwise
>>>
>>> random = RandomState(0)
>>> X = random.randn(10, 20)
>>>
>>> indep_pairwise(X, 4, 2, 0.5, verbose=False)
array([ True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])
limix.qc.compute_maf(X)[source]

Compute minor allele frequencies.

It assumes that X encodes 0, 1, and 2 representing the number of alleles (or dosage), or NaN to represent missing values.

Parameters

X (array_like) – Genotype matrix.

Returns

Minor allele frequencies.

Return type

array_like

Examples

>>> from numpy.random import RandomState
>>> from limix.qc import compute_maf
>>>
>>> random = RandomState(0)
>>> X = random.randint(0, 3, size=(100, 10))
>>>
>>> print(compute_maf(X)) 
[0.49  0.49  0.445 0.495 0.5   0.45  0.48  0.48  0.47  0.435]

Impute

limix.qc.mean_impute(X, axis=-1, inplace=False)[source]

Impute NaN values.

It defaults to column-wise imputation.

Parameters
  • X (array_like) – Matrix to have its missing values imputed.

  • axis (int, optional) – Axis value. Defaults to -1.

  • inplace (bool, optional) – Defaults to False.

Returns

Imputed array.

Return type

ndarray

Examples

>>> from numpy.random import RandomState
>>> from numpy import nan, array_str
>>> from limix.qc import mean_impute
>>>
>>> random = RandomState(0)
>>> X = random.randn(5, 2)
>>> X[0, 0] = nan
>>>
>>> print(array_str(mean_impute(X), precision=4))
[[ 0.9233  0.4002]
 [ 0.9787  2.2409]
 [ 1.8676 -0.9773]
 [ 0.9501 -0.1514]
 [-0.1032  0.4106]]
limix.qc.count_missingness(X)[source]

Count the number of missing values per column.

Parameters

X (array_like) – Matrix.

Returns

count – Number of missing values per column.

Return type

ndarray
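
Example

A minimal sketch of its usage (hypothetical data; the result holds one count per column):

>>> from numpy import nan
>>> from numpy.random import RandomState
>>> from limix.qc import count_missingness
>>>
>>> random = RandomState(0)
>>> X = random.randn(4, 3)
>>> X[0, 0] = nan
>>> X[2, 0] = nan
>>> count = count_missingness(X)  # counts per column: 2, 0, 0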

Kinship

limix.qc.normalise_covariance(K, out=None)[source]

Variance rescaling of covariance matrix 𝙺.

Let n be the number of rows (or columns) of 𝙺 and let mᵢ be the average of the values in the i-th column. Gower rescaling is defined as

\[𝙺(n - 1)/(𝚝𝚛𝚊𝚌𝚎(𝙺) - ∑mᵢ).\]

Notes

The reasoning of the scaling is as follows. Let 𝐠 be a vector of n independent samples and let 𝙲 be the Gower’s centering matrix. The unbiased variance estimator is

\[v = ∑ (gᵢ-ḡ)²/(n-1) = 𝚝𝚛𝚊𝚌𝚎((𝐠-ḡ𝟏)ᵀ(𝐠-ḡ𝟏))/(n-1) = 𝚝𝚛𝚊𝚌𝚎(𝙲𝐠𝐠ᵀ𝙲)/(n-1)\]

Let 𝙺 be the covariance matrix of 𝐠. The expectation of the unbiased variance estimator is

\[𝐄[v] = 𝚝𝚛𝚊𝚌𝚎(𝙲𝐄[𝐠𝐠ᵀ]𝙲)/(n-1) = 𝚝𝚛𝚊𝚌𝚎(𝙲𝙺𝙲)/(n-1),\]

assuming that 𝐄[gᵢ]=0. We thus divide 𝙺 by 𝐄[v] to achieve an unbiased normalisation on the random variable gᵢ.

Parameters
  • K (array_like) – Covariance matrix to be normalised.

  • out (array_like, optional) – Result destination. Defaults to None.

Examples

>>> from numpy import dot, mean, zeros
>>> from numpy.random import RandomState
>>> from limix.qc import normalise_covariance
>>>
>>> random = RandomState(0)
>>> X = random.randn(10, 10)
>>> K = dot(X, X.T)
>>> Z = random.multivariate_normal(zeros(10), K, 500)
>>> print("%.3f" % mean(Z.var(1, ddof=1)))
9.824
>>> Kn = normalise_covariance(K)
>>> Zn = random.multivariate_normal(zeros(10), Kn, 500)
>>> print("%.3f" % mean(Zn.var(1, ddof=1)))
1.008

Normalisation

limix.qc.mean_standardize(X, axis=-1, inplace=False)[source]

Zero-mean and one-deviation normalisation.

Normalise in such a way that the mean and variance are equal to zero and one. This transformation is taken over the flattened array by default, otherwise over the specified axis. Missing values represented by NaN are ignored.

Parameters
  • X (array_like) – Array of values.

  • axis (int, optional) – Axis value. Defaults to -1.

  • inplace (bool, optional) – Defaults to False.

Returns

X – Normalized array.

Return type

ndarray

Example

>>> import limix
>>> from numpy import arange
>>>
>>> X = arange(15).reshape((5, 3)).astype(float)
>>> print(X)  
[[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]
 [ 9. 10. 11.]
 [12. 13. 14.]]
>>> X = arange(6).reshape((2, 3)).astype(float)
>>> X = limix.qc.mean_standardize(X, axis=0)
>>> print(X)  
[[-1.22474487  0.          1.22474487]
 [-1.22474487  0.          1.22474487]]
limix.qc.quantile_gaussianize(X, axis=1, inplace=False)[source]

Normalize a sequence of values via rank and Normal c.d.f.

It defaults to column-wise normalization.

Parameters
  • X (array_like) – Array of values.

  • axis (int, optional) – Axis value. Defaults to 1.

  • inplace (bool, optional) – Defaults to False.

Returns

Gaussian-normalized values.

Return type

array_like

Examples

>>> from limix.qc import quantile_gaussianize
>>> from numpy import array_str
>>>
>>> qg = quantile_gaussianize([-1, 0, 2])
>>> print(qg) 
[-0.67448975  0.          0.67448975]

Quantitative trait locus

Introduction

Every genetic model considered here is an instance of the generalized linear mixed model (GLMM). It consists of four main components [St16]:

  • A linear predictor, 𝐳 = 𝙼𝛂 + 𝚇𝐮.

  • The distribution of the random effects, 𝐮 ∼ 𝓝(𝟎, Σ).

  • The residual distribution, yᵢ | 𝐮.

  • The link function, 𝜇ᵢ = g(zᵢ).

The term 𝜇ᵢ represents the mean of yᵢ conditioned on 𝐮:

\[𝜇ᵢ = 𝙴[yᵢ|𝐮].\]

The role of the link function is to map the domain of zᵢ, which ranges from -∞ to +∞, to the domain of the residual distribution parameter 𝜇ᵢ. For example, the mean of a Bernoulli distribution is bounded within [0, 1], and therefore requires a link function to translate values of zᵢ into values of 𝜇ᵢ; a common choice is the logistic function, g(z) = 1/(1 + e⁻ᶻ).

The distribution of the outcome, conditioned on the random effects, has to be one from the exponential family [Ef18] having mean 𝜇ᵢ:

\[yᵢ|𝐮 ∼ 𝙴𝚡𝚙𝙵𝚊𝚖(𝜇ᵢ).\]

A notable instance of the above model is the linear mixed model (LMM). It consists of the identity link function, 𝜇ᵢ = g(zᵢ) = zᵢ, and of normally distributed residuals, yᵢ | 𝐮 ∼ 𝓝(𝜇ᵢ, 𝜎ᵢ²) [Mc11]. It is more commonly described by the equation

(1)\[𝐲 = 𝙼𝛂 + 𝚇𝐮 + 𝛆,\]

for which 𝜀ᵢ∼𝓝(0, 𝜎ᵢ²). The random variables 𝐮 and 𝛆 are independent of each other, as are 𝜀ᵢ and 𝜀ⱼ for i≠j. Defining 𝐯 = 𝚇𝐮 leads to:

\[𝐯 ∼ 𝓝(𝟎, 𝚇Σ𝚇ᵀ).\]

There is another, even simpler instance of GLMM that is also used in genetic analysis: a linear model (LM) is merely an LMM without the random effects:

\[𝐲 = 𝙼𝛂 + 𝛆.\]

The above models are used to establish statistical tests for finding significant associations between genetic loci and phenotype. For that, their parameters have to be estimated.

As an example, let us define two parameters that will describe the overall variances of the random effects and of the residual effects:

\[Σ = 𝓋₀𝙸₀ ~~\text{and}~~ 𝜎ᵢ² = 𝓋₁.\]

If we assume an LMM, this example model can be described by Eq. (1), for which

\[𝐮 ∼ 𝓝(𝟎, 𝓋₀𝙸₀) ~~\text{and}~~ 𝛆 ∼ 𝓝(𝟎, 𝓋₁𝙸₁).\]

Equivalently, we have

\[𝐲 = 𝙼𝛂 + 𝐯 + 𝛆,\]

for which

\[𝐯 ∼ 𝓝(𝟎, 𝓋₀𝚇𝚇ᵀ) ~~\text{and}~~ 𝛆 ∼ 𝓝(𝟎, 𝓋₁𝙸₁).\]

Therefore we have a model with three parameters: an array of effect sizes 𝛂 and variances 𝓋₀ and 𝓋₁. If 𝚇 contains the normalized SNP genotypes of the samples, 𝚇𝚇ᵀ is an estimation of the genetic relationship between the samples [Wa17].

Statistical test

We use the likelihood ratio test (LRT) approach [LR18] to assess the significance of the association between genetic variants and the phenotype. It is based on the ratio between the marginal likelihood of the null 𝓗₀ and alternative 𝓗₁ models, for which the simpler model 𝓗₀ is defined by placing a constraint on one or more parameters of the alternative model 𝓗₁.

The parameter inference is done via the maximum likelihood estimation (MLE) approach [ML18], for which the marginal likelihood p(𝐲 | 𝙼, 𝚇; 𝛉) is maximized over the parameter set 𝛉. Let 𝛉₀ and 𝛉₁ be the optimal parameter sets under the null and alternative models. The likelihood-ratio statistic is given by

\[-2 \log(p(𝐲| 𝙼, 𝚇; 𝛉₀) / p(𝐲| 𝙼, 𝚇; 𝛉₁)),\]

which asymptotically follows a χ² distribution [Wh14]. We will make use of the LRT approach in the next sections to flag significant genetic associations.
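
For instance, the statistic can be converted into a p-value via the χ² survival function. A short sketch with hypothetical log marginal likelihoods, assuming scipy is available:

>>> from scipy.stats import chi2
>>>
>>> null_lml = -203.99  # hypothetical log marginal likelihoods
>>> alt_lml = -184.10
>>> lrt = -2 * (null_lml - alt_lml)
>>> print("%.1e" % chi2(df=1).sf(lrt))  # one constrained parameter
2.8e-10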

Single-trait association

We first consider that the observed phenotype is described by additive effects from covariates and genetic components. Any deviation from that is assumed to be captured by the residual distribution. Let 𝙼 be a matrix of covariates and let 𝙶 be a matrix of genetic variants that we suspect might have some effect on the phenotype. Therefore, we have the linear model:

\[\begin{split}𝐲 = \underbrace{𝙼𝛂}_{\text{covariates}}+ \underbrace{𝙶𝛃}_{\text{genetics}}+ \underbrace{𝛆}_{\text{noise}},\\ \text{where}~~𝛆∼𝓝(𝟎, 𝓋₁𝙸),~~~~~~\end{split}\]

and we wish to compare the following hypotheses:

\[\begin{split}𝓗₀: 𝛃 = 𝟎\\ 𝓗₁: 𝛃 ≠ 𝟎\end{split}\]

Note that the parameters of the above model are the covariate effect sizes, 𝛂, the effect sizes of a set of genetic variants, 𝛃, and the variance 𝓋₁ of the noise variable. Under the null hypothesis, we set 𝛃=𝟎 and fit the rest of the parameters. Under the alternative hypothesis, we learn all the parameters. At the end, we compare the marginal likelihoods via the likelihood ratio test.

Let us first generate a random data set having a phenotype, covariates, and a set of genetic candidates.

>>> from numpy import ones, stack
>>> from numpy.random import RandomState
>>> from pandas import DataFrame
>>>
>>> random = RandomState(2)
>>>
>>> # sample size
>>> n = 100
>>>
>>> # covariates
>>> offset = ones(n) * random.randn()
>>> age = random.randint(16, 75, n)
>>> M = DataFrame(stack([offset, age], axis=1), columns=["offset", "age"])
>>> M["sample"] = [f"sample{i}" for i in range(n)]
>>> M = M.set_index("sample")
>>> print(M.head())
          offset      age
sample
sample0 -0.41676 38.00000
sample1 -0.41676 59.00000
sample2 -0.41676 34.00000
sample3 -0.41676 27.00000
sample4 -0.41676 56.00000
>>> # genetic variants
>>> G = random.randn(n, 4)
>>>
>>> # sampling the phenotype
>>> alpha = random.randn(2)
>>> beta = random.randn(4)
>>> eps = random.randn(n)
>>> y = M @ alpha + G @ beta + eps

We now apply the function limix.qtl.scan() to our data set:

>>> from limix.qtl import scan
>>>
>>> r = scan(G, y, "normal", M=M, verbose=False)
>>> print(r)
Hypothesis 0
------------

𝐲 ~ 𝓝(𝙼𝜶, 3.462⋅𝙸)

M     = ['offset' 'age']
𝜶     = [2.10096551 0.19582931]
se(𝜶) = [1.25826998 0.01068367]
lml   = -203.98750767964498

Hypothesis 2
------------

𝐲 ~ 𝓝(𝙼𝜶 + G𝛃, s(3.462⋅𝙸))

          lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -1.951e+02       9.915e-01       -6.198e-01
std     9.227e+00       9.342e-01        3.974e-01
min    -2.031e+02       1.844e-01       -1.025e+00
25%    -2.026e+02       1.959e-01       -9.275e-01
50%    -1.967e+02       5.965e-01       -6.047e-01
75%    -1.893e+02       1.831e+00       -2.970e-01
max    -1.841e+02       2.312e+00       -2.448e-01

Likelihood-ratio test p-values
------------------------------

       𝓗₀ vs 𝓗₂
----------------
mean   6.514e-02
std    8.856e-02
min    2.804e-10
25%    2.606e-07
50%    3.651e-02
75%    1.016e-01
max    1.875e-01

Suppose we also have access to the whole genotype of our samples, 𝚇, and we want to use it to account for population structure and cryptic relatedness in our data [Ho13]. Since the number of genetic variants in 𝚇 is commonly larger than the number of samples, and since we are not actually interested in their effect sizes, we include 𝚇 in our model as a random component. We now have a linear mixed model:

\[\begin{split}𝐲 = \underbrace{𝙼𝛂}_{\text{covariates}}+ \underbrace{𝙶𝛃}_{\text{genetics}}+ \underbrace{𝚇𝐮}_{\text{pop. struct.}}+ \underbrace{𝛆}_{\text{noise}},\\ \text{where}~~ 𝐮∼𝓝(𝟎, 𝓋₀𝙸₀) ~~\text{and} ~~𝛆∼𝓝(𝟎, 𝓋₁𝙸₁).\end{split}\]

It is important to note that 𝐯=𝚇𝐮 can be equivalently described by a multivariate Normal distribution with a covariance proportional to 𝙺 = 𝚇𝚇ᵀ:

\[𝐯 ∼ 𝓝(𝟎, 𝓋₀𝙺).\]

We make use of the function limix.stats.linear_kinship() to define the covariance matrix 𝙺, and call limix.qtl.scan() to perform the analysis.

>>> from limix.stats import linear_kinship, multivariate_normal
>>> from numpy import zeros, eye
>>>
>>> # Whole genotype of each sample.
>>> X = random.randn(n, 50)
>>> # Estimate a kinship relationship between samples.
>>> K = linear_kinship(X, verbose=False) + 1e-9 * eye(n)
>>> # Update the phenotype
>>> y += multivariate_normal(random, zeros(n), K)
>>>
>>> r = scan(X, y, "normal", K, M=M, verbose=False)
>>> print(r)
Hypothesis 0
------------

𝐲 ~ 𝓝(𝙼𝜶, 1.436⋅𝙺 + 2.934⋅𝙸)

M     = ['offset' 'age']
𝜶     = [1.95338293 0.19448903]
se(𝜶) = [1.25455536 0.01076470]
lml   = -211.3819625136375

Hypothesis 2
------------

𝐲 ~ 𝓝(𝙼𝜶 + G𝛃, s(1.436⋅𝙺 + 2.934⋅𝙸))

          lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -2.109e+02       1.069e+00        5.922e-02
std     7.210e-01       8.819e-01        2.474e-01
min    -2.114e+02       1.919e-01       -5.204e-01
25%    -2.113e+02       1.944e-01       -1.102e-01
50%    -2.111e+02       8.904e-01        4.920e-02
75%    -2.109e+02       1.956e+00        2.366e-01
max    -2.076e+02       2.215e+00        6.433e-01

Likelihood-ratio test p-values
------------------------------

       𝓗₀ vs 𝓗₂
----------------
mean   4.843e-01
std    2.705e-01
min    6.294e-03
25%    3.137e-01
50%    4.752e-01
75%    6.953e-01
max    9.929e-01

Non-normal trait association

If the residuals of the phenotype do not follow a Normal distribution, we might consider performing the analysis using a generalized linear mixed model. Let us consider Poisson-distributed residuals:

\[yᵢ | 𝐳 ∼ 𝙿𝚘𝚒𝚜𝚜𝚘𝚗(𝜇ᵢ=g(zᵢ)),\]

where the latent phenotype is described by

\[𝐳 = 𝙼𝛃 + 𝚇𝐮 + 𝛆,\]

for

\[𝐮 ∼ 𝓝(𝟎, 𝓋₀𝙸₀) ~~\text{and}~~ 𝛆 ∼ 𝓝(𝟎, 𝓋₁𝙸₁).\]

Note that the term 𝛆 in the above model is not the residual variable, as it was in Eq. (1). Here, 𝛆 accounts for the so-called over-dispersion, i.e., for when the residual distribution alone is not sufficient to explain the variability of yᵢ.

>>> from numpy import exp
>>>
>>> z = (y - y.mean()) / y.std()
>>> y = random.poisson(exp(z))
>>>
>>> r = scan(G, y, "poisson", K, M=M, verbose=False)
>>> print(r) 
Hypothesis 0
------------

𝐳 ~ 𝓝(𝙼𝜶, 0.154⋅𝙺 + 0.000⋅𝙸) for yᵢ ~ Poisson(λᵢ=g(zᵢ)) and g(x)=eˣ

M     = ['offset' 'age']
𝜶     = [5.17511934 0.04665214]
se(𝜶) = [0.85159296 0.00604330]
lml   = -145.33385788740767

Hypothesis 2
------------

𝐳 ~ 𝓝(𝙼𝜶 + G𝛃, s(0.154⋅𝙺 + 0.000⋅𝙸)) for yᵢ ~ Poisson(λᵢ=g(zᵢ)) and g(x)=eˣ

          lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -1.440e+02       2.553e+00       -1.306e-01
std     1.343e+00       2.682e+00        9.268e-02
min    -1.453e+02       4.345e-02       -2.227e-01
25%    -1.450e+02       4.635e-02       -2.018e-01
50%    -1.439e+02       2.456e+00       -1.344e-01
75%    -1.428e+02       5.054e+00       -6.321e-02
max    -1.427e+02       5.202e+00       -3.085e-02

Likelihood-ratio test p-values
------------------------------

       𝓗₀ vs 𝓗₂
----------------
mean   2.830e-01
std    3.213e-01
min    2.274e-02
25%    2.519e-02
50%    2.113e-01
75%    4.692e-01
max    6.867e-01

Single-trait with interaction

The following linear mixed model is considered:

\[\begin{split}𝐲 = 𝙼𝛂 + (𝙶⊙𝙴₀)𝛃₀ + (𝙶⊙𝙴₁)𝛃₁ + 𝚇𝐮 + 𝛆,\\ \text{where}~~ 𝐮∼𝓝(𝟎, 𝓋₀𝙸₀) ~~\text{and}~~ 𝛆∼𝓝(𝟎, 𝓋₁𝙸₁).\end{split}\]

The operator ⊙ works as follows:

\[𝙰⊙𝙱 = [𝙰₀𝙱₀ ~~...~~ 𝙰₀𝙱ₙ ~~ 𝙰₁𝙱₀ ~~...~~ 𝙰₁𝙱ₙ ~~...~~ 𝙰ₘ𝙱ₙ]\]

Therefore, the terms 𝙶⊙𝙴₀ and 𝙶⊙𝙴₁ can be understood as interaction terms between genetics, 𝙶, and environments, 𝙴₀ and 𝙴₁.
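
As a sketch of what ⊙ computes (not limix's internal implementation), the column-wise products can be written in NumPy via row-wise outer products over hypothetical matrices:

>>> from numpy import arange, einsum
>>>
>>> A = arange(6.0).reshape(3, 2)  # n×m
>>> B = arange(9.0).reshape(3, 3)  # n×p
>>> # Row-wise outer products, flattened into an n×(m·p) matrix whose
>>> # columns are 𝙰₀𝙱₀, ..., 𝙰₀𝙱ₙ, 𝙰₁𝙱₀, ..., in the order above.
>>> AB = einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)
>>> print(AB.shape)
(3, 6)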

We define three hypotheses from the above linear mixed model:

\[\begin{split}𝓗₀: 𝛃₀=𝟎 ~~\text{and}~~ 𝛃₁=𝟎\\ 𝓗₁: 𝛃₀≠𝟎 ~~\text{and}~~ 𝛃₁=𝟎\\ 𝓗₂: 𝛃₀≠𝟎 ~~\text{and}~~ 𝛃₁≠𝟎\end{split}\]

The hypothesis 𝓗₀ is for no-interaction, 𝓗₁ is for interaction with environments encoded in 𝙴₀, and 𝓗₂ is for interaction with environments encoded in 𝙴₀ and 𝙴₁. We perform three statistical tests:

  • 𝓗₀ (null) vs 𝓗₁ (alternative)

  • 𝓗₀ (null) vs 𝓗₂ (alternative)

  • 𝓗₁ (null) vs 𝓗₂ (alternative)

Here is an example.

>>> from numpy import concatenate, newaxis
>>> from limix.qtl import iscan
>>>
>>> # Generate interacting variables (environment)
>>> E0 = random.randn(y.shape[0], 1)
>>> E1 = random.randn(y.shape[0], 1)
>>>
>>> r = iscan(G, y, "normal", K, M, E0=E0, E1=E1, verbose=False)
>>> print(r)
Hypothesis 0
------------

𝐲 ~ 𝓝(𝙼𝜶, 0.376⋅𝙺 + 2.077⋅𝙸)

M     = ['offset' 'age']
𝜶     = [3.12608063 0.06042316]
se(𝜶) = [1.01867609 0.00870181]
lml   = -185.77488727691096

Hypothesis 1
------------

𝐲 ~ 𝓝(𝙼𝜶 + (𝙶⊙𝙴₀)𝛃₀, s(0.376⋅𝙺 + 2.077⋅𝙸))

          lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -1.856e+02       1.611e+00       -2.976e-03
std     1.949e-01       1.658e+00        1.208e-01
min    -1.858e+02       6.034e-02       -1.461e-01
25%    -1.858e+02       6.058e-02       -4.769e-02
50%    -1.856e+02       1.590e+00       -7.487e-03
75%    -1.854e+02       3.137e+00        3.722e-02
max    -1.854e+02       3.235e+00        1.492e-01

Hypothesis 2
------------

𝐲 ~ 𝓝(𝙼𝜶 + (𝙶⊙𝙴₀)𝛃₀ + (𝙶⊙𝙴₁)𝛃₁, s(0.376⋅𝙺 + 2.077⋅𝙸))

          lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -1.852e+02       1.612e+00        7.001e-03
std     7.598e-01       1.659e+00        1.475e-01
min    -1.857e+02       5.991e-02       -2.573e-01
25%    -1.856e+02       6.096e-02       -4.135e-02
50%    -1.855e+02       1.571e+00        3.611e-02
75%    -1.851e+02       3.135e+00        7.660e-02
max    -1.841e+02       3.241e+00        1.971e-01

Likelihood-ratio test p-values
------------------------------

       𝓗₀ vs 𝓗₁    𝓗₀ vs 𝓗₂    𝓗₁ vs 𝓗₂
----------------------------------------
mean   6.867e-01   6.501e-01   5.244e-01
std    3.199e-01   3.350e-01   3.168e-01
min    3.963e-01   1.795e-01   9.940e-02
25%    4.185e-01   5.578e-01   3.784e-01
50%    6.755e-01   7.277e-01   5.971e-01
75%    9.436e-01   8.200e-01   7.431e-01
max    9.995e-01   9.654e-01   8.042e-01

Multi-trait association

LMMs can also be used to jointly model multiple traits. Let n, c, and p be the number of samples, covariates, and traits, respectively. The outcome variable 𝚈 is an n×p matrix distributed according to

(2)\[𝚟𝚎𝚌(𝚈) ∼ 𝓝((𝙰 ⊗ 𝙼) 𝚟𝚎𝚌(𝐀), 𝙲₀ ⊗ 𝚇𝚇ᵀ + 𝙲₁ ⊗ 𝙸).\]

𝙰 and 𝙼 are design matrices of dimensions p×p and n×c provided by the user, where 𝙼 is the usual matrix of covariates commonly used in single-trait models. 𝐀 is a c×p matrix of fixed-effect sizes per trait. 𝚇 is an n×r matrix provided by the user and 𝙸 is the n×n identity matrix. 𝙲₀ and 𝙲₁ are both symmetric matrices of dimensions p×p, for which 𝙲₁ is guaranteed by our implementation to be of full rank. The parameters of this model are the matrices 𝐀, 𝙲₀, and 𝙲₁. 𝚟𝚎𝚌(⋅) is a function that stacks the columns of the provided matrix into a vector [Ve19].
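
The mean term can be read through the identity 𝚟𝚎𝚌(𝙼𝐀𝙰ᵀ) = (𝙰 ⊗ 𝙼)𝚟𝚎𝚌(𝐀), which the following sketch checks numerically with hypothetical matrices:

>>> from numpy import allclose, kron
>>> from numpy.random import RandomState
>>>
>>> random = RandomState(0)
>>> n, c, p = 4, 2, 3
>>> M = random.randn(n, c)         # covariates
>>> A = random.randn(p, p)         # trait design matrix
>>> effsizes = random.randn(c, p)  # the matrix 𝐀 of fixed-effect sizes
>>>
>>> lhs = (M @ effsizes @ A.T).ravel(order="F")   # vec(𝙼𝐀𝙰ᵀ)
>>> rhs = kron(A, M) @ effsizes.ravel(order="F")  # (𝙰⊗𝙼)vec(𝐀)
>>> print(allclose(lhs, rhs))
True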

Let 𝐲=𝚟𝚎𝚌(𝚈) and 𝛂=𝚟𝚎𝚌(𝐀). We can extend the model in Eq. (2) to represent three different hypotheses:

\[𝐲 ∼ 𝓝((𝙰 ⊗ 𝙼)𝛂 + (𝙰₀ ⊗ 𝙶)𝛃₀ + (𝙰₁ ⊗ 𝙶)𝛃₁, 𝙲₀ ⊗ 𝚇𝚇ᵀ + 𝙲₁ ⊗ 𝙸);\]

the hypotheses being

\[\begin{split}𝓗₀: 𝛃₀=𝟎 ~~\text{and}~~ 𝛃₁=𝟎\\ 𝓗₁: 𝛃₀≠𝟎 ~~\text{and}~~ 𝛃₁=𝟎\\ 𝓗₂: 𝛃₀≠𝟎 ~~\text{and}~~ 𝛃₁≠𝟎\end{split}\]

as before. Here is an example.

>>> from numpy import eye
>>>
>>> p = 2
>>> Y = random.randn(n, p)
>>> A = random.randn(p, p)
>>> A = A @ A.T
>>> A0 = ones((p, 1))
>>> A1 = eye(p)
>>>
>>> r = scan(G, Y, K=K, M=M, A=A, A0=A0, A1=A1, verbose=False)
>>> print(r) 
Hypothesis 0
------------

𝐲 ~ 𝓝((A⊗𝙼)𝛂, C₀⊗𝙺 + C₁⊗𝙸)

traits   = ['0' '1']
M        = ['offset' 'age']
𝜶        = [-0.16350676 -0.00299814 -0.34521236 -0.00080406]
se(𝜶)    = [11.30571652  0.09640163  5.36090270  0.04573611]
diag(C₀) = [0.01404947 0.29153072]
diag(C₁) = [0.81175806 0.85780008]
lml      = -277.3341913587698

Hypothesis 1
------------

𝐲 ~ 𝓝((A⊗𝙼)𝛂 + (A₀⊗G)𝛃₀, s(C₀⊗𝙺 + C₁⊗𝙸))

          lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -2.763e+02      -1.243e-01       -2.842e-02
std     1.329e+00       1.435e-01        1.120e-01
min    -2.773e+02      -3.712e-01       -1.666e-01
25%    -2.772e+02      -2.032e-01       -7.187e-02
50%    -2.767e+02      -6.896e-02       -2.672e-02
75%    -2.758e+02      -1.371e-03        1.673e-02
max    -2.744e+02       1.386e-04        1.063e-01

Hypothesis 2
------------

𝐲 ~ 𝓝((A⊗𝙼)𝛂 + (A₀⊗G)𝛃₀ + (A₁⊗G)𝛃₁, s(C₀⊗𝙺 + C₁⊗𝙸))

          lml       cov. effsizes   cand. effsizes
--------------------------------------------------
mean   -2.761e+02      -1.245e-01       -1.209e-02
std     1.404e+00       1.439e-01        5.823e-02
min    -2.772e+02      -3.745e-01       -1.151e-01
25%    -2.770e+02      -2.025e-01       -3.202e-02
50%    -2.766e+02      -6.702e-02       -7.898e-03
75%    -2.757e+02      -1.441e-03        2.127e-02
max    -2.741e+02      -7.171e-04        7.372e-02

Likelihood-ratio test p-values
------------------------------

       𝓗₀ vs 𝓗₁    𝓗₀ vs 𝓗₂    𝓗₁ vs 𝓗₂
----------------------------------------
mean   3.973e-01   6.122e-01   8.438e-01
std    3.851e-01   3.942e-01   1.327e-01
min    1.597e-02   9.170e-02   7.251e-01
25%    1.133e-01   4.159e-01   7.319e-01
50%    3.626e-01   7.039e-01   8.370e-01
75%    6.466e-01   9.002e-01   9.488e-01
max    8.478e-01   9.493e-01   9.760e-01

References

LR18

Wikipedia contributors. (2018, October 21). Likelihood-ratio test. In Wikipedia, The Free Encyclopedia. Retrieved 16:13, November 27, 2018, from https://en.wikipedia.org/w/index.php?title=Likelihood-ratio_test&oldid=865020904

ML18

Wikipedia contributors. (2018, November 8). Maximum likelihood estimation. In Wikipedia, The Free Encyclopedia. Retrieved 16:08, November 27, 2018, from https://en.wikipedia.org/w/index.php?title=Maximum_likelihood_estimation&oldid=867823508

St16

Stroup, W. W. (2016). Generalized linear mixed models: modern concepts, methods and applications. CRC press.

Ef18

Wikipedia contributors. (2018, October 18). Exponential family. In Wikipedia, The Free Encyclopedia. Retrieved 18:45, November 25, 2018, from https://en.wikipedia.org/w/index.php?title=Exponential_family&oldid=864576150

Mc11

McCulloch, Charles E., and Shayle R. Searle. Generalized, linear, and mixed models. John Wiley & Sons, 2004.

Ve19

Wikipedia contributors. (2018, September 11). Vectorization (mathematics). In Wikipedia, The Free Encyclopedia. Retrieved 16:18, November 28, 2018, from https://en.wikipedia.org/w/index.php?title=Vectorization_(mathematics)&oldid=859035294

Wa17

Wang, B., Sverdlov, S., & Thompson, E. (2017). Efficient estimation of realized kinship from single nucleotide polymorphism genotypes. Genetics, 205(3), 1063-1078.

Wh14

White, H. (2014). Asymptotic theory for econometricians. Academic press.

Ho13

Hoffman, G. E. (2013). Correcting for population structure and kinship using the linear mixed model: theory and extensions. PloS one, 8(10), e75707.

Variance decomposition

Single-trait decomposition

We make use of a GLMM with random effects structured in multiple variables, each describing a different aspect of the dataset. For a normally distributed phenotype, we use the model

\[𝐲 = 𝐌𝛃 + ∑ⱼ𝐮ⱼ, ~~\text{where}~~ 𝐮ⱼ ∼ 𝓝(𝟎, 𝓋ⱼ𝙺ⱼ).\]

For non-normally distributed phenotype, the model is given by

\[\begin{split}𝐳 = 𝐌𝛃 + ∑ⱼ𝐮ⱼ, ~~\text{where}~~~~~~~~~~~~~~~~~~\\ 𝐮ⱼ ∼ 𝓝(𝟎, 𝓋ⱼ𝙺ⱼ) ~~\text{and}~~ yᵢ|𝐳 ∼ 𝙴𝚡𝚙𝙵𝚊𝚖(𝜇ᵢ=g(zᵢ)).\end{split}\]

The parameters 𝛃 and 𝓋ⱼ are fit via maximum likelihood.

Example: glucose condition

Here we use the Limix variance decomposition module to quantify the variability in gene expression explained by proximal (cis) and distal (trans) genetic variation. To do so, we build a linear mixed model with an intercept, two random effects for cis and trans genetic effects, and a noise random effect.

Let's first download the dataset.

>>> import limix
>>>
>>> url = "http://rest.s3for.me/limix/smith08.hdf5.bz2"
>>> filepath = limix.sh.download(url, verbose=False)
>>> filepath = limix.sh.extract(filepath, verbose=False)
>>> # This dataset is in the old limix format.
>>> data = limix.io.hdf5.read_limix(filepath)
>>> Y = data['phenotype']
>>> G_all = data['genotype']

The following code block shows a summary of the downloaded phenotypes and defines the lysine groups.

Note

The phenotype variable Y is of type xarray.DataArray. Y has two dimensions and multiple coordinates associated with them.

>>> print(Y)
<xarray.DataArray 'phenotype' (sample: 109, outcome: 10986)>
array([[-0.037339, -0.078165,  0.042936, ...,  0.095596, -0.132385, -0.274954],
       [-0.301376,  0.066055,  0.338624, ..., -0.142661, -0.238349,  0.732752],
       [ 0.002661,  0.121835, -0.137064, ..., -0.144404,  0.257615,  0.015046],
       ...,
       [-0.287339,  0.351835,  0.072936, ...,  0.097339, -0.038349,  0.162752],
       [-0.577339,  0.011835, -0.007064, ...,  0.135596,  0.107615,  0.245046],
       [-0.277339,  0.061835,  0.132936, ...,  0.015596, -0.142385, -0.124954]])
Coordinates:
  * sample        (sample) int64 0 1 2 3 4 5 6 7 ... 102 103 104 105 106 107 108
    environment   (outcome) float64 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
    gene_ID       (outcome) object 'YOL161C' 'YJR107W' ... 'YLR118C' 'YBR242W'
    gene_chrom    (outcome) object '15' '10' '16' '7' '4' ... '3' '10' '12' '2'
    gene_end      (outcome) int64 11548 628319 32803 ... 315049 384726 705381
    gene_start    (outcome) int64 11910 627333 30482 ... 315552 385409 704665
    gene_strand   (outcome) object 'C' 'W' 'W' 'W' 'W' ... 'W' 'W' 'C' 'C' 'W'
    phenotype_ID  (outcome) object 'YOL161C:0' 'YJR107W:0' ... 'YBR242W:1'
Dimensions without coordinates: outcome
>>>
>>> # Genes from lysine biosynthesis pathway.
>>> lysine_group = [
...     "YIL094C",
...     "YDL182W",
...     "YDL131W",
...     "YER052C",
...     "YBR115C",
...     "YDR158W",
...     "YNR050C",
...     "YJR139C",
...     "YIR034C",
...     "YGL202W",
...     "YDR234W",
... ]

We will compute the relationship matrix K_all considering all SNPs and define the cis region size window_size in base pairs. Then we loop over two genes from the lysine pathway, delimit the corresponding cis region, define the model, and fit it.

>>> from numpy import dot
>>>
>>> K_all = dot(G_all, G_all.T)
>>> window_size = int(5e5)
>>>
>>> vardecs = []
>>>
>>> # We loop over the first two groups only.
>>> for gene in lysine_group[:2]:
...     # Select the row corresponding to gene of interest on environment 0.0.
...     y = Y[:, (Y["gene_ID"] == gene) & (Y["environment"] == 0.0)]
...
...     # Estimated middle point of the gene.
...     midpoint = (y["gene_end"].item() - y["gene_start"].item()) / 2
...
...     # Window definition.
...     start = midpoint - window_size // 2
...     end = midpoint + window_size // 2
...     geno = G_all[:, (G_all["pos"] >= start) & (G_all["pos"] <= end)]
...
...     G_cis = G_all[:, geno.candidate]
...     K_cis = dot(G_cis, G_cis.T)
...     K_trans = K_all - K_cis
...
...     # Definition of the model to fit our data from which we extract
...     # the relative signal strength.
...     vardec = limix.vardec.VarDec(y, "normal")
...     vardec.append(K_cis, "cis")
...     vardec.append(K_trans, "trans")
...     vardec.append_iid("noise")
...     vardec.fit(verbose=False)
...     vardecs.append(vardec)

We show a summary of each decomposition.

>>> print(vardecs[0])
Variance decomposition
----------------------

𝐲 ~ 𝓝(𝙼𝜶, 0.018⋅𝙺 + 0.047⋅𝙺 + 0.066⋅𝙸)
>>> print(vardecs[1])
Variance decomposition
----------------------

𝐲 ~ 𝓝(𝙼𝜶, 0.197⋅𝙺 + 0.087⋅𝙺 + 0.149⋅𝙸)

We now plot the results.

>>> vardecs[0].plot()

>>> vardecs[1].plot()

And remove temporary files.

>>> limix.sh.remove("smith08.hdf5.bz2")
>>> limix.sh.remove("smith08.hdf5")

Heritability estimation

We provide heritability estimation for Normal, Bernoulli, Probit, Binomial, and Poisson phenotypes. A standard LMM is used for Normal traits:

\[𝐲 = 𝙼𝛂 + 𝐯 + 𝛆,\]

where

\[𝐯 ∼ 𝓝(𝟎, 𝓋₀𝙺) ~~\text{and}~~ 𝛆 ∼ 𝓝(𝟎, 𝓋₁𝙸).\]

A GLMM is used to model the other types of traits:

\[𝐳 = 𝙼𝛂 + 𝐯 + 𝛆, ~~\text{where}~~ yᵢ|𝐳 ∼ 𝙴𝚡𝚙𝙵𝚊𝚖(𝜇ᵢ=g(zᵢ))\]

and 𝐯 and 𝛆 are defined as before.

In both cases, the parameters are the same: 𝛂, 𝓋₀, and 𝓋₁. They are fitted via restricted maximum likelihood for the LMM and via maximum likelihood for the GLMM. The heritability is then given by the fraction of variance explained by 𝐯: h² = 𝓋₀/(𝓋₀ + 𝓋₁). The covariance matrix 𝙺 given by the user is normalised before the model is fitted as follows:

K = K / K.diagonal().mean()
limix.her.estimate(y, lik, K, M=None, verbose=True)[source]

Estimate the so-called narrow-sense heritability.

It supports Normal, Bernoulli, Probit, Binomial, and Poisson phenotypes.

Parameters
  • y (array_like) – Array of trait values of n individuals.

  • lik (tuple, "normal", "bernoulli", "probit", "binomial", "poisson") – Sample likelihood describing the residual distribution. Either a tuple or a string specifying the likelihood is required. The Normal, Bernoulli, Probit, and Poisson likelihoods can be selected by providing a string. The Binomial likelihood, on the other hand, requires a tuple because of the number of trials: ("binomial", array_like).

  • K (n×n array_like) – Sample covariance, often the so-called kinship matrix. It might be, for example, the estimated kinship relationship between the individuals. The provided matrix will be normalised as K = K / K.diagonal().mean().

  • M (n×c array_like, optional) – Covariates matrix. If an array is passed, it will be used as is; no normalisation will be performed. If None is passed, an offset will be used as the only covariate. Defaults to None.

  • verbose (bool, optional) – True to display progress and summary; False otherwise.

Returns

Estimated heritability.

Return type

float

Examples

>>> from numpy import dot, exp, sqrt
>>> from numpy.random import RandomState
>>> from limix.her import estimate
>>>
>>> random = RandomState(0)
>>>
>>> G = random.randn(150, 200) / sqrt(200)
>>> K = dot(G, G.T)
>>> z = dot(G, random.randn(200)) + random.randn(150)
>>> y = random.poisson(exp(z))
>>>
>>> print(estimate(y, 'poisson', K, verbose=False))  
0.18311439918863426

Notes

It will raise a ValueError exception if non-finite values are passed. Please, refer to the limix.qc.mean_impute() function for missing value imputation.
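
For example, a trait with missing entries could be imputed first; a minimal sketch with hypothetical data, assuming mean_impute accepts a one-dimensional array:

>>> from numpy import nan
>>> from numpy.random import RandomState
>>> from limix.qc import mean_impute
>>>
>>> random = RandomState(0)
>>> y = random.randn(10)
>>> y[3] = nan          # a missing trait value
>>> y = mean_impute(y)  # impute before calling estimate()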

Plotting

Kinship

We provide a function that clusters the rows and columns of a given matrix and plots it as a heatmap. This is very useful for quickly visualizing possible population structure in the data set.

>>> import limix
>>>
>>> K = limix.plot.load_dataset("kinship")
>>> print(K) 
[[0.67003303 0.09512837 0.09346511 ... 0.08672590 0.08412999 0.08805045]
 [0.09512837 0.66972454 0.09344451 ... 0.08304282 0.08735654 0.07970102]
 [0.09346511 0.09344451 0.67305621 ... 0.08992310 0.08756828 0.08362728]
 ...
 [0.08672590 0.08304282 0.08992310 ... 0.69345128 0.08846725 0.09457396]
 [0.08412999 0.08735654 0.08756828 ... 0.08846725 0.69414844 0.10091500]
 [0.08805045 0.07970102 0.08362728 ... 0.09457396 0.10091500 0.68948159]]
>>> limix.plot.kinship(K)

[figure: clustered kinship heatmap]

Manhattan

The results of GWAS can be visualized by a Manhattan plot.

>>> import limix
>>> from numpy import log10
>>>
>>> df = limix.plot.load_dataset('gwas')
>>> df = df.rename(columns={"chr": "chrom"})
>>> print(df.head())
    chrom     pos       pv
234    10  224239  0.00887
239    10  229681  0.00848
253    10  240788  0.00721
258    10  246933  0.00568
266    10  255222  0.00593
>>> limix.plot.manhattan(df)
>>> plt = limix.plot.get_pyplot()
>>> _ = plt.axhline(-log10(1e-7), color='red')
>>> _ = plt.ylim(2, plt.ylim()[1])

[figure: Manhattan plot]

Quantile-quantile plot

A QQ-plot can be used to assess the calibration of the obtained p-values.

>>> import limix
>>> from numpy.random import RandomState
>>>
>>> random = RandomState(1)
>>>
>>> pv0 = random.rand(10000)
>>> pv0[0] = 1e-6
>>>
>>> pv1 = random.rand(10000)
>>> pv2 = random.rand(10000)
>>>
>>> limix.plot.qqplot(pv0)
>>>
>>> limix.plot.qqplot(pv0)
>>> limix.plot.qqplot(pv1, line=False, alpha=None)
>>>
>>> limix.plot.qqplot(pv1)
>>> limix.plot.qqplot(pv2, line=False, alpha=None)
>>> limix.plot.box_aspect()
>>>
>>> limix.plot.qqplot(pv0, label='label0', band_kws=dict(color='#EE0000',
...                   alpha=0.2));
>>> limix.plot.qqplot(pv1, label='label1', line=False, alpha=None);
>>> limix.plot.qqplot(pv2, label='label2', line=False,
...                   alpha=None, pts_kws=dict(marker='*'));
>>> _ = limix.plot.get_pyplot().legend()

[figure: quantile-quantile plots]

Command line interface

Introduction

Limix provides some of its functionality via the command line.

$ limix --help
Usage: limix [OPTIONS] COMMAND [ARGS]...

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  download          Download file from the specified URL.
  estimate-kinship  Estimate a kinship matrix.
  extract           Extract a file.
  qtl               Perform genome-wide association scan.
  remove            Remove a file.
  see               Show an overview of multiple file types.

You can quickly explore common file types used in genetics, as the examples below demonstrate.

Kinship

Heatmap representing a plink kinship matrix.

$ limix download http://rest.s3for.me/limix/small_example.grm.raw.bz2
Downloading http://rest.s3for.me/limix/small_example.grm.raw.bz2... done (0.37 seconds).
$ limix extract small_example.grm.raw.bz2
Extracting small_example.grm.raw.bz2... done (0.02 seconds).
$ limix see small_example.grm.raw
Reading small_example.grm.raw... done (0.13 seconds).
Plotting...

[figure: kinship heatmap]

BIMBAM file formats

Phenotype:

$ limix download http://rest.s3for.me/limix/ex0/phenotype.gemma
Downloading http://rest.s3for.me/limix/ex0/phenotype.gemma... done (0.26 seconds).
$ limix see phenotype.gemma:bimbam-pheno
Reading `phenotype.gemma`... done (0.02 seconds).
---- Phenotypes -----
trait     0    1    2
sample               
0       1.2 -0.3 -1.5
1       NaN  1.5  0.3
2       2.7  1.1  NaN
3      -0.2 -0.7  0.8
4       3.3  2.4  2.1

HDF5

The following command shows the hierarchy of a HDF5 file:

$ limix download http://rest.s3for.me/limix/small_example.hdf5
Downloading http://rest.s3for.me/limix/small_example.hdf5... done (0.37 seconds).
$ limix see small_example.hdf5
/
  +--genotype
     +--col_header
     |  +--chrom [|S8, (100,), None]
     |  +--pos [int64, (100,), None]
     +--matrix [uint8, (183, 100), None]
     +--row_header
        +--sample_ID [|S7, (183,), None]

CSV

CSV files have their delimiter automatically detected and a preview can be shown as

$ limix download http://rest.s3for.me/limix/small_example.csv.bz2
Downloading http://rest.s3for.me/limix/small_example.csv.bz2... done (0.26 seconds).
$ limix extract small_example.csv.bz2
Extracting small_example.csv.bz2... done (0.02 seconds).
$ limix see small_example.csv
Reading small_example.csv... done (0.06 seconds).
               0   1   2   3   4   5   6   7    ... 458 459 460 461 462 463 464 465
0  snp_22_16050408   A   A   A   A   A   A   A  ...   B   B   B   B   B   B   B   B
1  snp_22_16050612   A   A   A   A   A   A   A  ...   B   B   B   B   B   B   B   B
2  snp_22_16050678   A   A   A   A   A   A   A  ...   B   B   B   B   B   B   B   B
3  snp_22_16051107   A   A   A   A   A   A   A  ...   B   B   B   B   B   B   B   B
4  snp_22_16051249   A   A   A   A   A   A   A  ...   B   B   B   B   C   C   B   B

[5 rows x 466 columns]

Image

An image can be seen via

$ limix download http://rest.s3for.me/limix/dali.jpg.bz2
Downloading http://rest.s3for.me/limix/dali.jpg.bz2... done (0.71 seconds).
$ limix extract dali.jpg.bz2
Extracting dali.jpg.bz2... done (0.03 seconds).
$ limix see dali.jpg

[figure: dali.jpg]

GWAS

$ limix qtl --help
Usage: limix qtl [OPTIONS] COMMAND [ARGS]...

  Perform genome-wide association scan.

Options:
  -h, --help  Show this message and exit.

Commands:
  scan  Single-variant association testing via mixed models.

Statistics

Alleles

limix.stats.allele_expectation(p, nalleles, ploidy)[source]

Allele expectation.

Compute the expectation of each allele from the given probabilities. It accepts three shapes of matrices:

  • unidimensional array of probabilities;

  • bidimensional samples-by-alleles array;

  • three-dimensional variants-by-samples-by-alleles array.

Parameters
  • p (array_like) – Allele probabilities.

  • nalleles (int) – Number of alleles.

  • ploidy (int) – Number of complete sets of chromosomes.

Returns

expectation – Last dimension will contain the expectation of each allele.

Return type

ndarray

Examples

>>> from texttable import Texttable
>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>>
>>> sampleid = "sample_005"
>>> rsid = "RSID_6"
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...
...     locus = bgen["variants"].query("rsid == '{}'".format(rsid)).index
...     locus = locus.compute().values[0]
...     sample = bgen["samples"].to_frame().query("id == '{}'".format(sampleid))
...     sample = sample.index
...     sample = sample[0]
...
...     nalleles = bgen["variants"].loc[locus]["nalleles"]
...     ploidy = 2
...
...     p = bgen["genotype"][locus].compute()["probs"][sample]
...     # For unphased genotypes only.
...     e = allele_expectation(bgen, locus)[sample]
...
...     alleles = bgen["variants"].loc[locus]["allele_ids"].compute()
...     alleles = alleles.values[0].split(",")
...
...     tab = Texttable()
...
...     print(tab.add_rows(
...         [
...             ["", "AA", "AG", "GG", "E[.]"],
...             ["p"] + list(p) + [1.0],
...             ["#" + alleles[0], 2, 1, 0, e[0]],
...             ["#" + alleles[1], 0, 1, 2, e[1]],
...         ]
...     ).draw())
+----+-------+-------+-------+-------+
|    |  AA   |  AG   |  GG   | E[.]  |
+====+=======+=======+=======+=======+
| p  | 0.012 | 0.987 | 0.001 | 1     |
+----+-------+-------+-------+-------+
| #A | 2     | 1     | 0     | 1.011 |
+----+-------+-------+-------+-------+
| #G | 0     | 1     | 2     | 0.989 |
+----+-------+-------+-------+-------+
>>> print("variant: {}".format(rsid))
variant: RSID_6
>>> print("sample : {}".format(sampleid))
sample : sample_005

Note

This function supports unphased genotypes only.

limix.stats.allele_frequency(X)[source]

Compute allele frequency from its expectation.

X is a matrix whose last dimension corresponds to the different sets of chromosomes. The first dimension represents the different variants and the second dimension represents the different samples.

Parameters

X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.

Returns

frequency – Allele frequencies encoded as a variants-by-alleles matrix.

Return type

ndarray

limix.stats.compute_dosage(X, alt=None)[source]

Compute dosage from allele expectation.

Parameters
  • X (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.

  • alt (array_like, optional) – Alternative allele of each locus. If None, the allele having the minor allele frequency for the provided X is used as the alternative. Defaults to None.

Returns

dosage – Dosage encoded as a variants-by-samples matrix.

Return type

ndarray

Example

>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>> from bgen_reader import compute_dosage
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...     variant_idx = 2
...     e = allele_expectation(bgen, variant_idx)
...     dosage = compute_dosage(e)
...     print(dosage[:5])
[0.01550294 0.99383543 1.97933958 0.99560547 1.97879027]

Chi-square mixture

limix.stats.Chi2Mixture(scale_min=0.1, scale_max=5.0, dof_min=0.1, dof_max=5.0, n_intervals=100, qmax=0.1, tol=0, lrt=None)[source]

Mixture of 𝜒² distributions.

Class for the evaluation of p-values for a test statistic that follows a two-component mixture of χ² distributions:

\[p(x) = (1-𝑝)𝜒²(0) + 𝑝𝑎𝜒²(𝑑).\]

Here 𝑝 is the probability of being in the second component, and 𝑎 and 𝑑 are the scale parameter and the number of degrees of freedom of that component.

Warning

This class is not production-ready. Keep in mind that this interface is likely to change.

Parameters
  • scale_min (float) – Minimum value used for fitting the scale parameter.

  • scale_max (float) – Maximum value used for fitting the scale parameter.

  • dof_min (float) – Minimum value used for fitting the dof parameter.

  • dof_max (float) – Maximum value used for fitting the dof parameter.

  • qmax (float) – Only the top qmax quantile is used for the fit.

  • n_intervals (int) – Number of intervals when performing the grid search.

  • tol (float) – Tolerance of being zero.

Example

>>> from numpy.random import RandomState
>>> import scipy as sp
>>> from limix.stats import Chi2Mixture
>>>
>>> scale = 0.3
>>> dof = 2
>>> mixture = 0.2
>>> n = 100
>>>
>>> random = RandomState(1)
>>> x =  random.chisquare(dof, n)
>>> n0 = int( (1-mixture) * n)
>>> idxs = random.choice(n, n0, replace=False)
>>> x[idxs] = 0
>>>
>>> chi2mix = Chi2Mixture(scale_min=0.1, scale_max=5.0,
...                       dof_min=0.1, dof_max=5.0,
...                       qmax=0.1, tol=4e-3)
>>> chi2mix.estimate_chi2mixture(x)
>>> pv = chi2mix.sf(x)
>>> print(pv[:4]) 
[0.2 0.2 0.2 0.2]
>>>
>>> print('%.2f' % chi2mix.scale)
1.98
>>> print('%.2f' % chi2mix.dof)
0.89
>>> print('%.2f' % chi2mix.mixture)
0.20

Ground truth evaluation

limix.stats.confusion_matrix(df, wsize=50000)[source]

Provide a couple of scores based on the idea of windows around genetic markers.

Parameters
  • causals – Indices defining the causal markers.

  • pos – Within-chromosome base-pair position of each candidate marker, in ascending order.

Kinship

limix.stats.linear_kinship(G, out=None, verbose=True)[source]

Estimate Kinship matrix via linear kernel.

Let 𝑑 be the number of columns of G. The resulting matrix is given by:

\[𝙺 = 𝚇𝚇ᵀ/𝑑\]

where

\[𝚇ᵢⱼ = (𝙶ᵢⱼ - 𝑚ⱼ) / 𝑠ⱼ\]

is the matrix G column-wise normalized by means 𝑚ⱼ and standard deviations 𝑠ⱼ. NaNs are ignored so as to produce matrix K having only real numbers.

This function is especially designed to also handle large matrices G that would otherwise require a large amount of memory if they were to be loaded into memory first. For those cases, libraries like Dask come in handy.

Parameters
  • G (array_like) – Samples-by-variants matrix.

  • out (ndarray) – A location into which the result is stored.

  • verbose (bool, optional) – True for showing progress; False otherwise. Defaults to True.

Examples

>>> from numpy.random import RandomState
>>> from limix.stats import linear_kinship
>>>
>>> random = RandomState(1)
>>> X = random.randn(4, 100)
>>> K = linear_kinship(X, verbose=False)
>>> print(K) 
[[ 0.91314823 -0.19283362 -0.34133897 -0.37897564]
 [-0.19283362  0.89885153 -0.2356003  -0.47041761]
 [-0.34133897 -0.2356003   0.95777313 -0.38083386]
 [-0.37897564 -0.47041761 -0.38083386  1.23022711]]

Likelihood-ratio test

limix.stats.lrt_pvalues(null_lml, alt_lmls, dof=1)[source]

Compute p-values from likelihood ratios.

These are likelihood ratio test p-values.

Parameters
  • null_lml (float) – Log of the marginal likelihood under the null hypothesis.

  • alt_lmls (array_like) – Log of the marginal likelihoods under the alternative hypotheses.

  • dof (int) – Degrees of freedom.

Returns

pvalues – P-values.

Return type

ndarray
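
Example

A minimal sketch with hypothetical log marginal likelihoods:

>>> from limix.stats import lrt_pvalues
>>>
>>> null_lml = -211.38
>>> alt_lmls = [-207.60, -211.10]
>>> pv = lrt_pvalues(null_lml, alt_lmls)
>>> print(["%.1e" % p for p in pv])
['6.0e-03', '4.5e-01']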

Principal component analysis

limix.stats.pca(X, ncomp)[source]

Principal component analysis.

Parameters
  • X (array_like) – Samples-by-dimensions array.

  • ncomp (int) – Number of components.

Returns

  • components (ndarray) – First components ordered by explained variance.

  • explained_variance (ndarray) – Explained variance.

  • explained_variance_ratio (ndarray) – Percentage of variance explained.

Examples

>>> from numpy import round
>>> from numpy.random import RandomState
>>> from limix.stats import pca
>>>
>>> X = RandomState(1).randn(4, 5)
>>> r = pca(X, ncomp=2)
>>> r['components']
array([[-0.75015369,  0.58346541, -0.07973564,  0.19565682, -0.22846925],
       [ 0.48842769,  0.72267548,  0.01968344, -0.46161623, -0.16031708]])
>>> r['explained_variance'] 
array([6.44655993, 0.51454938])
>>> r['explained_variance_ratio'] 
array([0.92049553, 0.07347181])

P-value correction

limix.stats.multipletests(pvals, alpha=0.05, method='hs', is_sorted=False)[source]

Test results and p-value correction for multiple tests.

Parameters
  • pvals (array_like) – Uncorrected p-values.

  • alpha (float) – FWER, family-wise error rate, e.g. 0.1.

  • method (string) –

    Method used for testing and adjustment of pvalues. Can be either the full name or initial letters. Available methods are

    `bonferroni` : one-step correction
    `sidak` : one-step correction
    `holm-sidak` : step down method using Sidak adjustments
    `holm` : step-down method using Bonferroni adjustments
    `simes-hochberg` : step-up method  (independent)
    `hommel` : closed method based on Simes tests (non-negative)
    `fdr_bh` : Benjamini/Hochberg  (non-negative)
    `fdr_by` : Benjamini/Yekutieli (negative)
    `fdr_tsbh` : two stage fdr correction (non-negative)
    `fdr_tsbky` : two stage fdr correction (non-negative)
    

  • is_sorted (bool) – If False (default), the p-values will be sorted, but the corrected p-values are in the original order. If True, it is assumed that the p-values are already sorted in ascending order.

Returns

  • reject (ndarray, boolean) – True for hypothesis that can be rejected for given alpha.

  • pvals_corrected (ndarray) – P-values corrected for multiple tests.

  • alphacSidak (float) – Corrected alpha for Sidak method.

  • alphacBonf (float) – Corrected alpha for Bonferroni method.

Notes

This is a wrapper around a function from the statsmodels package.
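
A minimal sketch with the Bonferroni method, which simply multiplies each p-value by the number of tests (behaviour follows statsmodels.stats.multitest.multipletests, which this function wraps):

>>> from limix.stats import multipletests
>>>
>>> pvals = [0.01, 0.02, 0.04]
>>> reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
>>> print(reject)
[ True False False]
>>> print(pvals_corrected.round(2))
[0.03 0.06 0.12]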

limix.stats.empirical_pvalues(xt, x0)[source]

Function to compute empirical p-values.

Compute empirical p-values from the test statistics observed on the data and the null test statistics (from permutations, parametric bootstraps, etc).

Parameters
  • xt (array_like) – Test statistics observed on data.

  • x0 (array_like) – Null test statistics. The minimum p-value that can be estimated is 1./len(x0).

Returns

pvalues – Estimated empirical p-values.

Return type

ndarray

Examples

>>> from numpy.random import RandomState
>>> from limix.stats import empirical_pvalues
>>>
>>> random = RandomState(1)
>>> x0 = random.chisquare(1, 5)
>>> x1 = random.chisquare(1, 10000)
>>>
>>> empirical_pvalues(x0, x1) 
array([0.56300000, 1.00000000, 0.83900000, 0.79820000, 0.58030000])

Model

We also export the underlying inference methods that limix uses. Their classes can be accessed via the limix.model module or via the glimix_core package itself; both ways work identically.
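
For instance, the following quick check illustrates the equivalence (a sketch; it assumes limix.model re-exports the glimix_core classes, as stated above):

>>> from limix.model.lmm import LMM as LimixLMM
>>> from glimix_core.lmm import LMM as CoreLMM
>>>
>>> print(LimixLMM is CoreLMM)
True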

Linear Mixed Models

class limix.model.lmm.LMM(y, X, QS=None, restricted=False)[source]

Fast Linear Mixed Models inference via maximum likelihood.

Examples

>>> from numpy import array
>>> from numpy_sugar.linalg import economic_qs_linear
>>> from glimix_core.lmm import LMM
>>>
>>> X = array([[1, 2], [3, -1]], float)
>>> QS = economic_qs_linear(X)
>>> covariates = array([[1], [1]])
>>> y = array([-1, 2], float)
>>> lmm = LMM(y, covariates, QS)
>>> lmm.fit(verbose=False)
>>> print('%.3f' % lmm.lml())
-3.649

One can also specify which parameters should be fitted:

>>> from numpy import array
>>> from numpy_sugar.linalg import economic_qs_linear
>>> from glimix_core.lmm import LMM
>>>
>>> X = array([[1, 2], [3, -1]], float)
>>> QS = economic_qs_linear(X)
>>> covariates = array([[1], [1]])
>>> y = array([-1, 2], float)
>>> lmm = LMM(y, covariates, QS)
>>> lmm.fix('delta')
>>> lmm.fix('scale')
>>> lmm.delta = 0.5
>>> lmm.scale = 1
>>> lmm.fit(verbose=False)
>>> print('%.3f' % lmm.lml())
-3.832
>>> lmm.unfix('delta')
>>> lmm.fit(verbose=False)
>>> print('%.3f' % lmm.lml())
-3.713

Notes

The LMM model can be equivalently written as

𝐲 ∼ 𝓝(𝚇𝜷, 𝑠((1-𝛿)𝙺 + 𝛿𝙸)),

and we thus have v₀ = s (1 - 𝛿) and v₁ = s 𝛿. Consider the economic eigendecomposition of 𝙺:

𝙺 = [𝚀₀  𝚀₁] [𝚂₀  𝟎] [𝚀₀ᵀ]
             [ 𝟎  𝟎] [𝚀₁ᵀ]

and let

𝙳 = [(1-𝛿)𝚂₀ + 𝛿𝙸₀    𝟎  ]
    [       𝟎        𝛿𝙸₁ ].

In order to eliminate the need of 𝚀₁, note that 𝚀𝚀ᵀ = 𝙸 implies that

𝚀₁𝚀₁ᵀ = 𝙸 - 𝚀₀𝚀₀ᵀ.

We will need to solve ((1-𝛿)𝙺 + 𝛿𝙸)𝐱 = 𝐮 for 𝐱. Let 𝙳₀ = ((1-𝛿)𝚂₀ + 𝛿𝙸₀) and let us define

𝙱 = 𝚀₀𝙳₀⁻¹𝚀₀ᵀ                   if 𝛿=0, and
𝙱 = 𝚀₀𝙳₀⁻¹𝚀₀ᵀ + 𝛿⁻¹(𝙸 - 𝚀₀𝚀₀ᵀ)  if 𝛿>0.

We therefore have

𝐱 = 𝙱𝐮.
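
The identity above can be verified numerically; a small NumPy sketch with illustrative shapes (not part of the original docstring):

>>> from numpy import allclose, eye
>>> from numpy.linalg import eigh, solve
>>> from numpy.random import RandomState
>>>
>>> random = RandomState(0)
>>> G = random.randn(4, 2)
>>> K = G @ G.T                  # rank-2, so 𝚀₁ is non-empty
>>> S, Q = eigh(K)               # eigenvalues in ascending order
>>> Q0, S0 = Q[:, 2:], S[2:]     # economic part: the non-zero eigenvalues
>>> delta = 0.5
>>> D0 = (1 - delta) * S0 + delta
>>> B = (Q0 / D0) @ Q0.T + (eye(4) - Q0 @ Q0.T) / delta
>>> u = random.randn(4)
>>> allclose(B @ u, solve((1 - delta) * K + delta * eye(4), u))
True
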
property X

Covariates matrix.

Returns

X – Covariates.

Return type

ndarray

property beta: numpy.ndarray

Fixed-effect sizes.

Returns

Optimal fixed-effect sizes.

Return type

effect-sizes

Notes

Setting the derivative of log(p(𝐲)) with respect to the effect sizes equal to zero leads to the solution 𝜷 of the equation

𝚇ᵀ𝙱𝚇𝜷 = 𝚇ᵀ𝙱𝐲

property beta_covariance

Estimates the covariance-matrix of the optimal beta.

Returns

beta-covariance – 𝑠(𝚇ᵀ𝙱𝚇)⁻¹.

Return type

ndarray

covariance()[source]

Covariance of the prior.

Returns

covariance – v₀𝙺 + v₁𝙸.

Return type

ndarray

property delta: float

Variance ratio between K and I.

fit(verbose=True)[source]

Maximise the marginal likelihood.

Parameters

verbose (bool, optional) – True for progress output; False otherwise. Defaults to True.

fix(param: str)[source]

Disable parameter optimization.

Parameters

param – Possible values are "delta", "beta", and "scale".

get_fast_scanner() → glimix_core.lmm._lmm_scan.FastScanner[source]

Return FastScanner for association scan.

Returns

Instance of a class designed to perform very fast association scan.

Return type

fast-scanner

gradient()[source]

Not implemented.

lml() → float[source]

Log of the marginal likelihood.

Returns

Log of the marginal likelihood.

Return type

lml

Notes

The log of the marginal likelihood is given by

2⋅log(p(𝐲)) = -n⋅log(2π) - n⋅log(s) - log|𝙳|
    -  (𝚀₀ᵀ𝐲)ᵀ(s𝙳₀)⁻¹(𝚀₀ᵀ𝐲)  -  (𝐲)ᵀ(s𝛿)⁻¹(𝐲)  +  (𝚀₀ᵀ𝐲)ᵀ(s𝛿)⁻¹(𝚀₀ᵀ𝐲)
    - (𝚀₀ᵀ𝚇𝜷)ᵀ(s𝙳₀)⁻¹(𝚀₀ᵀ𝚇𝜷) - (𝚇𝜷)ᵀ(s𝛿)⁻¹(𝚇𝜷) + (𝚀₀ᵀ𝚇𝜷)ᵀ(s𝛿)⁻¹(𝚀₀ᵀ𝚇𝜷)
    + 2(𝚀₀ᵀ𝐲)ᵀ(s𝙳₀)⁻¹(𝚇𝜷)    + 2(𝐲)ᵀ(s𝛿)⁻¹(𝚇𝜷) - 2(𝚀₀ᵀ𝐲)ᵀ(s𝛿)⁻¹(𝚀₀ᵀ𝚇𝜷)

By using the optimal 𝜷, the log of the marginal likelihood can be rewritten as:

2⋅log(p(𝐲)) = -n⋅log(2π) - n⋅log(s) - log|𝙳| + (𝚀₀ᵀ𝐲)ᵀ(s𝙳₀)⁻¹𝚀₀ᵀ(𝚇𝜷 - 𝐲)
            + (𝐲)ᵀ(s𝛿)⁻¹(𝚇𝜷 - 𝐲) - (𝚀₀ᵀ𝐲)ᵀ(s𝛿)⁻¹𝚀₀ᵀ(𝚇𝜷 - 𝐲).

In the extreme case where 𝜷 is such that 𝐲 = 𝚇𝜷, the maximum is attained as s→0.

For optimal 𝜷 and s, the log of the marginal likelihood can be further simplified to

2⋅log(p(𝐲; 𝜷, s)) = -n⋅log(2π) - n⋅log s - log|𝙳| - n.

mean()[source]

Mean of the prior.

Formally, 𝐦 = 𝚇𝜷.

Returns

mean – Mean of the prior.

Return type

ndarray

property ncovariates: int

Number of covariates.

property nsamples: int

Number of samples.

property scale: float

Scaling factor.

Returns

scale – Scaling factor.

Return type

float

Notes

Setting the derivative of log(p(𝐲; 𝜷)), for which 𝜷 is optimal, with respect to the scale equal to zero leads to the maximum

s = n⁻¹(Qᵀ𝐲)ᵀD⁻¹ Qᵀ(𝐲-𝚇𝜷).

In the case of restricted marginal likelihood

s = (n-c)⁻¹(Qᵀ𝐲)ᵀD⁻¹ Qᵀ(𝐲-𝚇𝜷),

where c is the number of covariates.

unfix(param: str)[source]

Enable parameter optimization.

Parameters

param – Possible values are "delta", "beta", and "scale".

property v0

First variance.

Returns

v0 – s(1 - 𝛿).

Return type

float

property v1: float

Second variance.

Returns

s𝛿.

Return type

v1

value() → float[source]

Internal use only.

class limix.model.lmm.Kron2Sum(Y, A, X, G, rank=1, restricted=False)[source]

LMM for multiple traits, fitted via maximum likelihood.

This implementation follows the work published in [CA05]. Let n, c, and p be the number of samples, covariates, and traits, respectively. The outcome variable Y is a n×p matrix distributed according to:

vec(Y) ~ N((A ⊗ X) vec(B), K = C₀ ⊗ GGᵀ + C₁ ⊗ I).

A and X are design matrices of dimensions p×p and n×c provided by the user, where X is the usual matrix of covariates commonly used in single-trait models. B is a c×p matrix of fixed-effect sizes per trait. G is a n×r matrix provided by the user and I is a n×n identity matrix. C₀ and C₁ are both symmetric matrices of dimensions p×p, for which C₁ is guaranteed by our implementation to be of full rank. The parameters of this model are the matrices B, C₀, and C₁.

For implementation purposes, we make use of the following definitions:

  • 𝛃 = vec(B)

  • M = A ⊗ X

  • H = MᵀK⁻¹M

  • Yₓ = LₓY

  • Yₕ = YₓLₕᵀ

  • Mₓ = LₓX

  • Mₕ = (LₕA) ⊗ Mₓ

  • mₕ = Mₕvec(B)

where Lₓ and Lₕ are defined in glimix_core.cov.Kron2SumCov.
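
A minimal usage sketch (shapes are illustrative and not part of the original docstring); since fitting maximises the marginal likelihood, lml() should not decrease:

>>> from numpy.random import RandomState
>>> from glimix_core.lmm import Kron2Sum
>>>
>>> random = RandomState(0)
>>> n, c, p = 5, 2, 3
>>> Y = random.randn(n, p)     # outcomes
>>> A = random.randn(p, p)     # trait design matrix
>>> X = random.randn(n, c)     # covariates
>>> G = random.randn(n, 4)
>>> lmm = Kron2Sum(Y, A, X, G)
>>> lml_before = lmm.lml()
>>> lmm.fit(verbose=False)
>>> lmm.lml() >= lml_before
True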

References

CA05

Casale, F. P., Rakitsch, B., Lippert, C., & Stegle, O. (2015). Efficient set tests for the genetic analysis of correlated traits. Nature methods, 12(8), 755.

property A

A from the equation 𝐦 = (A ⊗ X) vec(B).

Returns

A

Return type

ndarray

property B

Fixed-effect sizes B from 𝐦 = (A ⊗ X) vec(B).

Returns

fixed-effects – B from 𝐦 = (A ⊗ X) vec(B).

Return type

ndarray

property C0

C₀ from equation K = C₀ ⊗ GGᵀ + C₁ ⊗ I.

Returns

C0 – C₀.

Return type

ndarray

property C1

C₁ from equation K = C₀ ⊗ GGᵀ + C₁ ⊗ I.

Returns

C1 – C₁.

Return type

ndarray

property M

M = (A ⊗ X).

Returns

M – M from M = (A ⊗ X).

Return type

ndarray

property X

X from equation M = (A ⊗ X).

Returns

X – X from M = (A ⊗ X).

Return type

ndarray

property beta

Fixed-effect sizes 𝛃 = vec(B).

Returns

fixed-effects – 𝛃 from 𝛃 = vec(B).

Return type

ndarray

property beta_covariance

Estimates the covariance-matrix of the optimal beta.

Returns

beta-covariance – (MᵀK⁻¹M)⁻¹.

Return type

ndarray

covariance()[source]

Covariance K = C₀ ⊗ GGᵀ + C₁ ⊗ I.

Returns

covariance

Return type

ndarray

fit(verbose=True)[source]

Maximise the marginal likelihood.

Parameters

verbose (bool, optional) – True for progress output; False otherwise. Defaults to True.

get_fast_scanner()[source]

Return FastScanner for association scan.

Returns

Instance of a class designed to perform very fast association scan.

Return type

FastScanner

gradient()[source]

Gradient of the log of the marginal likelihood.

lml()[source]

Log of the marginal likelihood.

Let 𝐲 = vec(Y), M = A⊗X, and H = MᵀK⁻¹M. The restricted log of the marginal likelihood is given by [R07]:

2⋅log(p(𝐲)) = -(n⋅p - c⋅p) log(2π) + log(|MᵀM|) - log(|K|) - log(|H|)
    - (𝐲-𝐦)ᵀ K⁻¹ (𝐲-𝐦),

where 𝐦 = M𝛃 for 𝛃 = H⁻¹MᵀK⁻¹𝐲.

For implementation purposes, let X = (L₀ ⊗ G) and R = (L₁ ⊗ I)(L₁ ⊗ I)ᵀ. The covariance can be written as:

K = XXᵀ + R.

From the Woodbury matrix identity, we have

𝐲ᵀK⁻¹𝐲 = 𝐲ᵀR⁻¹𝐲 - 𝐲ᵀR⁻¹XZ⁻¹XᵀR⁻¹𝐲,

where Z = I + XᵀR⁻¹X. Note that R⁻¹ = (U₁S₁⁻¹U₁ᵀ) ⊗ I and

XᵀR⁻¹𝐲 = (L₀ᵀW ⊗ Gᵀ)𝐲 = vec(GᵀYWL₀),

where W = U₁S₁⁻¹U₁ᵀ. The term GᵀY needs to be calculated only once, forming an r×p matrix. We similarly have

XᵀR⁻¹M = (L₀ᵀWA) ⊗ (GᵀX),

for which GᵀX is pre-computed.

The log-determinant of the covariance matrix is given by

log(|K|) = log(|Z|) - log(|R⁻¹|) = log(|Z|) - 2·n·log(|U₁S₁⁻½|).

The log of the marginal likelihood can be rewritten as:

2⋅log(p(𝐲)) = -(n⋅p - c⋅p) log(2π) + log(|MᵀM|)
- log(|Z|) + 2·n·log(|U₁S₁⁻½|)
- log(|MᵀR⁻¹M - MᵀR⁻¹XZ⁻¹XᵀR⁻¹M|)
- 𝐲ᵀR⁻¹𝐲 + (𝐲ᵀR⁻¹X)Z⁻¹(XᵀR⁻¹𝐲)
- 𝐦ᵀR⁻¹𝐦 + (𝐦ᵀR⁻¹X)Z⁻¹(XᵀR⁻¹𝐦)
+ 2𝐲ᵀR⁻¹𝐦 - 2(𝐲ᵀR⁻¹X)Z⁻¹(XᵀR⁻¹𝐦).

Returns

lml – Log of the marginal likelihood.

Return type

float

References

R07

LaMotte, L. R. (2007). A direct derivation of the REML likelihood function. Statistical Papers, 48(2), 321-327.
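
The Woodbury step above can be checked numerically; a small sketch with illustrative shapes (not part of the original docstring):

>>> from numpy import allclose, eye
>>> from numpy.linalg import inv
>>> from numpy.random import RandomState
>>>
>>> random = RandomState(0)
>>> X = random.randn(6, 2)
>>> R = 2.0 * eye(6)           # stands in for (L₁ ⊗ I)(L₁ ⊗ I)ᵀ
>>> K = X @ X.T + R
>>> Z = eye(2) + X.T @ inv(R) @ X
>>> y = random.randn(6)
>>> lhs = y @ inv(K) @ y
>>> rhs = y @ inv(R) @ y - y @ inv(R) @ X @ inv(Z) @ X.T @ inv(R) @ y
>>> allclose(lhs, rhs)
True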

mean()[source]

Mean 𝐦 = (A ⊗ X) vec(B).

Returns

mean – 𝐦.

Return type

ndarray

property ncovariates

Number of covariates, c.

property nsamples

Number of samples, n.

property ntraits

Number of traits, p.

value()[source]

Log of the marginal likelihood.

Generalised Linear Mixed Models

class limix.model.glmm.GLMMNormal(eta, tau, X, QS=None)[source]

LMM with heterogeneous Normal likelihoods.

We model

\[\tilde{\boldsymbol\mu} \sim \mathcal N(\mathrm X\boldsymbol\beta, v_0 \mathrm K + v_1 \mathrm I + \tilde{\Sigma}),\]

where \(\tilde{\boldsymbol\mu}\) and \(\tilde{\Sigma}\) are typically given by an EP approximation to a non-Normal likelihood (please refer to glimix_core.expfam.GLMMExpFam). If that is the case, \(\tilde{\Sigma}\) is a diagonal matrix with non-negative values. Those EP parameters are given to this class via their natural forms: eta and tau.

Parameters
  • eta (array_like) – \([\tilde{\mu}_0\tilde{\sigma}^{-2}_0 \quad \tilde{\mu}_1\tilde{\sigma}^{-2}_1 \quad\cdots\quad \tilde{\mu}_n\tilde{\sigma}^{-2}_n]\).

  • tau (array_like) – \([\tilde{\sigma}^{-2}_0\quad\tilde{\sigma}^{-2}_1 \quad\cdots\quad \tilde{\sigma}^{-2}_n]\).

  • X (array_like) – Covariates.

  • QS (tuple) – Economic eigendecomposition of \(\mathrm K\).
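
A minimal construction sketch (the EP parameters below are synthetic stand-ins for illustration; they are not part of the original docstring):

>>> from numpy.random import RandomState
>>> from numpy_sugar.linalg import economic_qs_linear
>>> from glimix_core.glmm import GLMMNormal
>>>
>>> random = RandomState(0)
>>> nsamples = 10
>>> X = random.randn(nsamples, 2)
>>> G = random.randn(nsamples, 50)
>>> QS = economic_qs_linear(G)
>>> tau = 10 * random.rand(nsamples)    # site precisions; must be positive
>>> eta = random.randn(nsamples) * tau  # natural parameter: mean times precision
>>> gnormal = GLMMNormal(eta, tau, X, QS)
>>> lml = gnormal.value()               # log of the marginal likelihood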

property beta

Fixed-effect sizes.

Returns

\(\boldsymbol\beta\).

Return type

numpy.ndarray

fix(var_name)[source]

Prevent a variable from being adjusted.

Parameters

var_name (str) – Variable name.

get_fast_scanner()[source]

Return glimix_core.lmm.FastScanner for the current delta.

unfix(var_name)[source]

Allow a variable to be adjusted.

Parameters

var_name (str) – Variable name.

value()[source]

Log of the marginal likelihood.

Formally,

\[- \frac{n}{2}\log{2\pi} - \frac{1}{2} \log{\left| v_0 \mathrm K + v_1 \mathrm I + \tilde{\Sigma} \right|} - \frac{1}{2} \left(\tilde{\boldsymbol\mu} - \mathrm X\boldsymbol\beta\right)^{\intercal} \left( v_0 \mathrm K + v_1 \mathrm I + \tilde{\Sigma} \right)^{-1} \left(\tilde{\boldsymbol\mu} - \mathrm X\boldsymbol\beta\right)\]

Returns

\(\log{p(\tilde{\boldsymbol\mu})}\)

Return type

float

class limix.model.glmm.GLMMExpFam(y, lik, X, QS=None, n_int=1000, rtol=1.49e-05, atol=1.49e-08)[source]

Generalised Linear Gaussian Processes implementation.

It implements inference over GLMMs via the Expectation Propagation [Min01] algorithm. It currently supports the "Bernoulli", "Probit", "Binomial", and "Poisson" likelihoods. (For the heterogeneous Normal likelihood, please refer to glimix_core.glmm.GLMMNormal for closed-form inference.)

Parameters
  • y (array_like) – Outcome variable.

  • lik (tuple) – Likelihood definition. The first item is one of the following likelihood names: "Bernoulli", "Binomial", "Normal", and "Poisson". For Binomial, the second item is an array of outcomes.

  • X (array_like) – Covariates.

  • QS (tuple) – Economic eigendecomposition.

Example

>>> from numpy import dot, sqrt, zeros
>>> from numpy.random import RandomState
>>>
>>> from numpy_sugar.linalg import economic_qs
>>>
>>> from glimix_core.glmm import GLMMExpFam
>>>
>>> random = RandomState(0)
>>> nsamples = 10
>>>
>>> X = random.randn(nsamples, 2)
>>> G = random.randn(nsamples, 100)
>>> K = dot(G, G.T)
>>> ntrials = random.randint(1, 100, nsamples)
>>> z = dot(G, random.randn(100)) / sqrt(100)
>>>
>>> successes = zeros(len(ntrials), int)
>>> for i in range(len(ntrials)):
...    successes[i] = sum(z[i] + 0.2 * random.randn(ntrials[i]) > 0)
>>>
>>> QS = economic_qs(K)
>>>
>>> glmm = GLMMExpFam(successes, ('binomial', ntrials), X, QS)
>>> print('Before: %.2f' % glmm.lml())
Before: -16.40
>>> glmm.fit(verbose=False)
>>> print('After: %.2f' % glmm.lml())
After: -13.43

property beta

Fixed-effect sizes.

Returns

\(\boldsymbol\beta\).

Return type

numpy.ndarray

covariance()[source]

Covariance of the prior.

Returns

\(v_0 \mathrm K + v_1 \mathrm I\).

Return type

numpy.ndarray

fit(verbose=True, factr=100000.0, pgtol=1e-07)[source]

Maximise the marginal likelihood.

Parameters
  • verbose (bool) – True for progress output; False otherwise. Defaults to True.

  • factr (float, optional) – The iteration stops when (f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= factr * eps, where eps is the machine precision.

  • pgtol (float, optional) – The iteration will stop when max{|proj g_i | i = 1, ..., n} <= pgtol where pg_i is the i-th component of the projected gradient.

Notes

Please refer to scipy.optimize.fmin_l_bfgs_b() for further information about factr and pgtol.

fix(var_name)[source]

Prevent a variable from being adjusted.

Parameters

var_name (str) – Variable name.

gradient()[source]

Gradient of the log of the marginal likelihood.

Returns

Map between variables to their gradient values.

Return type

dict

posteriori_covariance()[source]

Covariance of the estimated posterior.

posteriori_mean()[source]

Mean of the estimated posterior.

This is also the maximum a posteriori estimate of the latent variable.

unfix(var_name)[source]

Allow a variable to be adjusted.

Parameters

var_name (str) – Variable name.

Gaussian Process

class limix.model.gp.GP(y, mean, cov)[source]

Gaussian Process inference via maximum likelihood.

Parameters
  • y (array_like) – Outcome variable.

  • mean (function) – Mean function. (Refer to Mean functions.)

  • cov (function) – Covariance function. (Refer to Covariance functions.)

Example

>>> from numpy.random import RandomState
>>>
>>> from glimix_core.example import offset_mean
>>> from glimix_core.example import linear_eye_cov
>>> from glimix_core.gp import GP
>>> from glimix_core.random import GPSampler
>>>
>>> random = RandomState(94584)
>>>
>>> mean = offset_mean()
>>> cov = linear_eye_cov()
>>>
>>> y = GPSampler(mean, cov).sample(random)
>>>
>>> gp = GP(y, mean, cov)
>>> print('Before: %.4f' % gp.lml())
Before: -15.5582
>>> gp.fit(verbose=False)
>>> print('After: %.4f' % gp.lml())
After: -13.4791
>>> print(gp)  
GP(...)
  lml: -13.47907874997517
  OffsetMean(): OffsetMean
    offset: 0.7755803668772308
  SumCov(covariances=...): SumCov
    LinearCov(): LinearCov
      scale: 2.061153622438558e-09
    EyeCov(dim=10): EyeCov
      scale: 0.8675680523425126

fit(verbose=True, factr=100000.0, pgtol=1e-07)[source]

Maximise the marginal likelihood.

Parameters
  • verbose (bool) – True for progress output; False otherwise. Defaults to True.

  • factr (float, optional) – The iteration stops when (f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= factr * eps, where eps is the machine precision.

  • pgtol (float, optional) – The iteration will stop when max{|proj g_i | i = 1, ..., n} <= pgtol where pg_i is the i-th component of the projected gradient.

Notes

Please refer to scipy.optimize.fmin_l_bfgs_b() for further information about factr and pgtol.

lml()[source]

Log of the marginal likelihood.

Returns

\(\log p(\mathbf y)\)

Return type

float

Generalised Gaussian Process

class limix.model.ggp.ExpFamGP(y, lik, mean, cov)[source]

Expectation Propagation for Generalised Gaussian Processes.

Parameters
  • y (array_like) – Outcome variable.

  • lik (str) – Likelihood name.

  • mean (function) – Mean function. (Refer to Mean functions.)

  • cov (function) – Covariance function. (Refer to Covariance functions.)

Example

>>> from numpy.random import RandomState
>>>
>>> from glimix_core.example import offset_mean
>>> from glimix_core.example import linear_eye_cov
>>> from glimix_core.ggp import ExpFamGP
>>> from glimix_core.lik import BernoulliProdLik
>>> from glimix_core.link import LogitLink
>>> from glimix_core.random import GGPSampler
>>>
>>> random = RandomState(1)
>>>
>>> lik = BernoulliProdLik(LogitLink())
>>> mean = offset_mean()
>>> cov = linear_eye_cov()
>>>
>>> y = GGPSampler(lik, mean, cov).sample(random)
>>>
>>> ggp = ExpFamGP(y, 'bernoulli', mean, cov)
>>> print('Before: %.4f' % ggp.lml())
Before: -6.9802
>>> ggp.fit(verbose=False)
>>> print('After: %.2f' % ggp.lml())
After: -6.55
fit(verbose=True, factr=100000.0, pgtol=1e-07)[source]

Maximise the marginal likelihood.

Parameters
  • verbose (bool) – True for progress output; False otherwise. Defaults to True.

  • factr (float, optional) – The iteration stops when (f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= factr * eps, where eps is the machine precision.

  • pgtol (float, optional) – The iteration will stop when max{|proj g_i | i = 1, ..., n} <= pgtol where pg_i is the i-th component of the projected gradient.

Notes

Please refer to scipy.optimize.fmin_l_bfgs_b() for further information about factr and pgtol.

lml()[source]

Log of the marginal likelihood.

Returns

\(\log p(\mathbf y)\)

Return type

float

Covariance

class limix.model.cov.EyeCov(dim)[source]

Identity covariance function, K = s·I.

The parameter s is the scale of the matrix.

Example

>>> from glimix_core.cov import EyeCov
>>>
>>> cov = EyeCov(2)
>>> cov.scale = 2.5
>>> print(cov.value())
[[2.5 0. ]
 [0.  2.5]]
>>> g = cov.gradient()
>>> print(g['logscale'])
[[2.5 0. ]
 [0.  2.5]]
>>> cov.name = "I"
>>> print(cov)
EyeCov(dim=2): I
  scale: 2.5

Parameters

dim (int) – Matrix dimension, d.

property dim

Dimension of the matrix, d.

It corresponds to the number of rows and to the number of columns.

gradient()[source]

Derivative of the covariance matrix over log(s), s⋅I.

Returns

logscale – s⋅I, for scale s and a d×d identity matrix I.

Return type

ndarray

property scale

Scale parameter.

value()[source]

Covariance matrix.

Returns

K – s⋅I, for scale s and a d×d identity matrix I.

Return type

ndarray

class limix.model.cov.FreeFormCov(dim)[source]

General positive definite matrix, K = LLᵀ + ϵI.

A d×d covariance matrix K will have ((d+1)⋅d)/2 parameters defining the lower triangular elements of a Cholesky matrix L such that:

K = LLᵀ + ϵI,

for a very small positive number ϵ. That additional term is necessary to avoid singular and ill-conditioned covariance matrices.

Example

>>> from glimix_core.cov import FreeFormCov
>>>
>>> cov = FreeFormCov(2)
>>> cov.L = [[1., .0], [0.5, 2.]]
>>> print(cov.gradient()["Lu"])
[[[0.  2.  0. ]
  [1.  0.5 0. ]]

 [[1.  0.5 0. ]
  [1.  0.  8. ]]]
>>> cov.name = "K"
>>> print(cov)
FreeFormCov(dim=2): K
  L: [[1.  0. ]
      [0.5 2. ]]

property L

Lower-triangular matrix L such that K = LLᵀ + ϵI.

Returns

L – Lower-triangular matrix.

Return type

(d, d) ndarray

property Lu

Lower-triangular, flat part of L.

eigh()[source]

Eigen decomposition of K.

Returns

  • S (ndarray) – The eigenvalues in ascending order, each repeated according to its multiplicity.

  • U (ndarray) – Normalized eigenvectors.

fix()[source]

Disable parameter optimisation.

gradient()[source]

Derivative of the covariance matrix over the parameters of L.

Returns

Lu – Derivative of K over the lower triangular part of L.

Return type

ndarray

listen(func)[source]

Listen to parameter changes.

Parameters

func (callable) – Function to be called when a parameter changes.

logdet()[source]

Log of |K|.

Returns

Log-determinant of K.

Return type

float

property nparams

Number of parameters.

property shape

Array shape.

unfix()[source]

Enable parameter optimisation.

value()[source]

Covariance matrix.

Returns

K – Matrix K = LLᵀ + ϵI, for a very small positive number ϵ.

Return type

ndarray

class limix.model.cov.GivenCov(K0)[source]

Given covariance function, K = s⋅K₀.

The covariance matrix is the provided matrix K₀ scaled by s: K = s⋅K₀.

Example

>>> from glimix_core.cov import GivenCov
>>> from numpy import dot
>>> from numpy.random import RandomState
>>>
>>> G = RandomState(0).randn(5, 3)
>>> K0 = dot(G, G.T)
>>> cov = GivenCov(K0)
>>> cov.scale = 1.3
>>> cov.name = "K"
>>> print(cov)
GivenCov(K0=...): K
  scale: 1.3

gradient()[source]

Derivative of the covariance matrix over log(s).

Returns

logscale – s⋅K₀.

Return type

float

property scale

Scale parameter, s.

value()[source]

Covariance matrix, s⋅K₀.

Returns

K – s⋅K₀.

Return type

ndarray

class limix.model.cov.LinearCov(X)[source]

Linear covariance function, K = s⋅XXᵀ.

The mathematical representation is s⋅XXᵀ, for a n×r matrix X provided by the user and a scaling parameter s.

Example

>>> from glimix_core.cov import LinearCov
>>> from numpy import dot
>>> from numpy.random import RandomState
>>>
>>> X = RandomState(0).randn(2, 3)
>>> cov = LinearCov(X)
>>> cov.scale = 1.3
>>> cov.name = "K"
>>> print(cov)
LinearCov(): K
  scale: 1.3

property X

Matrix X from K = s⋅XXᵀ.

fix()[source]

Prevent s update during optimization.

gradient()[source]

Derivative of the covariance matrix over log(s).

Returns

logscale – s⋅XXᵀ.

Return type

ndarray

property scale

Scale parameter.

unfix()[source]

Enable s update during optimization.

value()[source]

Covariance matrix.

Returns

K – s⋅XXᵀ.

Return type

ndarray

class limix.model.cov.SumCov(covariances)[source]

Sum of covariance functions, K = K₀ + K₁ + ⋯.

Example

>>> from glimix_core.cov import LinearCov, SumCov
>>> from numpy.random import RandomState
>>>
>>> random = RandomState(0)
>>> cov_left = LinearCov(random.randn(4, 20))
>>> cov_right = LinearCov(random.randn(4, 15))
>>> cov_left.scale = 0.5
>>>
>>> cov = SumCov([cov_left, cov_right])
>>> cov_left.name = "A"
>>> cov_right.name = "B"
>>> cov.name = "A+B"
>>> print(cov)
SumCov(covariances=...): A+B
  LinearCov(): A
    scale: 0.5
  LinearCov(): B
    scale: 1.0

gradient()[source]

Sum of covariance function derivatives.

Returns

∂K₀ + ∂K₁ + ⋯

Return type

dict

value()[source]

Sum of covariance matrices.

Returns

K – K₀ + K₁ + ⋯

Return type

ndarray

class limix.model.cov.LRFreeFormCov(n, m)[source]

General positive semi-definite matrix of low rank, K = LLᵀ.

The covariance matrix K is given by LLᵀ, where L is a n×m matrix and n≥m. Therefore, K will have rank(K) ≤ m.

Example

>>> from glimix_core.cov import LRFreeFormCov
>>> cov = LRFreeFormCov(3, 2)
>>> print(cov.L)
[[1. 1.]
 [1. 1.]
 [1. 1.]]
>>> cov.L = [[1, 2], [0, 3], [1, 3]]
>>> print(cov.L)
[[1. 2.]
 [0. 3.]
 [1. 3.]]
>>> cov.name = "F"
>>> print(cov)
LRFreeFormCov(n=3, m=2): F
  L: [[1. 2.]
      [0. 3.]
      [1. 3.]]

property L

Matrix L from K = LLᵀ.

Returns

L – Parametric matrix.

Return type

(n, m) ndarray

property Lu

Lower-triangular, flat part of L.

fix()[source]

Disable parameter optimisation.

gradient()[source]

Derivative of the covariance matrix over the lower triangular, flat part of L.

It is equal to

∂K/∂Lᵢⱼ = ALᵀ + LAᵀ,

where A is an n×m matrix of zeros except for entry (i, j), which equals 1.

Returns

Lu – Derivative of K over the lower-triangular, flat part of L.

Return type

ndarray

listen(func)[source]

Listen to parameter changes.

Parameters

func (callable) – Function to be called when a parameter changes.

property nparams

Number of parameters.

property shape

Array shape.

unfix()[source]

Enable parameter optimisation.

value()[source]

Covariance matrix.

Returns

K – K = LLᵀ.

Return type

(n, n) ndarray

class limix.model.cov.Kron2SumCov(G, dim, rank)[source]

Implements K = C₀ ⊗ GGᵀ + C₁ ⊗ I.

C₀ and C₁ are d×d symmetric matrices. C₀ is a positive semi-definite matrix while C₁ is a positive definite one. G is a n×m matrix and I is a n×n identity matrix. Let M = Uₘ Sₘ Uₘᵀ be the eigen decomposition of any matrix M. The documentation and implementation of this class make use of the following definitions:

  • X = GGᵀ = Uₓ Sₓ Uₓᵀ

  • C₁ = U₁ S₁ U₁ᵀ

  • Cₕ = S₁⁻½ U₁ᵀ C₀ U₁ S₁⁻½

  • Cₕ = Uₕ Sₕ Uₕᵀ

  • D = (Sₕ ⊗ Sₓ + Iₕₓ)⁻¹

  • Lₓ = Uₓᵀ

  • Lₕ = Uₕᵀ S₁⁻½ U₁ᵀ

  • L = Lₕ ⊗ Lₓ

The above definitions allow us to write the inverse of the covariance matrix as:

K⁻¹ = LᵀDL,

where D is a diagonal matrix.

Example

>>> from numpy import array
>>> from glimix_core.cov import Kron2SumCov
>>>
>>> G = array([[-1.5, 1.0], [-1.5, 1.0], [-1.5, 1.0]])
>>> Lr = array([[3], [2]], float)
>>> Ln = array([[1, 0], [2, 1]], float)
>>>
>>> cov = Kron2SumCov(G, 2, 1)
>>> cov.C0.L = Lr
>>> cov.C1.L = Ln
>>> print(cov)
Kron2SumCov(G=..., dim=2, rank=1): Kron2SumCov
  LRFreeFormCov(n=2, m=1): C₀
    L: [[3.]
        [2.]]
  FreeFormCov(dim=2): C₁
    L: [[1. 0.]
        [2. 1.]]
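
Continuing the example, the identity K⁻¹ = LᵀDL behind solve() can be checked against a dense solve; a small sketch (not part of the original docstring):

>>> from numpy import allclose
>>> from numpy.linalg import solve
>>> from numpy.random import RandomState
>>>
>>> u = RandomState(0).randn(6)  # n⋅p = 3⋅2 = 6
>>> allclose(cov.solve(u), solve(cov.value(), u))
True
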
property C0

Semi-definite positive matrix C₀.

property C1

Definite positive matrix C₁.

property D

(Sₕ ⊗ Sₓ + Iₕₓ)⁻¹.

property G

User-provided matrix G, n×m.

Ge

Result of US from the SVD decomposition G = USVᵀ.

LdKL_dot(v)[source]

Implements L(∂K)Lᵀv.

The array v can have one or two dimensions and the first dimension has to have size n⋅p.

Let vec(V) = v. We have

L(∂K)Lᵀ⋅v = ((Lₕ∂C₀Lₕᵀ) ⊗ (LₓGGᵀLₓᵀ))vec(V) = vec(LₓGGᵀLₓᵀVLₕ∂C₀Lₕᵀ),

when the derivative is over the parameters of C₀. Similarly,

L(∂K)Lᵀv = ((Lₕ∂C₁Lₕᵀ) ⊗ (LₓLₓᵀ))vec(V) = vec(LₓLₓᵀVLₕ∂C₁Lₕᵀ),

over the parameters of C₁.

property Lh

Lₕ.

property Lx

Lₓ.

gradient()[source]

Gradient of K.

Returns

  • C0 (ndarray) – Derivative of C₀ over its parameters.

  • C1 (ndarray) – Derivative of C₁ over its parameters.

gradient_dot(v)[source]

Implements ∂K⋅v.

Parameters

v (array_like) – Vector from ∂K⋅v.

Returns

  • C0.Lu (ndarray) – ∂K⋅v, where the gradient is taken over the C₀ parameters.

  • C1.Lu (ndarray) – ∂K⋅v, where the gradient is taken over the C₁ parameters.

listen(func)[source]

Listen to parameter changes.

Parameters

func (callable) – Function to be called when a parameter changes.

logdet()[source]

Implements log|K| = - log|D| + n⋅log|C₁|.

Returns

logdet – Log-determinant of K.

Return type

float

logdet_gradient()[source]

Implements ∂log|K| = Tr[K⁻¹∂K].

It can be shown that:

∂log|K| = diag(D)ᵀdiag(L(∂K)Lᵀ) = diag(D)ᵀ(diag(Lₕ∂C₀Lₕᵀ)⊗diag(LₓGGᵀLₓᵀ)),

when the derivative is over the parameters of C₀. Similarly,

∂log|K| = diag(D)ᵀdiag(L(∂K)Lᵀ) = diag(D)ᵀ(diag(Lₕ∂C₁Lₕᵀ)⊗diag(I)),

over the parameters of C₁.

Returns

  • C0 (ndarray) – Derivative of C₀ over its parameters.

  • C1 (ndarray) – Derivative of C₁ over its parameters.

property nparams

Number of parameters.

solve(v)[source]

Implements the product K⁻¹⋅v.

Parameters

v (array_like) – Array to be multiplied.

Returns

x – Solution x to the equation K⋅x = v.

Return type

ndarray

value()[source]

Covariance matrix K = C₀ ⊗ GGᵀ + C₁ ⊗ I.

Returns

K – C₀ ⊗ GGᵀ + C₁ ⊗ I.

Return type

ndarray

Mean

class limix.model.mean.OffsetMean(n)[source]

Offset mean function, θ⋅𝟏.

It represents a mean vector 𝐦 = θ⋅𝟏 of size n. The offset is given by the parameter θ.

Example

>>> from glimix_core.mean import OffsetMean
>>>
>>> mean = OffsetMean(3)
>>> mean.offset = 2.0
>>> print(mean.value())
[2. 2. 2.]
>>> print(mean.gradient())
{'offset': array([1., 1., 1.])}
>>> mean.name = "𝐦"
>>> print(mean)
OffsetMean(): 𝐦
  offset: 2.0

fix_offset()[source]

Prevent θ update during optimization.

gradient()[source]

Gradient of the offset function.

Returns

offset – Vector 𝟏.

Return type

(n,) ndarray

property offset

Offset parameter.

unfix_offset()[source]

Enable θ update during optimization.

value()[source]

Offset mean.

Returns

𝐦 – θ⋅𝟏.

Return type

(n,) ndarray

class limix.model.mean.LinearMean(X)[source]

Linear mean function, X𝜶.

It defines X𝜶, for which X is a n×m matrix provided by the user and 𝜶 is a vector of size m.

Example

>>> from glimix_core.mean import LinearMean
>>>
>>> mean = LinearMean([[1.5, 0.2], [0.5, 0.4]])
>>> mean.effsizes = [1.0, -1.0]
>>> print(mean.value())
[1.3 0.1]
>>> print(mean.gradient()["effsizes"])
[[1.5 0.2]
 [0.5 0.4]]
>>> mean.name = "𝐦"
>>> print(mean)
LinearMean(m=2): 𝐦
  effsizes: [ 1. -1.]

property X

An n×m matrix of covariates.

property effsizes

Effect-sizes parameter, 𝜶, of size m.

gradient()[source]

Gradient of the linear mean function over the effect sizes.

Returns

effsizes

Return type

(n, m) ndarray

value()[source]

Linear mean function.

Returns

𝐦 – X𝜶.

Return type

(n,) ndarray

class limix.model.mean.SumMean(means)[source]

Sum mean function, 𝐟₀ + 𝐟₁ + ….

The mathematical representation is

𝐦 = 𝐟₀ + 𝐟₁ + …

In other words, it is a sum of mean vectors.

Example

>>> from glimix_core.mean import OffsetMean, LinearMean, SumMean
>>>
>>> X = [[5.1, 1.0],
...      [2.1, -0.2]]
>>>
>>> mean0 = LinearMean(X)
>>> mean0.effsizes = [-1.0, 0.5]
>>>
>>> mean1 = OffsetMean(2)
>>> mean1.offset = 2.0
>>>
>>> mean = SumMean([mean0, mean1])
>>>
>>> print(mean.value())
[-2.6 -0.2]
>>> g = mean.gradient()
>>> print(g["SumMean[0].effsizes"])
[[ 5.1  1. ]
 [ 2.1 -0.2]]
>>> print(g["SumMean[1].offset"])
[1. 1.]
>>> mean0.name = "A"
>>> mean1.name = "B"
>>> mean.name = "A+B"
>>> print(mean)
SumMean(means=...): A+B
  LinearMean(m=2): A
    effsizes: [-1.   0.5]
  OffsetMean(): B
    offset: 2.0

gradient()[source]

Sum of mean function derivatives.

Returns

∂𝐦 – ∂𝐟₀ + ∂𝐟₁ + ….

Return type

dict

value()[source]

Sum of mean vectors, 𝐟₀ + 𝐟₁ + ….

Returns

𝐦 – 𝐟₀ + 𝐟₁ + ….

Return type

ndarray

class limix.model.mean.KronMean(A, X)[source]

Kronecker mean function, (A⊗X)vec(B).

Let

  • n be the number of samples;

  • p the number of traits; and

  • c the number of covariates.

The mathematical representation is

𝐦 = (A⊗X)vec(B)

where A is a p×p trait design matrix of fixed effects and X is a n×c sample design matrix of fixed effects. B is a c×p matrix of fixed-effect sizes.
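
A minimal usage sketch (values are illustrative; it assumes B is settable, as the effect sizes of the other mean functions are):

>>> from numpy import array
>>> from glimix_core.mean import KronMean
>>>
>>> A = array([[1.0, 0.0], [0.0, 1.0]])              # p×p trait design, p=2
>>> X = array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # n×c sample design, n=3, c=2
>>> mean = KronMean(A, X)
>>> mean.B = array([[1.0, -1.0], [0.5, 0.0]])        # c×p effect sizes
>>> print(mean.value().shape)
(6,)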

property A

Matrix A.

property AX

A ⊗ X.

property B

Effect-sizes parameter, B.

property X

Matrix X.

gradient()[source]

Gradient of the linear mean function.

Returns

vecB – Derivative of M over vec(B).

Return type

ndarray

property nparams

Number of parameters.

value()[source]

Kronecker mean function.

Returns

𝐦 – (A⊗X)vec(B).

Return type

ndarray

API reference

I/O module

limix.io.bgen.read(filepath[, …])

Read a given BGEN file.

limix.io.bimbam.read_phenotype(filepath[, …])

Read a BIMBAM phenotype file.

limix.io.csv.read(filename[, sep, header, …])

Read a CSV file.

limix.io.gen.read(prefix[, verbose])

Read GEN files into Pandas data frames.

limix.io.hdf5.read_limix(filepath)

Read the HDF5 limix file format.

limix.io.npy.read(filepath[, verbose])

Read NumPy arrays saved in a file.

limix.io.plink.read(prefix[, verbose])

Read PLINK files into Pandas data frames.

Quality control

limix.qc.boxcox(x)

Box-Cox transformation for normality conformance.

limix.qc.compute_maf(X)

Compute minor allele frequencies.

limix.qc.count_missingness(X)

Count the number of missing values per column.

limix.qc.indep_pairwise(X, window_size, …)

Determine pair-wise independent variants.

limix.qc.mean_impute(X[, axis, inplace])

Impute NaN values.

limix.qc.mean_standardize(X[, axis, inplace])

Zero-mean and one-deviation normalisation.

limix.qc.normalise_covariance(K[, out])

Variance rescaling of covariance matrix 𝙺.

limix.qc.quantile_gaussianize(X[, axis, inplace])

Normalize a sequence of values via rank and Normal c.d.f.

limix.qc.remove_dependent_cols(X[, tol])

Remove dependent columns.

limix.qc.unique_variants(X)

Filters out variants with the same genetic profile.

Statistics

limix.stats.Chi2Mixture([scale_min, …])

Mixture of 𝜒² distributions.

limix.stats.allele_expectation(p, nalleles, …)

Allele expectation.

limix.stats.allele_frequency(X)

Compute allele frequency from its expectation.

limix.stats.compute_dosage(X[, alt])

Compute dosage from allele expectation.

limix.stats.confusion_matrix(df[, wsize])

Provide a couple of scores based on the idea of windows around genetic markers.

limix.stats.empirical_pvalues(xt, x0)

Function to compute empirical p-values.

limix.stats.linear_kinship(G[, out, verbose])

Estimate Kinship matrix via linear kernel.

limix.stats.lrt_pvalues(null_lml, alt_lmls)

Compute p-values from likelihood ratios.

limix.stats.multipletests(pvals[, alpha, …])

Test results and p-value correction for multiple tests.

limix.stats.multivariate_normal(random, …)

Draw random samples from a multivariate normal distribution.

limix.stats.pca(X, ncomp)

Principal component analysis.

Heritability estimation

limix.her.estimate(y, lik, K[, M, verbose])

Estimate the so-called narrow-sense heritability.

Variance decomposition

limix.vardec.VarDec(y[, lik, M])

Variance decomposition through GLMMs.

Quantitative trait loci

limix.qtl.scan(G, Y[, lik, K, M, idx, A, …])

Multi-trait association and interaction testing via linear mixed models.

limix.qtl.iscan(G, y[, lik, K, M, idx, E0, …])

Single-trait association with interaction test via generalized linear mixed models.

Plotting & Graphics

limix.plot.box_aspect([ax])

Change to box aspect considering the plotted points.

limix.plot.ConsensusCurve()

Consolidate multiple curves into a single one.

limix.plot.image(file[, ax])

Show an image.

limix.plot.kinship(K[, nclusters, img_kws, ax])

Plot heatmap of a kinship matrix.

limix.plot.load_dataset(name)

Example datasets.

limix.plot.manhattan(data[, colora, colorb, …])

Produce a manhattan plot.

limix.plot.normal(x[, bins, nstd, ax])

Plot a fit of a normal distribution to the data in x.

limix.plot.pca(X[, pts_kws, ax])

Plot the first two principal components of a design matrix.

limix.plot.power(pv[, label, alphas, …])

Plot number of hits across significance levels.

limix.plot.qqplot(a[, label, alpha, cutoff, …])

Quantile-Quantile plot of observed p-values versus theoretical ones.

limix.plot.get_pyplot()

Get matplotlib.pyplot.

limix.plot.show()

Shell utilities

limix.sh.filehash(filepath)

Compute sha256 from a given file.

limix.sh.download(url[, dest, verbose])

Download file.

limix.sh.extract(filepath[, verbose])

Extract a compressed file.

limix.sh.remove(filepath)

Remove file.

Comments and bugs

You can get the source code and open issues on GitHub.