PKwKm$!gat-latest/.buildinfo# Sphinx build info version 1 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. config: tags: PKwKJssgat-latest/objects.inv# Sphinx inventory version 2 # Project: gat # Version: 1.0 # The remainder of this file is compressed using zlib. xڥWr +I2.6鮻$3xS;kY 4^@NbD66sW+;j G mHyI*)  gg8-WqNJrqqQP!b2;Z ZSrfq9)#Qu^"Ml{`!y[ke:ϭA&tw=Bet^NJ݊pW:(y[ʫ'~T&W[R"4cwN!=2 7wT,:f ^lVOO:e[n ]}boZBXX _D˜=QeJ}&B0 Űwr{R=V/RF"tU^=[.+Uҽ~/t"^8!-fm^9*: Rt:o[ȍ=Gӻ̅V*NEr4>R9S^?p0dS=ܧyS:Y*Bhfk{*p֗XHEpRcߓQ2XHRx5; }ՎSO_bSx2&Πe @ Mx!؃A _&o?Ot,c~?6}> vϲ-N[&vd5xnM3Ls|)ig련頣b/;ra,.FPKwKZsrrgat-latest/contents.html gat 1.0 documentation

Genomic Association Tester (GAT)

Welcome to the home page of the Genomic Association Tester (GAT).

Overview

A common question in genomic analysis is whether two sets of genomic intervals overlap significantly. This question arises, for example, in the interpretation of ChIP-Seq or RNA-Seq data. Because of complex genome organization, its answer is non-trivial.

The Genomic Association Tester (GAT) is a tool for computing the significance of overlap between multiple sets of genomic intervals. GAT estimates significance based on simulation and can take into account genome organization like isochores and correct for regions of low mapability.

GAT accepts as input standard genomic file formats and can be used in large scale analyses, comparing the association of multiple sets of genomic intervals simultaneously. Multiple testing is controlled using the false discovery rate.

In this manual, the Introduction covers the basic concepts of GAT. In order to get an idea of typical use cases, see the Tutorials section. The Usage instructions section contains a complete usage reference.

Contents

Introduction

A common question in genomic analysis is whether two sets of genomic intervals overlap significantly. This question arises, for example, in the interpretation of ChIP-Seq or RNA-Seq data.

The Genomic Association Tester (GAT) is a tool for computing the significance of overlap between multiple sets of genomic intervals. GAT estimates significance based on simulation.

This introduction covers the Method basics and describes the Sampling method. It introduces the concept of the Effective Genome and explains how to account for G+C bias

Method basics

Gat implemements a sampling algorithm. Given a chromosome (workspace) and segments of interest, for example from a ChIP-Seq experiment, gat creates randomized version of the segments of interest falling into the workspace. These sampled segments are then compared to existing genomic annotations.

The example below introduces the three sets of genomic intervals. The workspace, in this case a single chromosome, is grey. The segments of interest and sampled segments derived from it are red and light red, (workspace) and segments of interest, respectively. In this analysis, the annotations are the location of introns (green) and promotors (blue) and we use gat to test if the intervals in the ChIP-Seq experiment are enriched in promotors and/or introns.

_images/introduction_sampling.png

In the example above, five sets of sampled segments have been created. Usually, the number of samples would be much larger (>1000).

Based on the sampled segments, the observed overlap is contrasted with the expected overlap and an empirical p-value is determined. The sampled segments provide an expectation of the overlap between the segments of interest with a particular annotation. The distribution of the overlap is computed and the empirical p-value is defined as the number of samples that show an equal or greater overlap than the observed overlap.

_images/introduction_pvalues.png

In the previous example, the overlap with promotors is significant (right), while the overlap with introns (left) is not significantly different from the expectation. We would thus conclude, that our ChIP-Seq intervals are significantly enriched in promotors, but not enriched in introns. Testing for depletion works similarly.

Sampling method

The sampling method is conceptually simple. Randomized samples of the segments of interest are created in a two-step procedure.

Firstly, a segment size is selected from to same size distribution as the original segments of interest. Secondly, a random position is assigned to the segment. The sampling stops when exactly the same number of nucleotides have been sampled.

To improve the speed of sampling, segment overlap is not resolved until the very end of the sampling procedure. Conflicts are then resolved by randomly removing and re-sampling segments until a covering set has been achieved.

Because the size of randomized segments is derived from the observed segment size distribution of the segments of interest, the actual segment sizes in the sampled segments are usually not exactly identical to the ones in the segments of interest. This is in contrast to a sampling method that permutes segment positions within the workspace.

Effective genome

Not all regions of the genome are equally accessible to the segments of interest. For example, reads from a ChIP-Seq experiment will never be mapped to a region that is an assembly gap. Failing to account for inaccessible regions causes inflated estimates of fold change and statistical significance.

To illustrate this, assume that a particular set of segments of interest can only be present on chrX. If we included all chromosomes in the workspace, any overlap between the segments of interest and any annotations on chrX would be overestimated as the sampled segments are spread over both chrX and the autosomes.

Such a chromosomal bias is strong and obvious. Often the bias is much more distributed. For example, when comparing orthologous positions between two genomes in a comparative analysis, it is important that the workspace is restricted to only those regions where there is genomic alignment, as orthologous positions will not be found outside of regions that can be aligned.

In order to account for the effective genome, GAT limits the accessible space for simulated segments to a workspace, which can be restricted to exclude all regions not appropriate for the analysis. Randomly sampled segments will not fall outside the workspace:

_images/introduction_workspace.png

What regions need to be excluded from the analysis depends very much on the test being performed. Commonly excluded are for example assembly gaps and regions of low mapability for NGS experiments.

The definition of a workspace is crucial and often several workspace restrictions need to be combined. For example, in order to test if human long non-coding RNA that are shared with mouse overlap with certain chromatin marks, the workspace should be restricted to only those genomic regions in human that align in mouse and also be limited to regions of high mapability.

G+C bias

The human genome (together with many others) is not uniform. Best known is its division into regions of low and high G+C content (isochores). Other genomic properties correlate with these, such as gene density, substitution rates, etc. Plus, next-generation sequencing methods display their own G+C biases.

GAT can control for these biases by splitting the genomic region accessible for simulated segments (workspace) into smaller regions (isochores). Segments are sampled on a per-isochore basis, thus preserving any confounding effects due to different G+C content, before the overall enrichment is computed by combining results from all isochores.

_images/introduction_isochores.png

In the example above, G+C content correlates with the density of segments of interest. Regions of low G+C (orange) contain fewer segments than regions of high G+C (purple). Sampling within separate isochores preserves the difference in density.

The workspace can be divided into an arbitrary amount of different isochores. This is a general technique that can be used to control for different types of bias.

Installation

Requirements

GAT has been written in python and has been tested with the following python versions:

  • python 2.7.3
  • python 2.6.8

GAT requires the following modules to be installed at installation:

The plotting and unit test modules also require scipy and matplotlib.

Installing from PyPi

GAT is available at the python package index and can be installed using pip or setuptools.

To install via pip, type:

pip install gat

To install on OS X, we suggest to begin by installing homebrew by following these instructions

Follow then by:

brew install python --with-brewed-openssl
pip install numpy
pip install cython
pip install gat

Installing via source

The latest changes can be obtained by cloning the repository on github_:

git clone https://github.com/AndreasHeger/gat.git

To install, type:

python setup.py install

in the package directory.

Release History

1.2.2 Minor features
  • Added –random-seed as option.
  • moved documentation to read-the-docs.
1.2.1 Bugfix release:
  • added missing files requires.txt to tarball
1.2 Bugfix release:
  • Command line options renamed for CGAT compatibility
  • minor bugfixes
1.1 Bugfix release: easier Galaxy integration
  • Changed to distutils (from distribute)
  • Changed /bin/env to /usr/bin/env

1.0 Release coinciding with publication

0.2
  • First release
0.1
  • Alpha release

Tutorials

Please see below a collection of tutorials that introduce gat and how it can be used to answer a variety of questions in computational genomics.

Tutorial - Interval overlap

This tutorial demonstrates the usage of gat with a simple example - do the binding sites of a transcription factor overlap with DNAse hypersensitive sites?

This tutorial uses the SRF data set described in Valouev et al. (2008). The data sets used in this tutorial are available at:

http://www.cgat.org/~andreas/documentation/gat-examples/TutorialIntervalOverlap.tar.gz

The data is in srf.hg19.bed. This bed formatted file contains 556 high confidence peaks from the analysis of Valouev et al. (2008) mapped to human chromosome hg19.

This tutorial concentrates on obtaining the data required for a GAT analysis.

First analysis

gat requires three sets of intervals:

  1. a set of segments delineating the active part of the genome (workspace), and
  2. a set of segments of interest (tracks), and
  3. a set of segments with annotations.

gat accepts bed formatted files as input.

As segments of interest we will be using the srf.hgf19.bed containing the results of the ChIP-Seq experiment:

chr5    60627981        60628031        SRF.1
chr5    137801055       137801105       SRF.2
chr5    137800766       137800816       SRF.3
chr7    5570273 5570323 SRF.4
chr5    137827838       137827888       SRF.5
...

As our workspace we will for now use the bed formatted contigs.bed, which simply lists all chromosomes in hg19:

chr13   0       115169878       ws
chr12   0       133851895       ws
chr11   0       135006516       ws
chr10   0       135534747       ws
chr17   0       81195210        ws
...

The question we ask is whether within the genome, SRF binding events are in regions of open chromatin as identified by DNAse I hypersensitive sites.

There are many sources for bed files. Not having the data ourselves, we make use of data deposited by the ENCODE project within the UCSC genome browser.

To find the relevant data, we start by searching for the term Jurkat with track search. We find the page with a list of ENCODE datasets:

http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=wgEncodeUwDnase&hgsid=337398699

Selecting Jurkat, we can then open the table browser and download bed formatted coordinates directly:

_images/ucsc_tablebrowser.png

Alternative ways to obtain and manipulate bed-files are galaxy and bedtools.

We can now run gat by giving specifying the three input files:

gat-run.py
   --segments=srf.hg19.bed.gz
   --annotations=jurkat.hg19.dhs.bed.gz
   --workspace=contigs.bed.gz
   --ignore-segment-tracks
   --num-samples=1000 --log=gat.log > gat.tsv

The option –ignore-segment-tracks tells gat to ignore the fourth column in the tracks file and assume that all intervals in this file belong to the same track. If not given, each interval would be treated separately.

The above statement finishes in a few seconds. With large interval collections or many annotations, gat might take a while. It is thus good practice to always save the output in a file. The option –log tells gat to save information or warning messages into a separate log file.

The first 10 columns of the output file are the most informative:

track annotation observed expected CI95low CI95high stddev fold l2fold pvalue
merged tb_wgEncodeUwDnaseJurkatPkRep1 20183 246.5650 96.0000 444.0000 105.5933 81.5301 6.3493 1.0000e-03

The table states that we observe an overlap of 20,183 nucleotides, but would expect an overlap of 247 nucleotides, which is an 82 fold enrichment. This is highly significant (p-value of 0.001).

Note that the number of simulations determines the minimum P-value that can be reported. Here, we did 1,000 simulations, thus the minimum P-value we can obtain is 0.001. Usually a few simulations (100) are required to get a good idea about enrichment. For publication, more simulations are required (>10,000) to get a good idea of the statistical significance.

More samples increases memory and runtime requirements of GAT. The computation above took 11 seconds on our local system. Increasing the number of simulations to 10.000 increases the runtime to 103 seconds. Note how the runtime increases linearly with the number of samples.

Do the results make sense? Instead of the Jurkat cells, we can test for enrichment with DHS sites in hepatocytes.

We obtain a bed-file as before from UCSC and ENCODE (cell line hepg2) and save it as hepg2.hg19.dhs.bed.gz. Next, we run GAT with this file instead:

gat-run.py --segments=srf.hg19.bed.gz
           --annotations=hepg2.hg19.dhs.bed.gz
           --workspace=contigs.bed.gz
           --ignore-segment-tracks
           --num-samples=1000
           --log=gat-hepg-unique.tsv.log

GAT reports:

track annotation observed expected CI95low CI95high stddev fold l2fold pvalue
merged tb_wgEncodeUwDnaseHepg2HotspotsRep1 18965 597.1380 339.0000 883.0000 166.9945 31.7084 4.9868 1.0000e-03

Note how the fold enrichment is now less (32 fold), though still highly significant. This is the expected result, DHS sensitive sites are shared among different tissue types.

We can test this by comparing the two different DHS sets against each other:

gat-run.py --segments=hepg2.hg19.dhs.bed.gz
           --annotations=jurkat.hg19.dhs.bed.gz
           --workspace=contigs.bed.gz --ignore-segment-tracks --num-samples=1000 > dhs.tsv
track annotation observed expected CI95low CI95high stddev fold l2fold pvalue qvalue
merged tb_wgEncodeUwDnaseJurkatPkRep1 6163503 456928.2770 443565.0000 470129.0000 8119.7800 13.4890 3.7537 1.0000e-03 1.0000e-03

Indeed, the overlap between DHS sites is significant (pvalue = 0.001). We observe a 6.2Mb overlap, but expect only a 0.5Mb overlap. This is a 13-fold enrichment.

The runtime has increased from 11s to 308s. Apart from the number of samples, the number of segments in the segments of interest are a major determinant of the time it takes to complete a run. More information about memory and time requirement of GAT are in the section about GAT Performance.

Let us try removing all intervals from hepg2.hg19.dhs.bed.gz that overlap sites found in Jurkat cells using bedtools:

intersectBed -a hepg2.hg19.dhs.bed.gz -b jurkat.hg19.dhs.bed.gz -wa -v | gzip  > hepg2_unique.dhs.bed.gz

106,308 segments out of 144,172 remain.

Next, we re-run the GAT analysis:

gat-run.py --segments=srf.hg19.bed.gz --annotations=hepg2-unique.hg19.dhs.bed.gz --workspace=contigs.bed.gz  \
           --ignore-segment-tracks      \
            --num-samples=1000 --log=gat-hepg-unique.tsv.log
track annotation observed expected CI95low CI95high stddev fold l2fold pvalue qvalue
merged . 425 324.6790 143.0000 539.0000 117.8233 1.3080 0.3874 1.8500e-01 1.8500e-01

Now, we observe only a 30% enrichment and this is not significant (P-value 0.19). The observed overlap of 425 nucleotides is very close to the expected 325 nucleotides. We conclude, that the overlap of SRF between DHS sites in hepatocytes is due to those DHS sites that are shared between hepatocytes and Jurkat cells.

This example showed how GAT can be used in a very simple scenario to test if two genomic features are associated with each other. The following tutorials will introduce more complex usage, for example using the effective genome and testing multiple annotations simultaneously.

Tutorial - Genomic annotation

This tutorial demonstrates the usage of gat with a simple example - where does a transcription factor bind in the genome?

As opposed to Tutorial - Interval overlap we will not be looking at the overlap between one set of intervals with another, but at the overlap of one set of intervals with multiple others.

This tutorial uses the SRF data set described in Valouev et al. (2008). The data sets used in this tutorial are available at:

http://www.cgat.org/~andreas/documentation/gat-examples/TutorialGenomicAnnotation.tar.gz

The data is in srf.hg19.bed.gz. This bed formatted file contains 556 high confidence peaks from the analysis of Valouev et al. (2008) mapped to human chromosome hg19.

We build the analysis in multiple steps. First, we will perform a simple analysis, which will also motivate the use of gat. Later, we will build more sophisticated analyses that take into account the effective genome.

First analysis

gat requires three sets of intervals:

  1. a set of segments delineating the active part of the genome (workspace), and
  2. a set of segments of interest (tracks), and
  3. a set of segments with annotations.

gat accepts bed formatted files as input.

As segments of interest we will be using the srf.hgf19.bed.gz containing the results of the ChIP-Seq experiment:

chr5    60627981        60628031        SRF.1
chr5    137801055       137801105       SRF.2
chr5    137800766       137800816       SRF.3
chr7    5570273 5570323 SRF.4
chr5    137827838       137827888       SRF.5
...

As our workspace we will for now use the bed formatted contigs.bed, which simply lists all chromosomes in hg19:

chr13   0       115169878       ws
chr12   0       133851895       ws
chr11   0       135006516       ws
chr10   0       135534747       ws
chr17   0       81195210        ws
...

As our annotations file, we will use the annotations_geneset.bed.gz.

This file required a little more effort to build. We took all protein genes of Ensembl (release 67) and merged the exons of all transcripts of a gene. Based on these gene definitions we then divided genomic regions into intergenic, intronic and exonic regions. We also annotated the UTR (5’ and 3’), and the 5kb flank upstream and downstream. The result is a set of non-overlapping intervals covering the full genome:

...
chr1    362640  367639  5flank
chr1    367640  367658  UTR5
chr1    367659  368594  CDS
chr1    368595  368634  UTR3
chr1    368635  373634  3flank
chr1    373635  616058  intergenic
chr1    616059  621058  3flank
chr1    621059  621098  UTR3
chr1    621099  622034  CDS
chr1    622035  622053  UTR5
chr1    622054  627053  5flank
...

We can now run gat by giving specifying the three input files:

gat-run.py --ignore-segment-tracks --segments=srf.hg19.bed.gz
   --annotations=annotations_geneset.bed.gz --workspace=contigs.bed.gz
--num-samples=1000 --log=gat.log > gat.out

The option –ignore-segment-tracks tells gat to ignore the fourth column in the tracks file and assume that all intervals in this file belong to the same track. If not given, each interval would be treated separately.

The above statement finishes in a few seconds. With large interval collections or many annotations, gat might take a while. It is thus good practice to always save the output in a file. The option –log tells gat to save information or warning messages into a separate log file.

The first 11 columns of the output file are the most informative:

track annotation observed expected Ci95low CI95high stddev fold l2fold pvalue qvalue
merged intergenic 5800 14056.3300 13100.0000 15000.0000 583.7181 0.4126 -1.2771 1.0000e-03 1.5714e-03
merged intronic 8816 10633.8530 9665.0000 11602.0000 592.7589 0.8291 -0.2705 1.0000e-03 1.5714e-03
merged UTR3 233 278.0720 100.0000 493.0000 117.3112 0.8379 -0.2551 3.6500e-01 4.4611e-01
merged 3flank 800 659.6560 400.0000 1000.0000 175.0544 1.2128 0.2783 2.3100e-01 3.1762e-01
merged CDS 754 360.7680 161.0000 580.0000 127.2204 2.0900 1.0635 1.0000e-03 1.5714e-03
merged flank 1334 167.8620 50.0000 350.0000 91.4581 7.9470 2.9904 1.0000e-03 1.5714e-03
merged 5flank 6524 691.5400 400.0000 1000.0000 185.0053 9.4340 3.2379 1.0000e-03 1.5714e-03
merged UTR5 3441 87.0110 0.0000 200.0000 60.9119 39.5467 5.3055 1.0000e-03 1.5714e-03

The first two columns contain the name of the track and annotation that are being compared. The columns observed and expected give the observed and expected nucleotide overlap, respectively, between the track and annotation.

The following columns CI95low, CI95high, stddev give 95% confidence intervals and the standard deviation of the sample distribution, respectively.

The fold column is the fold enrichment or depletion and is computed as the ratio of observed over expected. The column l2fold is the log2 of this ratio.

The column pvalue gives the empirical p-value, i.e. in what proportion of samples was a higher enrichment or lower depletion found than the one that was observed.

The column qvalue lists a multiple testing corrected p-value. Setting a qvalue threshold and accepting only those comparisons with a qvalue below that threshold corresponds to controlling the false discovery rate at that particular level.

What does this table tell us? Looking at the column observed only, we see that most binding of SRF occurs in intronic and intergenic regions:

_images/genomic_annotation_piechart.png

Strictly speaking, this is a a naive analysis that does not require gat. The observed overlap alone does not tell us if the overlap we see is more or less than we expect. We do know that there are much more and larger intronic regions than there are UTRs, for example.

More instructive is to look at the enrichment within the various genomic regions, which is given by the fold change.

Here, we clearly see that SRF binds preferentially at transcription start sites (UTR5 and 5flank), while its binding is actually less than expected in introns and intergenic regions.

_images/genomic_annotation_foldchange.png

Binding distribution of SRF with respect to known protein coding genes. Plotted is the log2(fold change). Value not significant are transparent.

The effective genome

In the previous analysis we used the complete genome (3.1Gb) as the workspace. However, that is not realistic. For example, SRF will not be predicted in regions that are assembly gaps. Generally speaking, if the workspace is too large, fold enrichment values will be too optimistic.

To get a more accurate estimate of the enrichment in various regions, we should exclude assembly gaps.

The bed formatted file contigs_ungapped.bed.gz contains only those genomic regions that are not assembly gaps (2.86Gb). We can use this file instead:

gat-run.py --ignore-segment-tracks --segments=srf.hg19.bed.gz
   --annotations=annotations_geneset.bed.gz --workspace=contigs_ungapped.bed.gz
   --num-samples=1000 --log=gat.log > gat.out
annotation observed expected fold l2fold pvalue qvalue
intergenic 5800 13806.4540 0.4201 -1.2512 1.0000e-03 2.2000e-03
UTR3 233 303.6340 0.7674 -0.3820 2.5300e-01 3.9757e-01
intronic 8816 11473.2200 0.7684 -0.3801 1.0000e-03 2.2000e-03
3flank 800 713.4290 1.1213 0.1652 3.4000e-01 4.6750e-01
CDS 754 391.1840 1.9275 0.9467 5.0000e-03 9.1667e-03
flank 1334 182.0200 7.3289 2.8736 1.0000e-03 2.2000e-03
5flank 6524 761.1600 8.5711 3.0995 1.0000e-03 2.2000e-03
UTR5 3441 97.3670 35.3405 5.1433 1.0000e-03 2.2000e-03

The associated fold changes change, albeit not much. But have we done enough? The SRF intervals are the result of a ChIP-Seq experiment. Because these were short reads (25bp), not all can be unambiguously mapped to a unique genomic location. This again effectively removes some genomic regions from the analysis.

The bed formatted mapability_36.filtered.bed.gz contains all those genomic regions, that are uniquely mapable with reads of 24 bases. These regions have been derived from the UCSC mapability tracks and reduce the effective genome considerably (1.96Gb).

We could intersect the two bed files ourselves, but we can also supply multiple workspaces to gat. gat will automatically intersect multiple workspaces:

gat-run.py --ignore-segment-tracks --segments=srf.hg19.bed.gz
   --annotations=annotations_geneset.bed.gz
   --workspace=contigs_ungapped.bed.gz
   --workspace=mapability_36.filtered.bed.gz
   --num-samples=1000 --log=gat.log > gat.out

As a consequence of reducing the workspace the fold changes change:

annotation observed expected fold l2fold pvalue qvalue
intergenic 5800 12531.2490 0.4628 -1.1114 1.0000e-03 1.6000e-03
UTR3 233 385.1620 0.6049 -0.7251 1.1000e-01 1.2571e-01
intronic 8816 10942.7440 0.8056 -0.3118 1.0000e-03 1.6000e-03
3flank 800 625.3780 1.2792 0.3553 1.6500e-01 1.6500e-01
CDS 754 540.3700 1.3953 0.4806 8.2000e-02 1.0933e-01
flank 1334 166.6400 8.0053 3.0010 1.0000e-03 1.6000e-03
5flank 6524 638.2110 10.2223 3.3537 1.0000e-03 1.6000e-03
UTR5 3441 122.2010 28.1585 4.8155 1.0000e-03 1.6000e-03

Tutorial - Functional annnotation

This tutorial demonstrates the use of gat for functional annotation. The tutorial follows the analysis in the MacLean et al. (2010) paper introducing GREAT.

The data sets used in this tutorial are available at:

http://www.cgat.org/~andreas/documentation/gat-examples/TutorialFunctionalAnnotation.tar.gz

We are interested if the binding sites of SRF predicted from our ChIP-Seq experiment are enriched in the regulatory domains of genes of specific functions.

The data is in srf.hg19.bed. This bed formatted file contains 556 high confidence peaks from the analysis of Valouev et al. (2008) mapped to human chromosome hg19.

Functional annotation with GREAT

gat comes with a tool called gat-great that computes enrichment statistics using the binomial test implemented in GREAT.

To do an analysis as implemented in GREAT, we have prepared a bed formatted file (regulatory_domains.bed) with regulatory domains using GREAT’s basal+extension rule.

In GREAT’s definition, the regulatory domain of a gene contains a basal region and an optional extension. The basal region is defined as the region 5kb upstream and 1kb downstream of the transcription start site of a representative transcript. The basal region is then extended up to 1Mb in either direction but only up to the basal region of the closest domains.

In this example, we have used the transcription start sites of the ENSEMBL human protein coding gene set of Ensembl (release 67).

Each gene was replaced with GO terms associated with the gene obtained from the GO Gene Ontology definitions. GO terms were mapped via uniprot identifiers to genes and ancestral ontology terms were inferred.

GREAT excludes assembly gaps in the genome from the analysis. The bed formatted contigs_ungapped.bed contains all genomic regions exclusive of assembyl gaps.

We have now all data in place to run a GREAT analysis:

gat-great.py --verbose=5 \
             --log=great.log \
             --segments=srf.hg19.bed \
             --annotations=regulatory_domains.bed \
             --workspace=contigs_ungapped.bed \
             --ignore-segment-tracks \
             --qvalue-method=BH \
             --descriptions=go2description.tsv \
>& great.tsv

We also added a file go2description.tsv that contains a table with descriptions for GO identifiers.

We inserted the table into an SQL database for easy analysis. These are the top 10 results:

annotation pvalue observed expected description
GO:0015629 1.84357379006e-11 52 18.0074449104 “actin cytoskeleton”
GO:0032796 4.22734641877e-09 5 0.0557976925372 “uropod organization”
GO:0070688 1.05832288791e-07 7 0.358045266782 “MLL5-L complex”
GO:0043229 2.56771180099e-07 424 369.125027893 “intracellular organelle”
GO:0044424 3.38999777659e-07 457 406.64284565 “intracellular part”
GO:0043226 3.77332356747e-07 424 369.977837368 organelle
GO:0072303 1.28008163954e-06 3 0.0198635059219 “positive regulation of glomerular metanephric mesangial cell proliferation”
GO:0001725 1.96123184157e-06 12 2.08962341062 “stress fiber”
GO:0005634 2.30813240931e-06 295 240.687702571 nucleus
GO:0045214 2.78146278125e-06 11 1.79113154372 “sarcomere organization”
Validation against GREAT

We can check if the results are comparable to the GREAT server. We submitted our segments to GREAT, downloaded all the results into srf.great.all.tsv and loaded them into an SQL database. These are the top 10 results:

ID BinomP ObsRegions ExpRegions Desc
GO:0015629 3.064707e-11 51 17.68622 “actin cytoskeleton”
GO:0032796 4.223825e-09 5 0.05578831 “uropod organization”
GO:0045214 1.057722e-08 10 0.7800466 “sarcomere organization”
GO:0030863 2.205659e-08 16 2.666392 “cortical cytoskeleton”
GO:0001725 1.786459e-07 13 1.991518 “stress fiber”
GO:0044424 2.007378e-07 479 431.1647 “intracellular part”
GO:0070688 2.151319e-07 7 0.3981916 “MLL5-L complex”
GO:0005634 2.556219e-07 311 251.391 nucleus
GO:0032432 2.931657e-07 13 2.081847 “actin filament bundle”
GO:0043229 3.557439e-07 443 390.9065 “intracellular organelle”

The top 10 results are comparable. The same holds generally for fold change values:

_images/compare_great_fold.png

and pvalues:

_images/compare_great_pvalue.png

Some differences are to be expected:

  1. We use the ENSEMBL gene set, while GREAT uses Refseq.
  2. We use a different definition of representative transcripts.
  3. Our coordinates of alignment gaps might differ.
  4. The assignment of GO terms to genes differ.
  5. The implementations might differ in some details.
Functional annotation with gat

Gat can be run with the same input as we used for great:

gat-run.py --verbose=5 \
           --log=gatnormed.tsv.log \
            --segments=srf.hg19.bed \
            --annotations=regulatory_domains.bed \
            --workspace=contigs_ungapped.bed \
            --ignore-segment-tracks \
            --qvalue-method=BH \
            --descriptions=go2description.tsv \
            --pvalue-method=norm \
            >& gatnormed.tsv

Fold changes are very similar:

_images/compare_great_vs_gat_fold.png

but the p-value comparison shows interesting pattern:

_images/compare_great_vs_gat_pvalue.png

The pattern is explained easily. GREAT computes only the P-Value for enrichment, while GAT computes P-Value both for enrichment and depletion. Indeed, if we only plot p-values for annotations that are enriched, the values are comparable:

_images/compare_great_vs_gat_pvalue_only_enrichment.png

Note how the p-values are very well correlated above 10E-3:

_images/compare_great_vs_gat_pvalue_only_enrichment_truncated.png

Below a p-Value of 10E-3 the correlation breaks down. Unfortunately, the lowest empirical p-value is determined by the number of simulations performed. Adding more simulations will allow us to estimate lower p-values, but will also increase the runtime. A short-cut is to extrapolate from lower p-values by adding the option --pvalue-method=norm:

_images/compare_great_vs_gatnormed_pvalue_only_enrichment.png

The table with enriched categories is dominated by small categories with very little expected overlap leading to very large fold changes:

annotation pvalue observed expected fold description
GO:0043495 0.0 200 10.8 18.5185 “protein anchor”
GO:0070688 0.0 350 16.7 20.9581 “MLL5-L complex”
GO:0000212 0.0 200 9.25 21.6216 “meiotic spindle organization”
GO:0045896 0.0 150 4.65 32.2581 “regulation of transcription during mitosis”
GO:0045897 0.0 150 4.65 32.2581 “positive regulation of transcription during mitosis”
GO:0046022 0.0 150 4.65 32.2581 “positive regulation of transcription from RNA polymerase II promoter during mitosis”
GO:0046021 0.0 150 4.65 32.2581 “regulation of transcription from RNA polymerase II promoter, mitotic”
GO:0071895 0.0 100 2.7 37.037 “odontoblast differentiation”
GO:0032796 0.0 250 6.65 37.594 “uropod organization”
GO:0021593 0.0 100 2.65 37.7358 “rhombomere morphogenesis”

For interpretation of the results it is often advisable to remove annotations that are rare.

annotation pvalue observed expected fold description
GO:0015629 3.3307e-16 2600 932.458 2.7883 “actin cytoskeleton”
GO:0044425 5.5445e-11 9620 13468.369 0.7143 “membrane part”
GO:0016021 5.7537e-11 7870 11699.204 0.6727 “integral to membrane”
GO:0031224 1.2393e-10 8220 11999.454 0.685 “intrinsic to membrane”
GO:0016020 5.5517e-08 12970 16059.768 0.8076 membrane
GO:0004888 1.0775e-06 1000 2534.458 0.3946 “transmembrane signaling receptor activity”
GO:0030029 1.1332e-06 2450 1278.537 1.9163 “actin filament-based process”
GO:0030036 1.3077e-06 2300 1174.087 1.959 “actin cytoskeleton organization”
GO:0003779 3.2187e-06 1900 963.501 1.972 “actin binding”
GO:0005886 5.2155e-06 7720 10170.117 0.7591 “plasma membrane”

Currently, we estimate fold enrichment for categories within a workspace that excludes ungapped regions. As before (Tutorial - Interval overlap), a thorough analysis should also exclude regions of low mapability.

Usage instructions

This page describes basic and advanced usage of GAT.

A list of all command-line options is available via:

gat-run.py --help

Basic usage

The gat tool is controlled via the gat-run.py script. This script requires the following input:

  1. A set of intervals S with segments of interest to test.
  2. A set of intervals A with annotations to test against.
  3. A set of intervals W describing a workspace

GAT requires bed formatted files. In its simplest form, GAT is then run as:

gat-run.py
   --segment-file=segments.bed.gz
   --workspace-file=workspace.bed.gz
   --annotation-file=annotations.bed.gz

The script recognizes gzip compressed files by the suffix .gz.

The principal output is a tab-separated table of pairwise comparisons between each segments of interest and annotations. The table will be written to stdout, unless the option --stdout is given with a filename to which output should be redirected.

The main columns in the table are:

track
the segments of interest track
annotation
the annotations track
observed
the observed count
expected
the expected count based on the sampled segments
CI95low
the value at the 5% percentile of samples
CI95high
the value at the 95% percentile of samples
stddev
the standard deviation of samples
fold
the fold enrichment, given by the ratio observed / expected
l2fold
log2 of the fold enrichment value
pvalue
the p-value of enrichment/depletion
qvalue
a multiple-testing corrected p-value. See multiple testing correction.
Additionally, there are the following columns:
track_nsegments
number of segments in track in segments of interest
track_size
number of residues in covered by track in segments of interest within the workspace
track_density
fraction of residues in track in segments of interest within the workspace
annotation_nsegments
number of segments in track in annotations.
annotation_size
number of residues in covered by track in annotations within the workspace
annotation_density
number of residues in covered by track in annotations within the workspace
overlap_nsegments
number of segments in overlapping between segments of interest and annotations
overlap_size
number of nucleotides overlapping between segments of interest and annotations
overlap_density
fraction of residues overlapping between segments of interest and annotations within workspace
percent_overlap_nsegments_track
percentage of segments in segments of interest overlapping annotations
percent_overlap_size_track
percentage of nucleotides in segments of interest overlapping annotations
percent_overlap_nsegments_annotation
percentage of segments in annotations overlapping segments of interest
percent_overlap_size_annotation
percentage of nucleotides in annotations overlapping segments of interest
description
additional description of track (requires --descriptions to be set).

Further output files such as auxiliary summary statistics go to files named according to --filename-output-pattern. The argument to filename-output-pattern should contain one %s placeholder, which is then substituted with section names.

Count here denotes the measure of association and defaults to number of overlapping nucleotides.

Advanced Usage

Submitting multiple files

All of the options –segment-file, –workspace-file, –annotation-file can be used several times on the command line. What happens with multiple files depends on the file type:

  1. Multiple –segment-file entries are added to the list of segments of interest to test with.
  2. Multiple –annotation-file entries are added to the list of annotations to test against.
  3. Multiple –workspace entries are intersected to create a single workspace.

Generally, gat will test m segments of interest lists against n annotations lists in all m * n combinations.

Within a bed formatted file, different tracks can be separated using a UCSC formatted track line, such as this:

track name="segmentset1"
chr1  23     100
chr3  50     2000
track name="segmentset2"
chr1  1000   2000
chr3  4000   5000

or alternatively, using the fourth column in a bed formatted file:

chr1 23      100     segmentset1
chr3 50      2000    segmentset1
chr1 1000    2000    segmentset2
chr3 4000    5000    segmentset2

The latter takes precedence. The option –ignore-segment-tracks` forces gat to ignore the fourth column and consider all intervals to be from a single interval set.

Note

Be careful with bed-files where each interval gets a unique identifier. Gat will interprete each interval as a separate segment set to read. This is usually not intended and causes gat to require a very large amount of memory. (see the option --ignore-segment-tracks

By default, tracks can not be split over multiple files. The option --enable-split-tracks permits this.

Adding isochores

Isochores are genomic segments with common properties that are potentially correlated with the segments of interest and the annotations, but the correlation is not of interest here. For example, consider a CHiP-Seq experiment and the testing if CHiP-Seq intervals are close to genes. G+C rich regions in the genome are gene rich, while at the same time there is possibly a nucleotide composition bias in the CHiP-Seq protocol depleting A+T rich sequence. An association between genes and CHiP-Seq intervals might simply be due to the G+C effect. Using isochores can control for this effect to some extent.

Isochores split the workspace into smaller workspaces of similar properties, so called isochore workspaces. Simulations are performed for each isochore workspaces separately. At the end, results for each all isochore workspaces are aggregated.

In order to add isochores, use the –isochore-file command line option.

Choosing measures of association

Counters describe the measure of association that is tested. Counters are selected with the command line option --counter. Available counters are:

  1. nucleotide-overlap: number of bases overlapping [default]
  2. segment-overlap: number of intervals intervals in the segments of interest overlapping annotations. A single base-pair overlap is sufficient.
  3. segment-mid-overlap: number of intervals in the segments of interest overlapping at their midpoint annotations.
  4. annotations-overlap: number of intervals in the annotations overlapping segments of interest. A single base-pair overlap is sufficient.
  5. segment-mid-overlap: number of intervals in the annotations overlapping at their midpoint segments of interest

Multiple counters can be given. If only one counter is provided, the output will be to stdout. Otherwise, separate output files will be created each counter. The filename can be controlled with the --output-table-pattern option.

Changing the PValue method

By default, gat returns the empirical p-value based on the sampling procedure. The minimum p-value is 1 / number of samples.

Sometimes the lower bound on p-values causes methods that estimate the FDR to fail as the distribution of p-values is atypical. In order to estimate lower pvalues, the number of samples needs to be increased. Unfortunately, the run-time of gat is directly proportional to the number of samples.

A solution is to set the option --pvalue-method to --pvalue-method=norm. In that case, pvalues are estimated by fitting a normal distribution to the samples. Small p-values are obtained by extrapolating from this fit.

Multiple testing correction

gat provides several methods for controlling the false discovery rate. The default is to use the Benjamini-Hochberg procedure. Different methods can be chosen with the --qvalue-method option.

--qvalue-method=storey uses the procedure by Storey et al. (2002) to compute a q-value for each pairwise comparison. The implementation is in its functionality equivalent to the qvalue package implemented in R.

Other options are equivalent to the methods as implemented in the R function p.adjust.

Caching sampling results

gat can save and retrieve samples from a cache --cache=cache_filename. If cache_filename does not exist, samples will be saved to the cache after computation. If cache_filename does already exist, samples will be retrieved from the cache instead of being re-computed. Using cached samples is useful when trying different counters (see Choosing measures of association).

If the option --counts-file is given, gat will skip the sampling and counting step completely and read observed counts from --count-file=counts_filename.

Using multiple CPU/cores

GAT can make use of several available CPU/cores if available. Use the --num-threads=# option in order to specify how many CPU/cores GAT will make use of. The default --num-threads=0 means that GAT will not use any multiprocessing.

Outputting intermediate results

A variety of options govern the output of intermediate results by gat. These options usually accept patterns that represent filenames with a %s as a wild card character. The wild card is replaced with various keys. Note that the amount of data output can be substantial.

--output-counts-pattern
output counts. One file is created for each counter. Counts output files are required for gat-compare.
--output-plots-pattern
create plots (requires matplotlib). One plot for each annotation is created showing the distribution of expected counts and the observed count. Also, outputs the distribution of p-values and q-values.
--output-samples-pattern
output bed formatted files with individual samples.

Other tools

gat-compare

The gat-compare tool can be used to test if the fold changes found in two or more different gat experiments are significantly different from each other.

This tool requires the output files with counts created using the --output-counts-pattern option.

For example, to compare if fold changes are signficantly different between two cell lines, execute:

gat-run.py --segments=CD4.bed.gz <...>
--output-counts-pattern=CD4.%s.overlap.counts.tsv.gz
gat-run.py --segments=CD14.bed.gz <...>
--output-counts-pattern=CD14.%s.overlap.counts.tsv.gz

gat-compare.py CD4.nucleotide-overlap.counts.tsv.gz CD14.nucleotide-overlap.counts.tsv.gz
gat-plot

Plot gat results.

gat-great

Perform a GREAT analysis:

gat-great.py
   --segment-file=segments.bed.gz
   --workspace-file=workspace.bed.gz
   --annotation-file=annotations.bed.gz

Interpreting GAT results

Fold change

Gat reports fold changes. Fold change is simply expressed as a ratio of an observed metric compared to the expected value of the metric based on randomizations.

Fold changes of a single set of segments of interest against various annotations can be compared directly. Indeed, the primary objective of randomization is to remove the differences between number and segment sizes of different annotations in order to make them comparable. For example, the fold change of a set of transcription factor sites in promotors can directly be contrasted with the fold change of the same sites in introns.

When comparing the fold enrichment of multiple sets of segments of interest against the same annotation, some caution must be exercised. Here, the question is to compare the fold change of sites of transcription factor A in promotors with the fold change of sites of transcription factor B in promotors.

The difference becomes apparent when there is no observed overlap between the segments of interest and the annotations to be compared. In order to avoid division by 0, gat adds a pseudocount of 1 to observed and expected values: fc = (observed + 1) / (expected + 1). With no observed overlap, this becomes fc = (1 / expected + 1 ). The amount of expected overlap correlates with the number and size of segments in the segments of interest. If there are more sites for A than for B, the expectation of overlap is larger for A than for B. Thus, even if there is no overlap in both cases, the fold change values reported will be different. In the case of multiple annotations of different sizes, the annotations with no overlap for both A and B will lie along a straight line.

There is no such bias if there is overlap between the segments of interest and the annotations. As both the observed and the expected overlap depend upon the segment size and number of segments of the segments of interest, the effect cancels itself out.

The plot below shows the log2 fold change values for the same annotations between two sets of segments of interest (CD4 and CD14).

_images/fold_change1.png

The clouds on the upper left and lower right correspond to annotations which have no overlap with CD4 but with CD14, and vice versa. The cloud around the origin of the plot shows fold changes where both overlap for CD4 and CD14 is observed. There is no bias.

The bias can be corrected by applying a constant factor that reflects the difference in the segment sizes between the two segments of interest.

Note that the pseudo-count method works well when comparing fold depletion within a single set segments of interest. Here, the intuition is that no overlap with a larger set of annotations should give higher fold depletion than if there is no overlap with a small set of annotations.

It is possible to swap the segments of interest with the annotations. However, there is a down-side. The time consuming step in gat is the randomization of the segments of interest. Thus it is benefical to test few segments of interest against many annotations. When swopped, each set in the annotations will be randomized separately and compared to a single set of segments of interest.

Note that a P-value can be prone to misinterpretation. In particular, the P-value only indicates if an observed overlap is statistically significant different from the expectation. The P-value makes no inferences about the size of the effect and if it is biologically consequential. In particular, with increasing sample size, the expectation can be measured with higher accuracy leading to smaller differences to be detectable.

Effect size

The effect size is a measure of the strength of a phenomenom. In the context of gat a useful measure of the effect size makes use of the number of segments explained by an overlap between the segments of interest and a particular annotation.

For example, an association between transcription factor binding sites and a particular annotation is likely to be more believable if it explains 20% of all transcription factor binding sites, compared to one which explains only 1% of all transcription factor binding sites. Similarly, an association that accounts for 80% of all annotations is likely to more effectual than one that accounts for only 1% of annotations.

Difference between fold changes

Gat computes assigns a statistical signficance to a fold change. The P-value, adjusted for multiple testing, reports the chance of observing the same or more extreme fold change given a neutral model. When using multiple annotations, it is thus meaningful to report the annotations for which statistically significant enrichment was found.

However, the P-value does not permit to make any inferences about the difference between two fold changes. For example, we might find that a transcription factor is 2-fold statistically enriched in promotors and 3-fold in UTRs, but we can not say with statistical confidence that the enrichment in UTRs is larger than in promotors.

Similarly, we might find that transcription factor A is 2-fold statistically enriched in promotors and transcription factor B is 3-fold enriched in promotors. Again, we can not say with statistical confidence that there is a difference in promotor binding between the two transcription factors.

The latter case, in which we have two different segments of interest and we want to contrast their fold enrichment against the same annotations is implemented in gat using the gat-compare.py command.

Briefly, the gat-compare.py command takes the observed and simulated counts from one or more gat-run.py runs. For each sample, it computes a fold change ratio as rfc = fc1 / fc2 providing a distribution of expected fold change ratios. The tool checks if the null hypothesis of no fold change difference (rfc = 1) can be rejected.

Conceptually, this works by running two simulations in-sync, one for trancription factor A and one for transcription factor B. At each iteration, the fold change is computed for each annotation between the sampled segments and the observed segments. At the same time, the fold change difference in the fold change between set A and set B is recorded. Thus, at the end of the simulations gat gets a distribution of expected fold change differences between samples and computes an empirical P-Value for the observed fold change difference between A and B.

Performance

Memory usage

GAT is limited by the amount of memory available. Factors affecting GAT’s memory usage are:

  • the number of segments. Memory requirements grow linearly with the number of sets and the number of intervals in the sets segments of interest, annotations and workspace. The total number of segments
  • the number of samples. For each sample, statistics on nucleotide overlap are being kept. Thus memory requirements grow linearly with the number of samples.

If memory consumption is problematic, the following might help:

Memory consumption is usually not problematic and for typical sets reaches a few Gigabytes. If memory consumption explodes it is usually a problem with the input adat.

GAT has not been optimized for memory usage. If memory consumption is a problem, please contact the developers.

Runtime performance

As with memory usage, run-time performance of GAT is linearly related to the number of segments and the number of samples.

  • the number of segments in the segments of interest. More segments require more sampling steps. Usually, an input size with twice as many segments will take twice as long. Runtime behaviour might be worse in extreme cases where the sampler has difficulties placing the last segments, for example if the segment density is high.
  • the number of samples. Each additional sample will require an additional amount of time.
  • the number of segments in the annotations. Usually less of a factor, but it becomes a factor if overlap statisticts need to be computed for many different annotations. In this case, generating a single sample might be quicker than computing overlap statisticts of that sample with

As runtime performance is a linear function of most variables, runtime can be reduced by using multiple CPUs or cores (see Using multiple CPU/cores).

Sampling performance

In order to check for biases in the sampling procedure, we run automated tests to check for even coverage of nucleotides by the sampler and absence of edge effects.

Background

History

This module has been inspired by the TheAnnotator tool first used in the analysis by Ponjavic et al (2007) by Gerton Lunter and early work of Caleb Webber.

The differences are:

  • permit more measures of association. The original Annotator used nucleotide overlap, but other measures might be useful like number of elements overlapping by at least x nucleotides, proximity to closest element, etc.
  • easier user interface and using standard formats.
  • permit incremental runs. Annotations can be added without recomputing the samples.
  • faster.

Comparison to other methods

Testing for the association between genomic features is a field of long-standing interest in genomics and has gained considerable traction with the publication of large scale genomic data sets such as the ENCODE data.

Generally we believe that the problem of testing for association has not been fully resolved and advise every genomicist to apply several methods. The list of tools/services below is not exhaustive:

GREAT (MacLean et al. (2010)) uses a binomial test to test if transcription factor binding sites are associated with regulatory domains of genes. GREAT has a convenient web interface with many annotations pre-loaded. Compared to GREAT, GAT can measure depletion and can

GenometriCorr (Favorov et al. (2012)) compute a variety of distance metrics when comparing two interval sets and then apply a set of statistical tests to measure association. GenomicCorr is a good exploratory tool to generate hypotheses about the relationships of two genomic sets of intervals. Compared to GenometriCorr, GAT can simulate more realistic genomic scenarios, for example, segments might not occur in certain regions (due to mapping problems) or occur at reduced frequency (G+C biases).

The GSC (The Encode Project Consortium (2012)) metric (for Genome Structure Correlation) is inspired by the analysis of approximately piecewise stationary time series . The GSC metric estimates the significance of an association metric by estimating the random expectation of the association metric using randomly chosen intervals on the genome. This expectation is then used to test if the observed value of the metric (nucleotide overlap, region overlap, ...) is higher than expected. The method is described in the supplemental details of the first and recent ENCODE papers (Birney et al. (2007)_, ...) and here.

BITS (Layer et al. (2013)) (Binary Interval Search) is a method to perform quick overlap queries between genomic data sets. It implements a Monte-Carlo method for simulation that is particularly suited towards making all on all comparisons between a large number data sets.

Benchmark

We used the example from Tutorial - Interval overlap to perform a rough comparison between various methods. In all cases, we used n = 1000 for simulations. Times are wall-clock times. Please note that this is not a rigorous benchmark.

Method Set1 Set2 Observed Expected P-value Time
BITS srf jurkat 450 5.24 <0.001 43s
BITS srf hepg2 381 9.87 <0.001 39s
BITS srf hepg2/jurkat 9 5.7 0.13 28s
BITS jurkat hepg2 47237 3548 <0.001 106s
GSC srf jurkat     0.0004 58s
GSC srf hepg2     6.9E-11 54s
GSC srf hepg2/jurkat     5.23E-7 40s
GSC jurkat hepg2     0 159s
GAT srf jurkat 20183 247.6 <0.001 11s
GAT srf hepg2 18965 601.4 <0.001 11s
GAT srf hepg2/jurkat 425 327.3 0.21 11s
GAT jurkat hepg2 6163503 457332.8 <0.001 316s

BITS and GAT are fairly comparable, even though they use different metrics for the association (number of segments overlapping versus number of nucleotides overlapping). GAT is quicker on smaller data sets, while BITS outperforms on large datasets.

GSC reports a significant association in the comparison between srf and dhs intervals specific to hepg2 cells, while the other two tools do not, which is the biologically plausible result. It is difficult to say if there indeed is an association, or GSC is overestimating association.

Glossary

Interval
a (genomic) segment with a chromosomal location (contig,start and end).
Intervals
a set of one or more intervals.
workspace
the genomic regions accessible for simulation.
annotations
sets of intervals annotating various regions of the genome.
segments of interest
sets of intervals whose association is tested with annotations
sampled segments
randomized versions of the segments of interest. A set of sampled segments is one sample.
sample
one set of sampled segments.
isochore
Usually, isochores are defined as genomic regions of homogeneous G+C content. In GAT, isochores can mean any genomic regions with a shared property. Isochores are used to eliminate the effect of confounders in an analysis.
isochores
plural of isochore.
isochore workspaces
the workspace split according to isochores.
bed
an interval format. Intervals are in a tab-separated list denoted as contig, start, end. A bed file can contain several tracks which will be treated independently. Tracks are either delineated by a line of the format: track name=<trackname> or by an optional fourth column (field name). See UCSC for more information about bed files.
fold
fold change, computed as the ratio of observed over expected
p-value
significance of association. The P-value is an estimate of the probability to obtain an observed (or larger) overlap between two segment sets by chance.
q-value
multiple testing corrected p-value. The qvalue can either be computed using the q-value procedure suggested by Storey et al. (2002) or using other FDR approaches such as Bonferroni, Benjamini-Hochberg, etc. implemented in the R function p.adjust.
pvalue
the column containing the p-value.
qvalue
the column containing the q-value.
track
a set of segments within a bed formatted file. Tracks are either grouped using the track name=<trackname> prefix or by an optional fourth column (the name column).
annotation
a genomic annotation, which is a collection of intervals.
observed
the observed value of a metric.
expected
the expected value of a metric as determined by simulations.
fold
the fold change, defined as the ratio of observed over expected.
l2fold
the logarithm (base 2) of fold change.
tracks
plural of track
samples
a set of samples (see sample)

Release Notes

1.3.5:
  • Bugfixes - do not delete during container iteration
1.3.4:
  • Bugifxes to (hopefully) complete python 2/3 compatibility.

1.3.0:

  • GAT now works under python 2 and python 3
  • #3: gat-compare iterates now over shared tracks, not just the ‘merged’ track.
  • #2: do not split isochore in chrom.iso if isochore is ”.”

Developers’ notes

The following section contains notes for developers.

Notes

Sampling strategies

Sampling creates a new set of interval P. There are several different strategies possible.

Annotator strategy

In the original Annotator strategy, samples are created in a two step procedure. First, the length of a sample segment is chosen randomly from the empirical segment length distribution. Then, a random coordinate is chosen. If the sampled segment does not overlap with the workspace it is rejected.

Before adding the segment to the sampled set, it is truncated to the workspace.

If it overlaps with a previously sampled segment, the segments are merged. Thus, bases shared between two segments are not counted twice.

Sampling continues, until exactly the same number of bases overlap between the P and the W as do between S and W.

Note that the length distribution of the intervals in P might be different from S.

The length is sampled from the empirical histogram of segment sizes. The granularity of the histogram can be controlled by the options -bucket-size and --nbuckets. The largest segment should be smaller than bucket-size * nbuckets. Increase either if you have large segments in your data set, but smaller values of nbuckets are quicker.

This method is quick if the workspace is large. If it is small, a large number of samples will be rejected and the procedure slows down.

This sampling method is compatible with both distance and overlap based measures of associaton.

Workspaces and isochores

Workspaces limit the genomic space available for segments and annotations. Isochores split a workspace into smaller parts that permit to control for confounding variables like G+C content.

The simplest workspace is the full genome. For some analyses it might be better to limit to analysis to autosomes.

Examples for the use of isochores could be to analyze chromosomes or chromosomal arms separately.

If isochores are used, the length distribution and nucleotide overlaps are counted per isochore ensuring that the same number of nucleotides overlap each isochore in P and S and the length distributions per isochore are comparable.

Empirical length distribution

The empirical length distribution is created from all intervals in S. The full segment length is chosen even if there is partial overlap. Optionally, the segment can be truncated. From Gerton:

What is the best choice depends on the data. Not truncating can lead
to a biased length distribution if it is expected that segments that
only partially overlap the workspace have very different lengths. However,
truncating can lead to spurious short segments.

Benchmarking

This section contains quality control plots from the unit testing.

Position sampling

The following section looks at the position sampling algorithms.

Position sampling

The TestSamplerPosition tests if the position sampling works as expected. In particular, look out for edge effects.

../test/test_testMultipleWorkspaces__main__.TestSamplerPosition.png

Multiple work spaces. 10 workspaces of size 100, spaced every 1000 nucleotides

../test/test_testPositionSamplingSingleWorkspace__main__.TestSamplerPosition.png

A single work space.

../test/test_testPositionSamplingSplitWorkspace__main__.TestSamplerPosition.png

A workspace split in the middle without a gap.

../test/test_testPositionSamplingSplitWorkspace2__main__.TestSamplerPosition.png

A workspace split in the middle with a gap in between.

../test/test_testSNPPositionSampling__main__.TestSamplerPosition.png

10 workspaces of size 100, segment size of 1 (SNP).

Segment sampling algorithms

The following plots benchmark the segment sampling behaviour of the various segment sampling algorithms implemented in GAT.

Statistics

For 1-sized fragments (i.e. SNPs), the statistics can be checked against a hypergeometric distribution (sampling without replacement). All the tests below use a single continuous workspace of 1000 nucleotides seeded with a varying number of SNPs.

../test/test_testSingleSNP.TestStatsSNPSampling.png

Test with a single SNP. Here, there are no issues with multiple hits. The workspace contains a single annotation of increasing size (1,3,5,...,99)

../test/test_testMultipleSNPsFullOverlap.TestStatsSNPSampling.png

In this test, 10 SNPs are in the segment list. The workspace contains a single annotation of size (10, 15, ..., 105). All SNPs overlap the annotated part of the workspace and hence all results are highly signficant.

../test/test_testMultipleSNPsPartialOverlap.TestStatsSNPSampling.png

In this test, 10 SNPs are in the segment list. The workspace contains a single annotation. Annotations are all of size 10, but the overlap of SNPs with annotations varies from 0 to 10.

Statistics

Gat
SNPs

For 1-sized fragments (i.e. SNPs), the statistics can be checked against a hypergeometric distribution (sampling without replacement). All the tests below use a single continuous workspace of 1000 nucleotides seeded with a varying number of SNPs.

../test/testSingleSNP.TestStatsGat.png

Test with a single SNP. Here, there are no issues with multiple hits. The workspace contains a single annotation of increasing size (1,3,5,...,99)

../test/testMultipleSNPsFullOverlap.TestStatsGat.png

In this test, 10 SNPs are in the segment list. The workspace contains a single annotation of size (10, 15, ..., 105). All SNPs overlap the annotated part of the workspace and hence all results are highly signficant.

../test/testMultipleSNPsPartialOverlap.TestStatsGat.png

In this test, 10 SNPs are in the segment list. The workspace contains a single annotation. Annotations are all of size 10, but the overlap of SNPs with annotations varies from 0 to 10.

../test/testWorkspaces.TestStatsGat.png

workspace = 500 segments of size 1000, separated by a gap of 1000 annotations = 500 segments of size 1000, separated by a gap of 1000, shifted up 100 bases segments = a SNP every 100 bp

Intervals
../test/testIntervalsPartialOverlap.TestStatsGat.png

In this test, there is one segment of size 100. Annotations are of size 100 with decreasing overlap.

Annotator
SNPs

For 1-sized fragments (i.e. SNPs), the statistics can be checked against a hypergeometric distribution (sampling without replacement). All the tests below use a single continuous workspace of 1000 nucleotides seeded with a varying number of SNPs.

../test/testSingleSNP.TestStatsTheAnnotator.png

Test with a single SNP. Here, there are no issues with multiple hits. The workspace contains a single annotation of increasing size (1,3,5,...,99)

../test/testMultipleSNPsFullOverlap.TestStatsTheAnnotator.png

In this test, 10 SNPs are in the segment list. The workspace contains a single annotation of size (10, 15, ..., 105). All SNPs overlap the annotated part of the workspace and hence all results are highly signficant.

../test/testMultipleSNPsPartialOverlap.TestStatsTheAnnotator.png

In this test, 10 SNPs are in the segment list. The workspace contains a single annotation. Annotations are all of size 10, but the overlap of SNPs with annotations varies from 0 to 10.

../test/testWorkspaces.TestStatsTheAnnotator.png

workspace = 500 segments of size 1000, separated by a gap of 1000 annotations = 500 segments of size 1000, separated by a gap of 1000, shifted up 100 bases segments = a SNP every 100 bp

Intervals
../test/testIntervalsPartialOverlap.TestStatsTheAnnotator.png

In this test, 10 SNPs are in the segment list. The workspace contains a single annotation. Annotations are all of size 10, but the overlap of SNPs with annotations varies from 0 to 10.

Simulation algorithms

Gat implements a variety of simulation algorithms. Not all of them are of practical use.

Indices and tables

Read the Docs v: latest
Versions
latest
stable
Downloads
pdf
htmlzip
epub
On Read the Docs
Project Home
Builds

Free document hosting provided by Read the Docs.
PKwK])V77gat-latest/index.html GAT - Genomic Association Tester — gat 1.0 documentation

GAT - Genomic association tester

The genomic association tester (GAT) tests for association between sets of genomic intervals.

GAT tests if two sets of genomic intervals are associated more than expected by chance. Assocation typically means nucleotide overlap, but other measures such as the distance between elements or the number of overlapping segments can be used.

The tests are performed by simulation within a genomic context. The simulation part takes care of segment length distribution. The genomic context takes into account chromosomal location and optionally isochores.

Documentation

Get gat

Gat is available as an easy-installable package on the Python Package Index.

The code can be found at github.

Read the Docs v: latest
Versions
latest
stable
Downloads
pdf
htmlzip
epub
On Read the Docs
Project Home
Builds

Free document hosting provided by Read the Docs.
PKvK>ugat-latest/_static/up.pngPNG  IHDR7IDATx@ez $& 8:& :Kpwn}O<:!!{G@Dz?"̧ S{g<ݢ lMQwy|? 0 pq8q` pL-'SBNAwTń|U VIENDB`PKvKS^[WW"gat-latest/_static/jquery-3.1.0.js/*eslint-disable no-unused-vars*/ /*! * jQuery JavaScript Library v3.1.0 * https://jquery.com/ * * Includes Sizzle.js * https://sizzlejs.com/ * * Copyright jQuery Foundation and other contributors * Released under the MIT license * https://jquery.org/license * * Date: 2016-07-07T21:44Z */ ( function( global, factory ) { "use strict"; if ( typeof module === "object" && typeof module.exports === "object" ) { // For CommonJS and CommonJS-like environments where a proper `window` // is present, execute the factory and get jQuery. // For environments that do not have a `window` with a `document` // (such as Node.js), expose a factory as module.exports. // This accentuates the need for the creation of a real `window`. // e.g. var jQuery = require("jquery")(window); // See ticket #14549 for more info. module.exports = global.document ? factory( global, true ) : function( w ) { if ( !w.document ) { throw new Error( "jQuery requires a window with a document" ); } return factory( w ); }; } else { factory( global ); } // Pass this if window is not defined yet } )( typeof window !== "undefined" ? window : this, function( window, noGlobal ) { // Edge <= 12 - 13+, Firefox <=18 - 45+, IE 10 - 11, Safari 5.1 - 9+, iOS 6 - 9.1 // throw exceptions when non-strict code (e.g., ASP.NET 4.5) accesses strict mode // arguments.callee.caller (trac-13335). But as of jQuery 3.0 (2016), strict mode should be common // enough that all such attempts are guarded in a try block. "use strict"; var arr = []; var document = window.document; var getProto = Object.getPrototypeOf; var slice = arr.slice; var concat = arr.concat; var push = arr.push; var indexOf = arr.indexOf; var class2type = {}; var toString = class2type.toString; var hasOwn = class2type.hasOwnProperty; var fnToString = hasOwn.toString; var ObjectFunctionString = fnToString.call( Object ); var support = {}; function DOMEval( code, doc ) { doc = doc || document; var script = doc.createElement( "script" ); script.text = code; doc.head.appendChild( script ).parentNode.removeChild( script ); } /* global Symbol */ // Defining this global in .eslintrc would create a danger of using the global // unguarded in another place, it seems safer to define global only for this module var version = "3.1.0", // Define a local copy of jQuery jQuery = function( selector, context ) { // The jQuery object is actually just the init constructor 'enhanced' // Need init if jQuery is called (just allow error to be thrown if not included) return new jQuery.fn.init( selector, context ); }, // Support: Android <=4.0 only // Make sure we trim BOM and NBSP rtrim = /^[\s\uFEFF\xA0]+|[\s\uFEFF\xA0]+$/g, // Matches dashed string for camelizing rmsPrefix = /^-ms-/, rdashAlpha = /-([a-z])/g, // Used by jQuery.camelCase as callback to replace() fcamelCase = function( all, letter ) { return letter.toUpperCase(); }; jQuery.fn = jQuery.prototype = { // The current version of jQuery being used jquery: version, constructor: jQuery, // The default length of a jQuery object is 0 length: 0, toArray: function() { return slice.call( this ); }, // Get the Nth element in the matched element set OR // Get the whole matched element set as a clean array get: function( num ) { return num != null ? // Return just the one element from the set ( num < 0 ? this[ num + this.length ] : this[ num ] ) : // Return all the elements in a clean array slice.call( this ); }, // Take an array of elements and push it onto the stack // (returning the new matched element set) pushStack: function( elems ) { // Build a new jQuery matched element set var ret = jQuery.merge( this.constructor(), elems ); // Add the old object onto the stack (as a reference) ret.prevObject = this; // Return the newly-formed element set return ret; }, // Execute a callback for every element in the matched set. each: function( callback ) { return jQuery.each( this, callback ); }, map: function( callback ) { return this.pushStack( jQuery.map( this, function( elem, i ) { return callback.call( elem, i, elem ); } ) ); }, slice: function() { return this.pushStack( slice.apply( this, arguments ) ); }, first: function() { return this.eq( 0 ); }, last: function() { return this.eq( -1 ); }, eq: function( i ) { var len = this.length, j = +i + ( i < 0 ? len : 0 ); return this.pushStack( j >= 0 && j < len ? [ this[ j ] ] : [] ); }, end: function() { return this.prevObject || this.constructor(); }, // For internal use only. // Behaves like an Array's method, not like a jQuery method. push: push, sort: arr.sort, splice: arr.splice }; jQuery.extend = jQuery.fn.extend = function() { var options, name, src, copy, copyIsArray, clone, target = arguments[ 0 ] || {}, i = 1, length = arguments.length, deep = false; // Handle a deep copy situation if ( typeof target === "boolean" ) { deep = target; // Skip the boolean and the target target = arguments[ i ] || {}; i++; } // Handle case when target is a string or something (possible in deep copy) if ( typeof target !== "object" && !jQuery.isFunction( target ) ) { target = {}; } // Extend jQuery itself if only one argument is passed if ( i === length ) { target = this; i--; } for ( ; i < length; i++ ) { // Only deal with non-null/undefined values if ( ( options = arguments[ i ] ) != null ) { // Extend the base object for ( name in options ) { src = target[ name ]; copy = options[ name ]; // Prevent never-ending loop if ( target === copy ) { continue; } // Recurse if we're merging plain objects or arrays if ( deep && copy && ( jQuery.isPlainObject( copy ) || ( copyIsArray = jQuery.isArray( copy ) ) ) ) { if ( copyIsArray ) { copyIsArray = false; clone = src && jQuery.isArray( src ) ? src : []; } else { clone = src && jQuery.isPlainObject( src ) ? src : {}; } // Never move original objects, clone them target[ name ] = jQuery.extend( deep, clone, copy ); // Don't bring in undefined values } else if ( copy !== undefined ) { target[ name ] = copy; } } } } // Return the modified object return target; }; jQuery.extend( { // Unique for each copy of jQuery on the page expando: "jQuery" + ( version + Math.random() ).replace( /\D/g, "" ), // Assume jQuery is ready without the ready module isReady: true, error: function( msg ) { throw new Error( msg ); }, noop: function() {}, isFunction: function( obj ) { return jQuery.type( obj ) === "function"; }, isArray: Array.isArray, isWindow: function( obj ) { return obj != null && obj === obj.window; }, isNumeric: function( obj ) { // As of jQuery 3.0, isNumeric is limited to // strings and numbers (primitives or objects) // that can be coerced to finite numbers (gh-2662) var type = jQuery.type( obj ); return ( type === "number" || type === "string" ) && // parseFloat NaNs numeric-cast false positives ("") // ...but misinterprets leading-number strings, particularly hex literals ("0x...") // subtraction forces infinities to NaN !isNaN( obj - parseFloat( obj ) ); }, isPlainObject: function( obj ) { var proto, Ctor; // Detect obvious negatives // Use toString instead of jQuery.type to catch host objects if ( !obj || toString.call( obj ) !== "[object Object]" ) { return false; } proto = getProto( obj ); // Objects with no prototype (e.g., `Object.create( null )`) are plain if ( !proto ) { return true; } // Objects with prototype are plain iff they were constructed by a global Object function Ctor = hasOwn.call( proto, "constructor" ) && proto.constructor; return typeof Ctor === "function" && fnToString.call( Ctor ) === ObjectFunctionString; }, isEmptyObject: function( obj ) { /* eslint-disable no-unused-vars */ // See https://github.com/eslint/eslint/issues/6125 var name; for ( name in obj ) { return false; } return true; }, type: function( obj ) { if ( obj == null ) { return obj + ""; } // Support: Android <=2.3 only (functionish RegExp) return typeof obj === "object" || typeof obj === "function" ? class2type[ toString.call( obj ) ] || "object" : typeof obj; }, // Evaluates a script in a global context globalEval: function( code ) { DOMEval( code ); }, // Convert dashed to camelCase; used by the css and data modules // Support: IE <=9 - 11, Edge 12 - 13 // Microsoft forgot to hump their vendor prefix (#9572) camelCase: function( string ) { return string.replace( rmsPrefix, "ms-" ).replace( rdashAlpha, fcamelCase ); }, nodeName: function( elem, name ) { return elem.nodeName && elem.nodeName.toLowerCase() === name.toLowerCase(); }, each: function( obj, callback ) { var length, i = 0; if ( isArrayLike( obj ) ) { length = obj.length; for ( ; i < length; i++ ) { if ( callback.call( obj[ i ], i, obj[ i ] ) === false ) { break; } } } else { for ( i in obj ) { if ( callback.call( obj[ i ], i, obj[ i ] ) === false ) { break; } } } return obj; }, // Support: Android <=4.0 only trim: function( text ) { return text == null ? "" : ( text + "" ).replace( rtrim, "" ); }, // results is for internal usage only makeArray: function( arr, results ) { var ret = results || []; if ( arr != null ) { if ( isArrayLike( Object( arr ) ) ) { jQuery.merge( ret, typeof arr === "string" ? [ arr ] : arr ); } else { push.call( ret, arr ); } } return ret; }, inArray: function( elem, arr, i ) { return arr == null ? -1 : indexOf.call( arr, elem, i ); }, // Support: Android <=4.0 only, PhantomJS 1 only // push.apply(_, arraylike) throws on ancient WebKit merge: function( first, second ) { var len = +second.length, j = 0, i = first.length; for ( ; j < len; j++ ) { first[ i++ ] = second[ j ]; } first.length = i; return first; }, grep: function( elems, callback, invert ) { var callbackInverse, matches = [], i = 0, length = elems.length, callbackExpect = !invert; // Go through the array, only saving the items // that pass the validator function for ( ; i < length; i++ ) { callbackInverse = !callback( elems[ i ], i ); if ( callbackInverse !== callbackExpect ) { matches.push( elems[ i ] ); } } return matches; }, // arg is for internal usage only map: function( elems, callback, arg ) { var length, value, i = 0, ret = []; // Go through the array, translating each of the items to their new values if ( isArrayLike( elems ) ) { length = elems.length; for ( ; i < length; i++ ) { value = callback( elems[ i ], i, arg ); if ( value != null ) { ret.push( value ); } } // Go through every key on the object, } else { for ( i in elems ) { value = callback( elems[ i ], i, arg ); if ( value != null ) { ret.push( value ); } } } // Flatten any nested arrays return concat.apply( [], ret ); }, // A global GUID counter for objects guid: 1, // Bind a function to a context, optionally partially applying any // arguments. proxy: function( fn, context ) { var tmp, args, proxy; if ( typeof context === "string" ) { tmp = fn[ context ]; context = fn; fn = tmp; } // Quick check to determine if target is callable, in the spec // this throws a TypeError, but we will just return undefined. if ( !jQuery.isFunction( fn ) ) { return undefined; } // Simulated bind args = slice.call( arguments, 2 ); proxy = function() { return fn.apply( context || this, args.concat( slice.call( arguments ) ) ); }; // Set the guid of unique handler to the same of original handler, so it can be removed proxy.guid = fn.guid = fn.guid || jQuery.guid++; return proxy; }, now: Date.now, // jQuery.support is not used in Core but other projects attach their // properties to it so it needs to exist. support: support } ); if ( typeof Symbol === "function" ) { jQuery.fn[ Symbol.iterator ] = arr[ Symbol.iterator ]; } // Populate the class2type map jQuery.each( "Boolean Number String Function Array Date RegExp Object Error Symbol".split( " " ), function( i, name ) { class2type[ "[object " + name + "]" ] = name.toLowerCase(); } ); function isArrayLike( obj ) { // Support: real iOS 8.2 only (not reproducible in simulator) // `in` check used to prevent JIT error (gh-2145) // hasOwn isn't used here due to false negatives // regarding Nodelist length in IE var length = !!obj && "length" in obj && obj.length, type = jQuery.type( obj ); if ( type === "function" || jQuery.isWindow( obj ) ) { return false; } return type === "array" || length === 0 || typeof length === "number" && length > 0 && ( length - 1 ) in obj; } var Sizzle = /*! * Sizzle CSS Selector Engine v2.3.0 * https://sizzlejs.com/ * * Copyright jQuery Foundation and other contributors * Released under the MIT license * http://jquery.org/license * * Date: 2016-01-04 */ (function( window ) { var i, support, Expr, getText, isXML, tokenize, compile, select, outermostContext, sortInput, hasDuplicate, // Local document vars setDocument, document, docElem, documentIsHTML, rbuggyQSA, rbuggyMatches, matches, contains, // Instance-specific data expando = "sizzle" + 1 * new Date(), preferredDoc = window.document, dirruns = 0, done = 0, classCache = createCache(), tokenCache = createCache(), compilerCache = createCache(), sortOrder = function( a, b ) { if ( a === b ) { hasDuplicate = true; } return 0; }, // Instance methods hasOwn = ({}).hasOwnProperty, arr = [], pop = arr.pop, push_native = arr.push, push = arr.push, slice = arr.slice, // Use a stripped-down indexOf as it's faster than native // https://jsperf.com/thor-indexof-vs-for/5 indexOf = function( list, elem ) { var i = 0, len = list.length; for ( ; i < len; i++ ) { if ( list[i] === elem ) { return i; } } return -1; }, booleans = "checked|selected|async|autofocus|autoplay|controls|defer|disabled|hidden|ismap|loop|multiple|open|readonly|required|scoped", // Regular expressions // http://www.w3.org/TR/css3-selectors/#whitespace whitespace = "[\\x20\\t\\r\\n\\f]", // http://www.w3.org/TR/CSS21/syndata.html#value-def-identifier identifier = "(?:\\\\.|[\\w-]|[^\0-\\xa0])+", // Attribute selectors: http://www.w3.org/TR/selectors/#attribute-selectors attributes = "\\[" + whitespace + "*(" + identifier + ")(?:" + whitespace + // Operator (capture 2) "*([*^$|!~]?=)" + whitespace + // "Attribute values must be CSS identifiers [capture 5] or strings [capture 3 or capture 4]" "*(?:'((?:\\\\.|[^\\\\'])*)'|\"((?:\\\\.|[^\\\\\"])*)\"|(" + identifier + "))|)" + whitespace + "*\\]", pseudos = ":(" + identifier + ")(?:\\((" + // To reduce the number of selectors needing tokenize in the preFilter, prefer arguments: // 1. quoted (capture 3; capture 4 or capture 5) "('((?:\\\\.|[^\\\\'])*)'|\"((?:\\\\.|[^\\\\\"])*)\")|" + // 2. simple (capture 6) "((?:\\\\.|[^\\\\()[\\]]|" + attributes + ")*)|" + // 3. anything else (capture 2) ".*" + ")\\)|)", // Leading and non-escaped trailing whitespace, capturing some non-whitespace characters preceding the latter rwhitespace = new RegExp( whitespace + "+", "g" ), rtrim = new RegExp( "^" + whitespace + "+|((?:^|[^\\\\])(?:\\\\.)*)" + whitespace + "+$", "g" ), rcomma = new RegExp( "^" + whitespace + "*," + whitespace + "*" ), rcombinators = new RegExp( "^" + whitespace + "*([>+~]|" + whitespace + ")" + whitespace + "*" ), rattributeQuotes = new RegExp( "=" + whitespace + "*([^\\]'\"]*?)" + whitespace + "*\\]", "g" ), rpseudo = new RegExp( pseudos ), ridentifier = new RegExp( "^" + identifier + "$" ), matchExpr = { "ID": new RegExp( "^#(" + identifier + ")" ), "CLASS": new RegExp( "^\\.(" + identifier + ")" ), "TAG": new RegExp( "^(" + identifier + "|[*])" ), "ATTR": new RegExp( "^" + attributes ), "PSEUDO": new RegExp( "^" + pseudos ), "CHILD": new RegExp( "^:(only|first|last|nth|nth-last)-(child|of-type)(?:\\(" + whitespace + "*(even|odd|(([+-]|)(\\d*)n|)" + whitespace + "*(?:([+-]|)" + whitespace + "*(\\d+)|))" + whitespace + "*\\)|)", "i" ), "bool": new RegExp( "^(?:" + booleans + ")$", "i" ), // For use in libraries implementing .is() // We use this for POS matching in `select` "needsContext": new RegExp( "^" + whitespace + "*[>+~]|:(even|odd|eq|gt|lt|nth|first|last)(?:\\(" + whitespace + "*((?:-\\d)?\\d*)" + whitespace + "*\\)|)(?=[^-]|$)", "i" ) }, rinputs = /^(?:input|select|textarea|button)$/i, rheader = /^h\d$/i, rnative = /^[^{]+\{\s*\[native \w/, // Easily-parseable/retrievable ID or TAG or CLASS selectors rquickExpr = /^(?:#([\w-]+)|(\w+)|\.([\w-]+))$/, rsibling = /[+~]/, // CSS escapes // http://www.w3.org/TR/CSS21/syndata.html#escaped-characters runescape = new RegExp( "\\\\([\\da-f]{1,6}" + whitespace + "?|(" + whitespace + ")|.)", "ig" ), funescape = function( _, escaped, escapedWhitespace ) { var high = "0x" + escaped - 0x10000; // NaN means non-codepoint // Support: Firefox<24 // Workaround erroneous numeric interpretation of +"0x" return high !== high || escapedWhitespace ? escaped : high < 0 ? // BMP codepoint String.fromCharCode( high + 0x10000 ) : // Supplemental Plane codepoint (surrogate pair) String.fromCharCode( high >> 10 | 0xD800, high & 0x3FF | 0xDC00 ); }, // CSS string/identifier serialization // https://drafts.csswg.org/cssom/#common-serializing-idioms rcssescape = /([\0-\x1f\x7f]|^-?\d)|^-$|[^\x80-\uFFFF\w-]/g, fcssescape = function( ch, asCodePoint ) { if ( asCodePoint ) { // U+0000 NULL becomes U+FFFD REPLACEMENT CHARACTER if ( ch === "\0" ) { return "\uFFFD"; } // Control characters and (dependent upon position) numbers get escaped as code points return ch.slice( 0, -1 ) + "\\" + ch.charCodeAt( ch.length - 1 ).toString( 16 ) + " "; } // Other potentially-special ASCII characters get backslash-escaped return "\\" + ch; }, // Used for iframes // See setDocument() // Removing the function wrapper causes a "Permission Denied" // error in IE unloadHandler = function() { setDocument(); }, disabledAncestor = addCombinator( function( elem ) { return elem.disabled === true; }, { dir: "parentNode", next: "legend" } ); // Optimize for push.apply( _, NodeList ) try { push.apply( (arr = slice.call( preferredDoc.childNodes )), preferredDoc.childNodes ); // Support: Android<4.0 // Detect silently failing push.apply arr[ preferredDoc.childNodes.length ].nodeType; } catch ( e ) { push = { apply: arr.length ? // Leverage slice if possible function( target, els ) { push_native.apply( target, slice.call(els) ); } : // Support: IE<9 // Otherwise append directly function( target, els ) { var j = target.length, i = 0; // Can't trust NodeList.length while ( (target[j++] = els[i++]) ) {} target.length = j - 1; } }; } function Sizzle( selector, context, results, seed ) { var m, i, elem, nid, match, groups, newSelector, newContext = context && context.ownerDocument, // nodeType defaults to 9, since context defaults to document nodeType = context ? context.nodeType : 9; results = results || []; // Return early from calls with invalid selector or context if ( typeof selector !== "string" || !selector || nodeType !== 1 && nodeType !== 9 && nodeType !== 11 ) { return results; } // Try to shortcut find operations (as opposed to filters) in HTML documents if ( !seed ) { if ( ( context ? context.ownerDocument || context : preferredDoc ) !== document ) { setDocument( context ); } context = context || document; if ( documentIsHTML ) { // If the selector is sufficiently simple, try using a "get*By*" DOM method // (excepting DocumentFragment context, where the methods don't exist) if ( nodeType !== 11 && (match = rquickExpr.exec( selector )) ) { // ID selector if ( (m = match[1]) ) { // Document context if ( nodeType === 9 ) { if ( (elem = context.getElementById( m )) ) { // Support: IE, Opera, Webkit // TODO: identify versions // getElementById can match elements by name instead of ID if ( elem.id === m ) { results.push( elem ); return results; } } else { return results; } // Element context } else { // Support: IE, Opera, Webkit // TODO: identify versions // getElementById can match elements by name instead of ID if ( newContext && (elem = newContext.getElementById( m )) && contains( context, elem ) && elem.id === m ) { results.push( elem ); return results; } } // Type selector } else if ( match[2] ) { push.apply( results, context.getElementsByTagName( selector ) ); return results; // Class selector } else if ( (m = match[3]) && support.getElementsByClassName && context.getElementsByClassName ) { push.apply( results, context.getElementsByClassName( m ) ); return results; } } // Take advantage of querySelectorAll if ( support.qsa && !compilerCache[ selector + " " ] && (!rbuggyQSA || !rbuggyQSA.test( selector )) ) { if ( nodeType !== 1 ) { newContext = context; newSelector = selector; // qSA looks outside Element context, which is not what we want // Thanks to Andrew Dupont for this workaround technique // Support: IE <=8 // Exclude object elements } else if ( context.nodeName.toLowerCase() !== "object" ) { // Capture the context ID, setting it first if necessary if ( (nid = context.getAttribute( "id" )) ) { nid = nid.replace( rcssescape, fcssescape ); } else { context.setAttribute( "id", (nid = expando) ); } // Prefix every selector in the list groups = tokenize( selector ); i = groups.length; while ( i-- ) { groups[i] = "#" + nid + " " + toSelector( groups[i] ); } newSelector = groups.join( "," ); // Expand context for sibling selectors newContext = rsibling.test( selector ) && testContext( context.parentNode ) || context; } if ( newSelector ) { try { push.apply( results, newContext.querySelectorAll( newSelector ) ); return results; } catch ( qsaError ) { } finally { if ( nid === expando ) { context.removeAttribute( "id" ); } } } } } } // All others return select( selector.replace( rtrim, "$1" ), context, results, seed ); } /** * Create key-value caches of limited size * @returns {function(string, object)} Returns the Object data after storing it on itself with * property name the (space-suffixed) string and (if the cache is larger than Expr.cacheLength) * deleting the oldest entry */ function createCache() { var keys = []; function cache( key, value ) { // Use (key + " ") to avoid collision with native prototype properties (see Issue #157) if ( keys.push( key + " " ) > Expr.cacheLength ) { // Only keep the most recent entries delete cache[ keys.shift() ]; } return (cache[ key + " " ] = value); } return cache; } /** * Mark a function for special use by Sizzle * @param {Function} fn The function to mark */ function markFunction( fn ) { fn[ expando ] = true; return fn; } /** * Support testing using an element * @param {Function} fn Passed the created element and returns a boolean result */ function assert( fn ) { var el = document.createElement("fieldset"); try { return !!fn( el ); } catch (e) { return false; } finally { // Remove from its parent by default if ( el.parentNode ) { el.parentNode.removeChild( el ); } // release memory in IE el = null; } } /** * Adds the same handler for all of the specified attrs * @param {String} attrs Pipe-separated list of attributes * @param {Function} handler The method that will be applied */ function addHandle( attrs, handler ) { var arr = attrs.split("|"), i = arr.length; while ( i-- ) { Expr.attrHandle[ arr[i] ] = handler; } } /** * Checks document order of two siblings * @param {Element} a * @param {Element} b * @returns {Number} Returns less than 0 if a precedes b, greater than 0 if a follows b */ function siblingCheck( a, b ) { var cur = b && a, diff = cur && a.nodeType === 1 && b.nodeType === 1 && a.sourceIndex - b.sourceIndex; // Use IE sourceIndex if available on both nodes if ( diff ) { return diff; } // Check if b follows a if ( cur ) { while ( (cur = cur.nextSibling) ) { if ( cur === b ) { return -1; } } } return a ? 1 : -1; } /** * Returns a function to use in pseudos for input types * @param {String} type */ function createInputPseudo( type ) { return function( elem ) { var name = elem.nodeName.toLowerCase(); return name === "input" && elem.type === type; }; } /** * Returns a function to use in pseudos for buttons * @param {String} type */ function createButtonPseudo( type ) { return function( elem ) { var name = elem.nodeName.toLowerCase(); return (name === "input" || name === "button") && elem.type === type; }; } /** * Returns a function to use in pseudos for :enabled/:disabled * @param {Boolean} disabled true for :disabled; false for :enabled */ function createDisabledPseudo( disabled ) { // Known :disabled false positives: // IE: *[disabled]:not(button, input, select, textarea, optgroup, option, menuitem, fieldset) // not IE: fieldset[disabled] > legend:nth-of-type(n+2) :can-disable return function( elem ) { // Check form elements and option elements for explicit disabling return "label" in elem && elem.disabled === disabled || "form" in elem && elem.disabled === disabled || // Check non-disabled form elements for fieldset[disabled] ancestors "form" in elem && elem.disabled === false && ( // Support: IE6-11+ // Ancestry is covered for us elem.isDisabled === disabled || // Otherwise, assume any non-