Welcome to ipyrad

Documentation

The ipyrad ethos

Welcome to ipyrad, an interactive assembly and analysis toolkit for restriction-site associated DNA (RAD-seq) and related data types. Please explore the documentation to find out more about the features of ipyrad.

Our goals:

  • Simple: Easy to install, easy to use.
  • Resourceful: Documentation, tutorials, cookbooks, and help forums available.
  • Reproducible: Promoting the use of Jupyter Notebooks to organize workflows.
  • Flexible: API access to functions and data to build custom assemblies.
  • Transparent: Providing human readable code and data files.

Contact us:

  • Questions? Join the conversation on gitter.
  • Have a feature request? Raise a ticket on github.

Try it now

Try ipyrad now in the cloud

You can easily try ipyrad without even having to install it by connecting to an interactive jupyter notebook running in the cloud: binder link.

This service provides access to only a single computing core, so ipyrad will run much faster on a larger workstation or server, but the binder example is sufficient for exploring the ipyrad Python API for assembly and data analysis.
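
If you open the binder notebook, a minimal first cell looks something like the sketch below (the assembly name and data path here are placeholders; substitute paths to your own demultiplexed fastq files).

## import the ipyrad API and create a named Assembly
import ipyrad as ip
data = ip.Assembly("binder_test")

## point it at some demultiplexed fastq files and run step 1
data.set_params("sorted_fastq_path", "./example_fastqs/*.fastq.gz")
data.run("1")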

Installation

ipyrad can be installed using pip or conda. We strongly recommend the conda version. If you are not familiar with conda then please check out our long-form installation instructions below to start by installing the conda package manager.

Conda install

ipyrad is available for Python >=3.5.

conda install ipyrad -c conda-forge -c bioconda

Alternative: install from GitHub

You can alternatively install ipyrad from its source code on GitHub, or pin a specific release (for example, version 0.9.56). This is not recommended unless you are involved in development.

Details: Dependencies

The following packages and command-line tools are installed as dependencies of ipyrad:

  • numpy
  • scipy
  • pandas
  • h5py
  • mpi4py
  • numba
  • ipyparallel
  • pysam
  • cutadapt
  • requests
  • muscle
  • samtools
  • bedtools
  • bwa
  • vsearch
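
A quick sanity check after installation is to import ipyrad in Python and print its version string (this assumes the installed package exposes __version__, as recent releases do):

## confirm that ipyrad imports correctly and report its version
import ipyrad as ip
print(ip.__version__)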

Details: Long-form instructions

We put significant effort into making the installation process for ipyrad as easy as possible, whether you are working on your own desktop computer, or remotely on a large computing cluster. Simply copy and paste a few lines of code below and you will be ready to go.

The easiest way to install ipyrad and all of its dependencies is with conda, a command line program for installing Python packages. Follow the instructions below to first install the conda package manager on your system (the code below installs the Python 3 version, which is recommended).

Conda comes in two flavors, anaconda and miniconda. The only difference between the two is that anaconda installs a large suite of commonly used Python packages along with the base installer, whereas miniconda installs only a bare bones version that includes just the framework for installing new packages. We recommend miniconda, and that's what we'll use here.

The code below includes a line that will download the conda installer. Make sure you follow either the Linux or Mac instructions, whichever is appropriate for your system. If you are working on an HPC cluster it is almost certainly Linux.

While conda is installing it will ask you to answer yes to a few questions, including whether it can append the newly created miniconda/ (or anaconda/) directory to your $PATH; say yes. This adds a line to your ~/.bashrc (or ~/.bash_profile on Mac) file so that the software in your conda directory can be found automatically by the system whenever you log in.

Mac install instructions for conda
# The curl command is used to download the installer from the web.
# Take note that the -O flag is a capital o not a zero.
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

# Install miniconda into $HOME/miniconda3
#  * Type 'yes' to agree to the license
#  * Press Enter to use the default install directory
#  * Type 'yes' to initialize the conda install
bash Miniconda3-latest-MacOSX-x86_64.sh

# Refresh your terminal session to see conda
bash

# test that conda is installed. Will print info about your conda install.
conda info
Linux install instructions for conda
# Fetch the miniconda installer with wget
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Install miniconda into $HOME/miniconda3
#  * Type 'yes' to agree to the license
#  * Press Enter to use the default install directory
#  * Type 'yes' to initialize the conda install
bash Miniconda3-latest-Linux-x86_64.sh

# Refresh your terminal session to see conda
bash

# test that conda is installed. Will print info about your conda install.
conda info
Details: ipyrad on HPC

If you’re working on an HPC cluster we still recommend that you follow the instructions above to install your own local miniconda directory that you can use to install local software into. However, you can alternatively ask your administrator to install ipyrad into a system-wide conda distribution (and a specific conda environment) which you and many other users can then use. The drawback of this approach is that if you want to upgrade or install additional software tools you need to ask your administrator and this will likely cause delays.

Getting Started: Files and Data Types

What kind of data can ipyrad assemble?

ipyrad can assemble any type of data that is generated using a restriction digest method (RAD, ddRAD, GBS) or related amplification-based process (e.g., NextRAD, RApture), all of which yield data that is (mostly) anchored on at least one side so that reads align fairly closely. ipyrad is not optimized for constructing long contigs from shotgun sequence data (i.e., genome assembly), but can construct reasonably sized contigs from partially merged paired-end reads, or partially overlapping reads. ipyrad is flexible to different data types and can combine reads of various lengths, so that data from different sequencing runs or projects can be easily combined.

Filtering/Trimming data

It is generally good practice to run the program fastqc on your raw data when you first get it to obtain an idea of the quality of your reads, and the presence of adapter contamination. You do not need to trim your reads before starting an assembly, since ipyrad includes a built-in and recommended trimming step during step 2 of assembly (using the software tool cutadapt). If you do choose to trim your data beforehand, however, it should not cause any problems.

Step 2 of the ipyrad assembly will apply different filters depending on your parameter settings to filter and trim data based on quality scores and/or the occurrence of barcode+adapter combinations. For paired-end data ipyrad will merge overlapping reads (using vsearch for denovo assembly or simply based on mapping positions for reference-mapped assembly).

Fastq Data Files and File Names

Depending on how and where your sequence data were generated you may receive data as one giant file, or in many smaller files. The files may contain data from all of your individuals mixed together, or as separate files for each Sample. If they are mixed up then the data need to be demultiplexed based on barcodes or indices. Step 1 of ipyrad can take data of either format, and will either demultiplex the reads or simply count/load the pre-demultiplexed data. See the Demultiplexing section for details.

Supported data types

There is an increasingly large variety of ways to generate reduced representation genomic data sets using either restriction digestion or primer sets, and ipyrad aims to be flexible enough to handle all of these types. Because it is difficult to keep up with all of the names, we use our own terminology, described below, to group together data types that can be analyzed using the same bioinformatic methods. If you have a data type that is not described below and you're not sure whether it can be analyzed in ipyrad, let us know here.

rad – This category includes data types which use a single cutter to generate DNA fragments for sequencing based on a single cut site. e.g., RAD-seq, NextRAD.

ddrad – This category is very similar, but includes data types which select fragments that were digested by two different restriction enzymes, one cutting each end of the fragment. During assembly this type of data is analyzed differently from the rad data type, with more stringent filtering that looks for occurrences of the second (usually more common) cutter. e.g., double-digest RAD-seq.

gbs – This category includes any data type which selects fragments that were digested by a single enzyme that cuts both ends of DNA fragments. This data type requires reverse-complement clustering because the forward vs. reverse adapters can attach to either end of each fragment, and thus when shorter fragments are sequenced from either end the resulting reads often overlap partially or completely. When analyzing GBS data we strongly recommend using a stringent setting for the filter_adapters parameter. e.g., genotyping-by-sequencing (Elshire et al.), EZ-RAD (Toonen et al.).

pairddrad – This category is for paired-end data from fragments that were generated through restriction digestion using two different enzymes. During step 3 the paired-reads will be tested for paired read merging if they overlap partially. Because two different cutters are used reverse-complement clustering is not necessary. e.g., double-digest RAD-seq (w/ paired-end sequencing).

pairgbs – This category is for paired-end data from fragments that were generated by digestion with a single enzyme that cuts both ends of the fragment. Because the forward adapter might bind to either end of these fragments, approximately half of the matches are expected to be reverse-complemented with perfect overlap. Paired reads are checked for merging before clustering/mapping. e.g., genotyping-by-sequencing, EZ-RAD (w/ paired-end sequencing).

2brad – This category is for a special class of sequenced fragments generated using a type IIb restriction enzyme. The reads are usually very short in length, and are treated slightly differently in steps 1, 3, and 6. Essentially it is treated like ‘gbs’ during steps 3 and 6 (reverse complement matching). (We are looking for people to do more testing of this method on empirical data).

pair3rad – This category is for 3RAD/RadCap data that uses combinatorial barcodes and unique identifiers for removing PCR duplicates. This data is always paired-end, since one barcode is ligated to each read. PCR clones are removed in step 3, after merging but before dereplication. The pair3rad datatype is used for both 3RAD and RadCap types because these datatypes only differ in how they are generated, not in how they are demultiplexed and filtered. See Glenn et al. 2016 and Hoffberg et al. 2016.
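
Whichever category above fits your library, it is selected with the datatype parameter (together with the matching restriction_overhang), either in the params file or through the Python API. A brief sketch, with a placeholder assembly name:

## select the datatype and restriction overhangs for a paired ddRAD library
import ipyrad as ip
data = ip.Assembly("ddrad_example")
data.set_params("datatype", "pairddrad")
data.set_params("restriction_overhang", ("TGCAG", "AATT"))   ## (cut1, cut2), as in the params file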

Getting Started: Demultiplexing

Demultiplexing is the process of sorting sequenced reads into separate files for each sample in a sequenced run. You may have received your data already demultiplexed, with a separate file for each sample. If so, then you can proceed to the next section. If your data are not yet sorted into separate files then you will need to perform demultiplexing during step 1 of the ipyrad assembly.

Multiplexing and Multiple Libraries

If your data are not yet sorted among individuals/samples then you will need to have barcode/index information organized into a barcodes file to sort data to separate files for each sample. ipyrad has several options for demultiplexing by internal barcodes or external i7 indices, and for combining samples from many different sequencing runs together into a single analysis, or splitting them into separate analyses, as well as for merging data from multiple sequenced lanes into the same sample names (e.g., technical replicates). See the Demultiplexing section for simple examples, and the Cookbook section for further detailed examples.


Sample Names

When demultiplexing, Sample names will be extracted from the barcodes file, whereas if your data are already demultiplexed, Sample names are extracted from the file names directly. Do not include spaces in file names. For paired-end data we need to be able to identify which R1 and R2 files go together, and so we require that every read1 file name contains the string _R1_ (with underscores before and after), and every R2 file name must match the R1 file exactly except that it has _R2_ in place of _R1_. See the example data for an example.

Note

Pay careful attention to file names at the very beginning of an analysis since these names, and any included typos, will be perpetuated through all the resulting data files. Do not include spaces in file names.
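
Because R1/R2 pairing is inferred from these file names, it can be worth checking them before starting step 1. The sketch below (the directory path is a placeholder) lists each _R1_ file and whether its expected _R2_ partner exists:

## check that every _R1_ file has a matching _R2_ file
import glob
import os

r1_files = sorted(glob.glob("demux_fastqs/*_R1_*.fastq.gz"))
for r1 in r1_files:
    r2 = r1.replace("_R1_", "_R2_")
    status = "OK" if os.path.exists(r2) else "missing R2"
    print(os.path.basename(r1), "->", status)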

Barcodes file

The barcodes file is a simple table linking barcodes to samples. Barcodes can be of varying lengths. Each line should have one name and then one barcode, separated by whitespace (a tab or spaces).

sample1     ACAGG
sample2     ATTCA
sample3     CGGCATA
sample4     AAGAACA

Combinatorial indexing

To perform combinatorial indexing you will need to enter two barcodes for each sample name. These should be ordered so that the barcode on read1 is first and the barcode on read2 second. A simple way to ensure that barcodes are attached to your reads in the way that you expect is to look at the raw data files (e.g., use the command line tool less) and check for the barcode sequences.

sample1     ACAGG   TTCCG
sample2     ATTCA   CCGGAA
sample3     CGGCAT  GAGTCC
sample4     AAGAAC  CACCG

i7 indexing

External barcodes/indexes can also be attached outside of the sequenced read, on the Illumina adapters. This is often used to combine multiple plates together onto a single sequencing run. You can find the i7 index in the header line of each read in a fastq file. ipyrad can demultiplex using i7 indices if you turn on a special flag. An example of how to do this using the ipyrad API is available in the cookbook section.

lib1     CCGGAA
lib2     AATTCC

Combining multiple libraries

With ipyrad it is very easy to combine multiple sequenced libraries into a single assembly. This is accomplished by demultiplexing each lane of data separately and then combining the libraries by merging. See the merging section below and the cookbook for detailed examples.
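
In the Python API the combining step is done with the merge function (shown in more detail in the merging examples later in this document); a minimal sketch, with placeholder assembly names:

## demultiplex two lanes separately, then merge them into one Assembly
import ipyrad as ip
lane1 = ip.Assembly("lane1")
lane2 = ip.Assembly("lane2")
## ... set raw_fastq_path and barcodes_path for each, then run step 1 ...
merged = ip.merge("combined", [lane1, lane2])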

Assembly: Parameters

The parameters contained in a params file affect the actions that are performed during each step of an ipyrad assembly. The defaults that we chose are fairly reasonable values for most assemblies; however, you will always need to modify at least a few of them (for example, to indicate the location of your data), and often you will want to modify many more. The ability to easily assemble your data set under a range of parameter settings is one of the main features of ipyrad.

Below is an explanation of each parameter setting, the steps of the assembly that it affects, and example entries for the parameter in a params.txt file.

Parameters (Params File)

The parameter input file, which typically includes params.txt in its name, can be created with the -n option from the ipyrad command line. This file lists all of the parameter settings necessary to complete an assembly. A description of how to create and use a params file can be found in the introductory tutorial.

# create a ipyrad params file from the command line
>>> ipyrad -n test
------- ipyrad params file (v.0.9.1-dev)----------------------------------------
test                           ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/home/deren/ipyrad/tests       ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
rad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG,                         ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
0                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus
0.05                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
0.2                            ## [22] [max_SNPs_locus]: Max # SNPs per locus
8                              ## [23] [max_Indels_locus]: Max # of indels per locus
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, l                        ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
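
The same numbered parameters can also be set from the Python API with set_params(), using the parameter name rather than its number (see the API examples later in this document). A brief sketch with placeholder values:

## create an Assembly and set a few parameters by name
import ipyrad as ip
data = ip.Assembly("test")
data.set_params("project_dir", "/home/deren/ipyrad/tests")   ## [1]
data.set_params("datatype", "rad")                           ## [7]
data.set_params("mindepth_majrule", 6)                       ## [12]
data.set_params("clust_threshold", 0.85)                     ## [14]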

0. Assembly name

The Assembly name is used as the prefix for all output files. It should be a unique identifier for the assembly, meaning the set of parameters you are using for the current data set. When I assemble multiple data sets with different parameter combinations I usually either name them consecutively (e.g., data1, data2), or with names indicating their parameter combinations (e.g., data_clust90, data_clust85). The Assembly name cannot be changed after an Assembly is created with the -n flag, but a new Assembly with a different name can be created by branching the Assembly (see branching workflow).

Affected steps: 1-7 Example: new Assemblies are created with the -n or -b options to ipyrad:

>>> ipyrad -n data1                       ## create a new assembly named data1
>>> ipyrad -p params-data1.txt -b data2   ## create a branch assembly named data2

1. Project dir

A project directory can be used to group together multiple related Assemblies. A new directory will be created at the given path if it does not already exist. A good name for Project_dir will generally be the name of the organism being studied. The project dir path should generally not be changed after an analysis is initiated, unless the entire directory is moved to a different location/machine.

Affected steps: 1-7 Example entries into params.txt:

/home/deren/ipyrad/tests/finches   ## [1] create/use project dir called finches
finches                            ## [1] create/use project dir called finches

2. Raw fastq path

This is a path to the location of raw (non-demultiplexed) fastq data files. If your data are already demultiplexed then this should be left blank. The input files can be gzip compressed (i.e., have name-endings with .gz). If you enter a path for raw data files then you should also enter a path to a barcodes file. To select multiple files, or all files in a directory, use a wildcard character (*).

Affected steps = 1 Example entries into params.txt:

/home/deren/ipyrad/tests/data/*.fastq.gz     ## [2] select all gzip data files
~/ipyrad/tests/data/*.fastq.gz               ## [2] select all gzip data files
./ipsimdata/rad_example*.fastq.gz            ## [2] select files w/ `rad_example` in name

3. Barcodes path

This is a path to the location of a barcodes file. This is used in step 1 for demultiplexing, and can also be used in step 2 to improve the detection of adapter/primer sequences that should be filtered out. If your data are already demultiplexed the barcodes path can be left blank.

Affected steps = 1-2. Example entries into params.txt:

/home/deren/ipsimdata/rad_example_barcodes.txt    ## [3] select barcode file
./ipsimdata/rad_example_barcodes.txt              ## [3] select barcode file

4. Sorted fastq path

This is a path to the location of sorted fastq data. If your data are already demultiplexed then this is the location from which data will be loaded when you run step 1. A wildcard character can be used to select multiple files in a directory.

Affected steps = 1 Example entries into params.txt:

/home/deren/ipyrad/tests/ipsimdata/*.fastq.gz    ## [4] select all gzip data files
~/ipyrad/tests/ipsimdata/*.fastq                 ## [4] select all fastq data files
./ipsimdata/rad_example*.fastq.gz                ## [4] select files w/ `rad_example` in name

5. Assembly method

There are four Assembly_methods options in ipyrad: denovo, reference, denovo+reference, and denovo-reference. The latter three all require a reference sequence file (param #6) in fasta format. See the Assembly: Tutorials for an example.

Affected steps = 3, 6 Example entries into params.txt:

denovo                            ## [5] denovo assembly
reference                         ## [5] reference assembly
denovo+reference                  ## [5] reference addition assembly
denovo-reference                  ## [5] reference subtraction assembly

6. Reference sequence

The reference sequence file should be in fasta format. It does not need to be a complete nuclear genome, but could also be any other type of data that you wish to map RAD data to; for example plastome or transcriptome data.

~/ipyrad/tests/ipsimdata/rad_example_genome.fa   ## [6] select fasta file
./data/finch_full_genome.fasta                   ## [6] select fasta file

7. Datatype

There are now many forms of restriction-site associated DNA library preparation methods and thus many differently named data types. We group these into the categories described in the supported data types section above; follow that link to determine the appropriate category for your data type.

rad                       ## [7] rad data type (1 cutter, sonication)
pairddrad                 ## [7] paired ddrad type (2 different cutters)
pairgbs                   ## [7] paired gbs type (1 cutter cuts both ends)

8. Restriction_overhang

The restriction overhang is used during demultiplexing (step1) and also to detect and filter out adapters/primers (in step2), if the filter_adapters parameter is turned on. Identifying the correct sequence to enter for the restriction_overhang can be tricky. You do not enter the restriction recognition sequence, but rather the portion of this sequence that is left attached to the sequenced read after digestion. For example, the enzyme PstI has the following palindromic sequence, with ^ indicating the cut position.

5'...C TGCA^G...'3
3'...G^ACGT C...'5

Digestion with this enzyme results in DNA fragments with the sequence CTGCA adjacent to the cut site, which when sequenced results in the reverse complement TGCAG as the restriction overhang at the beginning of each read. The easiest way to identify the restriction overhang is simply to look at the raw (or demultiplexed) data files yourself. The restriction overhang will be the (mostly) invariant sequence that occurs at the very beginning of each read if the data are already demultiplexed, or right after the barcode in the case of non-demultiplexed data. Use the command below to peek at the first few lines of your fastq files to find the invariant sequence.

## gunzip -c decompresses the file and prints it to the screen,
## and `head` limits the output to the first 100 lines
gunzip -c my_R1_input_file.fastq.gz | head -n 100
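
The same peek can be done from Python if you prefer (the file name below is a placeholder):

## print the first 100 lines of a gzipped fastq file
import gzip
from itertools import islice

with gzip.open("my_R1_input_file.fastq.gz", "rt") as infile:
    for line in islice(infile, 100):
        print(line.strip())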

This will print something like the following. You can see that each of the lines of sequence data begins with TGCAG followed by variable sequence data. For data that used two cutters (ddrad), you will likely not be able to see the second cutter overhang in single-end reads, but if your data are paired-end, then the _R2_ files will begin with the second restriction_overhang. The second restriction_overhang is only used to detect adapters/primers if the filter_adapters parameter is set > 1. The second restriction_overhang can optionally be left blank.

@HWI-ST609:152:D0RDLACXX:2:2202:18249:93964 1:N:0:
TGCAGCAGCAAGTGCTATTCGTACAGTCATCGATCAGGGTATGCAACGAGCAGAAGTCATGATAAAGGGTCCCGGTCTAGGAAGAGACGCAGCATTA
+
BDFFHHHHHJJJHIJJJJJIJIGJJJJJJJJJJJJJJJJDGGIJJJJJHIIIJJJJHIIHIGIGHHHHFFFFEDDDDDACCDCDDDDDDDDDBBDC:
@HWI-ST609:152:D0RDLACXX:2:2202:18428:93944 1:N:0:
TGCAGGATATATAAAGAATATACCAATCCTAAGGATCCATAGATTTAATTGTGGATCCAACAATAGAAACATCGGCTCAACCCTTTTAGTAAAAGAT
+
ADEFGHFHGIJJJJJJIJJIIJJJIJJIJGIJJJJJJJJIJJIJJJJIIIGGHIEGHJJJJJJG@@CG@@DDHHEFF>?A@;>CCCDC?CDDCCDDC
@HWI-ST609:152:D0RDLACXX:2:2202:18489:93970 1:N:0:
TGCAGCCATTATGTGGCATAGGGGTTACATCCCGTACAAAAGTTAATAGTATACCACTCCTACGAATAGCTCGTAATGCTGCGTCTCTTCCTAGACC
+
BDFFHHHHHJJJJIJJIJJJJJJJHIJJJJJJJJHIIJJJJIFHIJJJJFGIIJFHIJJIJJIFBHHFFDFEBACEDCDDDDBBBDCCCDDDCDDC:
@HWI-ST609:152:D0RDLACXX:2:2202:18714:93960 1:N:0:
TGCAGCATCTGGAAATTATGGGGTTATTTCACAGAAGCTGGAATCTCTTGGGCAATTTCACAGAATCTGGGAATATCTGGGGTAAATCTGCAAGATC
+
BDEFHHHHHJJJIJJJJJJJJJJCGIIJJGHJJJJJJJJJJIJJIJJJIHIJJJJJJJHHIIJJJJJJJHGHHHFEFFFDEDABDDFDDEDDDDDDA
@HWI-ST609:152:D0RDLACXX:2:2202:20484:93905 1:N:0:
TGCAGAGGGGGATTTTCTGGAGTTCTGAGCATGGACTCGTCCCGTTTGTGCTGCTCGAACACTGACGTTACTTCGTTGATCCCTATGGACTTGGTCA
+
ADEEHHHHGJJHIIJJJJJJIJFHJIIIJJJJIIIJJJJEHHIIGHIGGHGHHHHFFEEEDDDD;CBBDDDACC@DC?<CDDCCCCCCA@CCD>>@:
@HWI-ST609:152:D0RDLACXX:2:2202:20938:93852 1:N:0:
TGCAGAAGCTGGAGATTCTGGGGCAGCTTTGCAGCAAGCTGAAAATTCTGGGGGTCGATCTGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG

In some cases restriction enzymes can bind to more than one specific sequence, for example ApoI will bind to AATTY (i.e. AATTC and AATTT). If you used an enzyme with reduced specificity you can include ambiguity codes in the restriction overhang sequence.

Affected steps = 1,2. Example entries to params.txt file:

TGCAG                     ## [8] single cutter (e.g., rad, gbs)
TGCAG, AATT               ## [8] double digest (e.g., ddrad, pairddrad)
CWGC                      ## [8] single cutter w/ degenerate base

NB: 3RAD and SeqCap data can use up to 4 restriction enzymes. If you have this kind of data, simply list all the restriction overhangs for all your cutters.

CTAGA, CTAGC, AATTC               ## [8] 3rad data (multiple cutters)

9. max_low_qual_bases

During step 2 bases are trimmed from the 3’ end of reads when the quality score is consistently below 20 (a threshold that can be changed via phred_Qscore_offset). However, your reads may still contain some number of ambiguous (N) sites that were not trimmed based on quality scores, and these will affect the efficiency and accuracy of clustering downstream. This parameter sets the upper limit on the number of Ns allowed in reads. The default value for max_low_qual_bases is 5. I would generally recommend against increasing this value greatly.

Affected steps = 2. Example entries to params.txt:

0                      ## [9] allow zero low quality bases in a read
5                      ## [9] allow up to five low quality bases in a read

10. Phred_Qscore_offset

Bases are trimmed from the 3’ end of reads if their quality score is below 20. The default offset for quality scores is 33. Some older data use a qscore offset of 64, but this is increasingly rare. You can adjust the offset number to change the threshold for trimming. For example, reducing the offset from 33 to 23 is equivalent to changing the minimum quality score from 20 to 10, which corresponds to approximately a 90% probability of a correct base call.

Affected steps = 2. Example entries to params.txt:

33                 ## [10] default offset of 33, converts to min score=20
43                 ## [10] offset increased by 10, converts to min score=30
64                 ## [10] offset used by older data, converts to min score=20.
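
For reference, the offset is simply the number subtracted from the ASCII value of each character in the fastq quality line to obtain the integer quality score:

## with the standard offset of 33, '5' encodes Q20 and 'I' encodes Q40
for char in "5I":
    print(char, "->", ord(char) - 33)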

11. mindepth_statistical

This is the minimum depth at which statistical base calls will be made during step 5 consensus base calling. By default this is set to 6, which for most reasonable error rate estimates is approximately the minimum depth at which a heterozygous base call can be distinguished from a sequencing error.

Affected steps = 4, 5. Example entries to params.txt

6                 ## [11] set mindepth statistical to 6
10                ## [11] set to 10

12. mindepth_majrule

This is the minimum depth at which majority-rule base calls are made during step 5 consensus base calling. By default this is set to the same value as mindepth_statistical, such that only statistical base calls are made. This value must be <= mindepth_statistical. If it is lower, then sites with coverage >= mindepth_majrule and < mindepth_statistical will receive majority-rule calls. If your data set is very low coverage, such that many clusters are excluded due to low sequencing depth, then lowering mindepth_majrule can be an effective way to increase the amount of usable information in your data set. However, you should be aware that majority-rule consensus base calls will underestimate heterozygosity.

Affected steps = 4, 5. Example entries to params.txt:

6                 ## [12] set to relatively high value similar to mindepth_stat
2                 ## [12] set below the statistical limit for base calls.

13. maxdepth

Sequencing coverage is often highly uneven among loci due to differences in the rate at which fragments are amplified during library preparation, the extent of which varies across different library prep methods. Moreover, repetitive regions of the genome may appear highly similar and thus form clusters of very high depth. Setting a maxdepth helps to remove the latter problem, but at the expense of potentially removing good clusters that simply were sequenced to high depth. The default maxdepth is set quite high (10,000), but you may change it as you see fit.

Affected steps = 4, 5. Example entries to params.txt:

10000             ## [13] maxdepth above which clusters are excluded.

14. clust_threshold

This is the level of sequence similarity at which two sequences are identified as being homologous, and thus cluster together. The value should be entered as a decimal (e.g., 0.90). We do not recommend using values higher than 0.95, as homologous sequences may fail to cluster together at such a high threshold due to the presence of Ns, indels, sequencing errors, or polymorphisms.

Affected steps = 3, 6. Example entries to params.txt:

0.90              ## [14] clust threshold set to 90%
0.85              ## [14] clust threshold set to 85%

15. max_barcode_mismatch

The maximum number of allowed mismatches between the barcodes in the barcodes file and those found in the sequenced reads. Default is 0. Barcodes usually differ by a minimum of 2 bases, so I would not generally recommend using a value >2.

Affected steps = 1. Example entries to params.txt:

0              ## [15] allow no mismatches
1              ## [15] allow 1 mismatched base

16. filter_adapters

It is important to remove Illumina adapters from your data if present. Depending on the fidelity of the size selection procedure implemented during library preparation there is often at least some small proportion of sequences in which the read length is longer than the actual DNA fragment, such that the primer/adapter sequence ends up in the read. This occurs more commonly in double-digest (GBS, ddRAD) data sets that use a common cutter, and can be especially problematic for GBS data sets, in which short fragments are sequenced from either end. The filter_adapters parameter has three settings (0, 1, or 2). If 0, then reads are only removed if they contain more Ns than allowed by the max_low_qual_bases parameter. If 1, then reads are trimmed to the first base which has a Qscore < 20 (on either read for paired data), and also removed if there are too many Ns. If 2, then reads are searched for the common Illumina adapter, plus the reverse complement of the second cut site (if present), plus the barcode (if present), and this part of the read is trimmed. This filter is applied using code from the software cutadapt, which allows for errors within the adapter sequence.

Affected steps = 2. Example entries to params.txt:

0                ## [16] No adapter filtering
1                ## [16] filter based on quality scores
2                ## [16] strict filter for adapters

17. filter_min_trim_len

During step 2 if filter_adapters is > 0 reads may be trimmed to a shorter length if they are either low quality or contain Illumina adapter sequences. By default ipyrad will keep trimmed reads down to a minimum length of 35bp. If you want to set a higher limit you can do so here.

Affected steps = 2. Example entries to params.txt

50                ## [17] minimum trimmed seqlen of 50
75                ## [17] minimum trimmed seqlen of 75

18. max_alleles_consens:

This is the maximum number of unique alleles allowed in (individual) consens reads after accounting for sequencing errors. Default=2, which is appropriate for diploids. At this setting any locus which has a sample with more than 2 alleles detected will be excluded/filtered out. If max_alleles_consens = 1 (haploid) then error rate and heterozygosity are estimated with H fixed to 0.0 in step 4, base calls are made with the estimated error rate, and any consensus reads with more than 1 allele present are excluded. If max_alleles_consens is set > 2 then more alleles are allowed; however, heterozygous base calls are still made under the assumption of diploidy (i.e., heterozygote allele frequency = 50%).

Affected steps = 4, 7. Example entries to params.txt

2                ## [18] diploid base calls, exclude if >2 alleles
1                ## [18] haploid base calls, exclude if >1 allele
4                ## [18] diploid-base calls, exclude if >4 alleles

19. max_Ns_consens:

The maximum fraction of uncalled bases allowed in consens seqs. If a base call cannot be made confidently (statistically) then it is called as ambiguous (N). You do not want to allow too many Ns in consensus reads or it will affect their ability to cluster with consensus reads from other Samples, and it may represent a poor alignment. Default is 0.05.

Affected steps = 5. Example entries to params.txt

0.1                ## [19] allow max of 10% Ns in a consensus seq

20. max_Hs_consens:

The maximum fraction of heterozygous bases allowed in consens seqs. This filter helps to remove poor alignments which will tend to have an excess of Hs. Default is 0.05.

Affected steps = 5. Example entries to params.txt

0.05                ## [20] allow max of 5% Hs in a consensus seq

21. min_samples_locus

The minimum number of Samples that must have data at a given locus for it to be retained in the final data set. If you enter a number equal to the full number of samples in your data set then it will return only loci that have data shared across all samples. Whereas if you enter a lower value, like 4, it will return a more sparse matrix, including any loci for which at least four samples contain data. This parameter is overridden if min_samples values are entered in the popfile. Default value is 4.

Affected steps = 7. Example entries to params.txt

4                ## [21] create a min4 assembly
12               ## [21] create a min12 assembly

22. max_SNPs_locus

Maximum number (or proportion) of SNPs allowed in a final locus. This can remove potential effects of poor alignments in repetitive regions in a final data set by excluding loci with too many SNPs. Setting lower values is likely only helpful for extra filtering of very messy data sets. The default is 0.2.

Affected steps = 7. Example entries to params.txt

0.2               ## [22] allow max of 20% SNPs per locus.
0.05              ## [22] allow max of 5% SNPs per locus.

23. max_Indels_locus

The maximum number of Indels allowed in a final locus. This helps to filter out poor final alignments, particularly for paired-end data. The default is 8.

Affected steps = 7. Example entries to params.txt

5                ## [23] allow max of 5 indels per locus.

24. max_shared_Hs_locus

Maximum number (or proportion) of shared polymorphic sites in a locus. This option is used to detect potential paralogs, as a shared heterozygous site across many samples likely represents clustering of paralogs with a fixed difference rather than a true heterozygous site. Default is 0.5.

Affected steps = 7. Example entries to params.txt

0.25             ## [24] allow hetero site to occur across max of 25% of Samples

25. trim_reads

Sometimes you can look at your fastq data files and see that there was a problem with the sequencing such that the cut site which should occur at the beginning of your reads is either offset by one or more bases, or contains many errors. You can trim off N bases from the beginning or end of R1 and R2 reads during step 2 by setting the number of bases here. This could similarly be used to trim all reads to a uniform length (though uniform read lengths are not required in ipyrad).

Affected steps = 2. Example entries to params.txt

0, 0, 0, 0       ## [25] does nothing
5, 0, 0, 0       ## [25] trims first 5 bases from R1s
5, -5, 0, 0      ## [25] trims first 5 bases and last 5 bases from R1s
5, 80, 0, 0      ## [25] trims first 5 bases from R1s and trims maxlen to 80
5, 75, 5, 75     ## [25] trims first 5 from R1 and R2, and maxlen to 75.

26. trim_loci

Trim N bases from the edges of final aligned loci. This can be useful in denovo data sets in particular, where the 3’ edge of reads is less well aligned than the 5’ edge, and thus error rates are sometimes higher at the ends of reads.

Affected steps = 7. Example entries to params.txt

0, 0, 0, 0     ## [26] no locus edge trimming
5, 0, 0, 0     ## [26] trims first 5 bases from R1s in aligned locus
0, 5, 5, 0     ## [26] trims last 5 bases from R1s and first 5 from R2s

27. output_formats

Disk space is cheap, and these are quick to make, so by default we make all formats. More are coming (alleles, treemix, migrate-n, finestructure). The short list of available options is below but see output formats section for full descriptions of the available formats.

p: PHYLIP (Full dataset)
s: PHYLIP (SNPs only)
u: PHYLIP (One SNP per locus)
n: NEXUS
k: STRUCTURE
g: EIGENSTRAT .geno
G: G-PhoCS
v: VCF (SNPs only)

Affected steps = 7. Example entries to params.txt

*                     ## [27] Make all output datatypes
n, v, g               ## [27] Only write out nexus, vcf and geno formats
u,k                   ## [27] Only write out unlinked snps in phylip, and structure

28. pop_assign_file

Population assignment file for creating population output files, or assigning min_samples_locus value to each population. Enter a path to the file. (see below for details of the file).

Affected step: 7. Example entries to params.txt

/home/user/ipyrad/popfile.txt        ## [28] example...

The population assignment file should be formatted as a plain text, whitespace-delimited list of individuals and population assignments. Take care with spelling and capitalization. Each line should contain a sample name followed by the population name to which that sample is assigned. One or more additional lines should be included that start with one or more “#” characters. These special lines tell ipyrad how many samples must have data within each population for a locus to be retained in the final assembly, and thus assign different min_samples_locus values to each population. This will override the global min_samples_locus value.

See the example below.

Sample1 pop1
Sample2 pop1
Sample3 pop1
Sample4 pop2
Sample5 pop2

# pop1:2 pop2:2
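
Because the popfile is plain text it can be written by hand or from a script; the sketch below writes the example above to a placeholder path and then points parameter 28 at it using the Python API:

## write the example population assignment file and set pop_assign_file
import ipyrad as ip

lines = [
    "Sample1 pop1", "Sample2 pop1", "Sample3 pop1",
    "Sample4 pop2", "Sample5 pop2",
    "",
    "# pop1:2 pop2:2",
]
with open("popfile.txt", "w") as out:
    out.write("\n".join(lines) + "\n")

data = ip.Assembly("test")
data.set_params("pop_assign_file", "./popfile.txt")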

Assembly: Seven Steps

The goal of the assembly process is to convert raw or sorted fastq data into assembled loci that can be formatted for downstream analyses in phylogenetic or population genetic inference software. In ipyrad we have purposefully atomized this process into seven sequential steps to create a modular workflow that can be easily restarted if interrupted, and can be branched at different points to create assemblies under different combinations of parameter settings.

Basic Assembly Workflow

The simplest use of ipyrad is to assemble a data set under a single set of parameters defined in a params file. Step 1 loads/assigns data to each sample; steps 2-5 process data for each sample; step 6 identifies orthologs across samples; and step 7 filters the orthologs and writes formatted files for downstream analyses.

[Figure: schematic of the seven assembly steps (steps.png)]

The code to run a basic workflow is quite simple:

# create an initial Assembly params file
>>> ipyrad -n data1

# enter values into the params file using a text editor
## ... editing params-data1.txt

# select a params file (-p) and steps to run (-s) for this assembly
>>> ipyrad -p params-data1.txt -s 1234567

Advanced Branching workflow

A more effective way to use ipyrad can be to create branching assemblies in which multiple data sets are assembled under different parameter settings. The schematic below shows an example where an assembly is branched at step3. The new branch will inherit file paths and statistics from the first Assembly, but can then apply different parameters going forward. Branching does not create hard copies of existing data files, and so is not an “expensive” action in terms of disk space or time. We suggest it be used quite liberally whenever applying a new set of parameters.

[Figure: schematic of a branching assembly workflow (steps_branching.png)]

The code to run a branching workflow is only a bit more complex than the basic workflow. You can find more branching examples in the advanced tutorial and cookbook sections.

## create an initial Assembly and params file, here called 'data1'
>>> ipyrad -n data1

## edit the params file for data1 with your text editor
## ... editing params-data1.txt

## run steps 1-2 with the params file
>>> ipyrad -p params-data1.txt -s 12

## create a new branch of 'data1' before step3, here called 'data2'.
>>> ipyrad -p params-data1.txt -b data2

## edit the params file for data2 using a text editor
## ... editing params-data2.txt

## run steps 3-7 for both assemblies
>>> ipyrad -p params-data1.txt -s 34567
>>> ipyrad -p params-data2.txt -s 34567

Seven Steps

1. Demultiplexing / Loading fastq files

Step 1 loads sequence data into a named Assembly and assigns reads to Samples (individuals). If the data are not yet demultiplexed then step 1 uses information from a barcodes file to demultiplex the data, otherwise, it simply reads the data for each Sample.

The following parameters are potentially used or required (*) for step1:

2. Filtering / Editing reads

Step 2 uses the quality scores recorded in the fastq data files to filter low quality base calls. Sites with a score below a set value are changed into “N”s, and reads with more than the number of allowed “N”s are discarded. The threshold for inclusion is determined by the phred_Qscore_offset parameter. An optional filter can be applied to remove adapters/primers (see filter_adapters), and there is an optional filter to clean up the edges of poor quality reads (see trim_reads).

The following parameters are potentially used or required (*) for step2:

3. Clustering / Mapping reads within Samples and alignment

Step 3 first dereplicates the sequences from step 2, recording the number of times each unique read is observed. If the data are paired-end, it then uses vsearch to merge paired reads which overlap. The resulting data are then either de novo clustered (using vsearch) or mapped to a reference genome (using bwa and bedtools), depending on the selected assembly method. In either case, reads are matched together on the basis of sequence similarity and the resulting clusters are aligned using muscle.

The following parameters are potentially used or required (*) for step3:

4. Joint estimation of heterozygosity and error rate

Step4 jointly estimates sequencing error rate and heterozygosity based on counts of site patterns across clustered reads. These estimates are used in step5 for consensus base calling. If the max_alleles_consens is set to 1 (haploid) then heterozygosity is fixed to 0 and only error rate is estimated. For all other settings of max_alleles_consens a diploid model is used (i.e., two alleles are expected to occur equally).

The following parameters are potentially used or required (*) for step4:

5. Consensus base calling and filtering

Step5 estimates consensus allele sequences from clustered reads given the estimated parameters from step 4 and a binomial model. During this step we filter for maximum number of undetermined sites (Ns) per locus (max_Ns_consens). The number of alleles at each locus is recorded, but a filter for max_alleles is not applied until step7. Read depth information is also stored at this step for the VCF output in step7.
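
To give a sense of the binomial model being applied here, the sketch below compares the likelihood that a site with an imbalanced base count is a homozygote (the minor base reflects sequencing error) versus a heterozygote (both bases expected at roughly equal frequency). This is only an illustration of the idea with a placeholder error rate, not ipyrad's actual implementation.

from math import comb

def site_likelihoods(n_major, n_minor, error=0.001):
    n = n_major + n_minor
    ## homozygous: the minor base is observed only through sequencing error
    p_homo = comb(n, n_minor) * (error ** n_minor) * ((1 - error) ** n_major)
    ## heterozygous: both bases expected in equal proportion
    p_hetero = comb(n, n_minor) * (0.5 ** n)
    return p_homo, p_hetero

## e.g., 9 reads of one base and 3 of another at a single site
print(site_likelihoods(9, 3))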

The following parameters are potentially used or required (*) for step5:

6. Clustering / Mapping reads among Samples and alignment

Step6 clusters consensus sequences across Samples using the same assembly method as in step 3. One allele is randomly sampled before clustering so that ambiguous characters have a lesser effect on clustering, but the resulting data retain information for heterozygotes. The clustered sequences are then aligned using muscle.

The following parameters are potentially used or required (*) for step6:

7. Filtering and formatting output files

Step7 applies filters to the final alignments and saves the final data in a number of possible output formats. This step is most often repeated at several different settings for the parameter 21. min_samples_locus to create different assemblies with different proportions of missing data (see Assembly: Branching and Merging).

The following parameters are potentially used or required (*) for step7:

Example CLI branching workflow

## create a params file named params-data1.txt, and then use a text editor
## to edit the parameter settings in params-data1.txt
ipyrad -n data1

## run steps 1-2 using the default settings
ipyrad -p params-data1.txt -s 12

## branch to create a 'copy' of this assembly named data2
ipyrad -p params-data1.txt -b data2

## edit params-data2.txt to use different parameter settings in a text editor,
## for example, change the clustering threshold from 0.85 to 0.90

## now run the remaining steps (3-7) on each data set
ipyrad -p params-data1.txt -s 34567
ipyrad -p params-data2.txt -s 34567

Example Python API branching workflow

## import ipyrad
import ipyrad as ip

## create an Assembly and modify some parameter settings
data1 = ip.Assembly("data1")
data1.set_params("project_dir", "example")
data1.set_params("raw_fastq_path", "data/*.fastq")
data1.set_params("barcodes_path", "barcodes.txt")

## run steps 1-2
data1.run("12")

## create a new branch of this Assembly named data2
## and change some parameter settings
data2 = data1.branch("data2")
data2.set_params("clust_threshold", 0.90)

## run steps 3-7 for the two Assemblies
data1.run("34567")
data2.run("34567")

Assembly: Branching and Merging

Branching can be used to create a new assembly with a different name, to which you can apply new parameters and from which you can create new downstream files.

Merging can be used to combine samples from multiple libraries.

Drop samples by branching

Branching is also useful for adding or dropping samples from an assembly, either to analyze a subset of samples separately from others, or to exclude samples with low coverage. The branching and merging functions in ipyrad make this easy. By requiring a branching process in order to drop samples from an assembly, ipyrad inherently forces you to retain the parent assembly as a copy. This provides a nice fail-safe so that you can mess around with your new branched assembly without affecting its pre-branched parent assembly.

Examples using the ipyrad CLI

## branch and only keep 3 samples from assembly data1
>>> ipyrad -p params-data1 -b data2 1A0 1B0 1C0

## and/or, branch and only exclude 3 samples from assembly data1
>>> ipyrad -p params-data1 -b data3 - 1A0 1B0 1C0

Examples using the ipyrad Python API

## branch and only keep 3 samples from assembly data1
>>> data1.branch("data2", subsamples=["1A0", "1B0", "1C0"])

## and/or, branch and only exclude 3 samples from assembly data1
>>> keep_list = [i for i in data1.samples.keys() if i not in ["1A0", "1B0", "1C0"]]
>>> data1.branch("data3", subsamples=keep_list)

Merge Samples or Libraries

There are a number of ways to enter your data into ipyrad and we’ve tried to make it as easy as possible to combine data from multiple libraries and multiple plates in a simple and straightforward way. Here we demonstrate a number of ways to demultiplex and load data under different scenarios:

  1. One Library One Lane of sequencing
  2. One Library Multiple lanes of sequencing
  3. Multiple libraries Multiple lanes of sequencing
  4. Separate multiple libraries from one lane of sequencing
  5. Alternative: Doing all of this with the API instead of the CLI

1. One library One Lane of sequencing

First create a new Assembly, here we’ll call it demux1. Then use a text-editor to edit the params file to enter the raw_fastq_path and the barcodes_path information that is needed to demultiplex the data. To automate the process of editing the params file I use the command-line program sed here to substitute in the new values.

## create a new assembly
ipyrad -n demux1
New file 'params-demux1.txt' created in /home/deren/Documents/ipyrad/tests
## edit the params file to enter your raw_fastq_path and barcodes path
sed -i '/\[2] /c\ipsimdata/rad_example_R1_.fastq.gz  ## [2] ' params-demux1.txt
sed -i '/\[3] /c\ipsimdata/rad_example_barcodes.txt  ## [3] ' params-demux1.txt
## run step 1 to demultiplex the data
ipyrad -p params-demux1.txt -s 1
-------------------------------------------------------------
 ipyrad [v.0.5.15]
 Interactive assembly and analysis of RAD-seq data
-------------------------------------------------------------
 loading Assembly: demux1
 from saved path: ~/Documents/ipyrad/tests/demux1.json
 New Assembly: demux1
 host compute node: [40 cores] on tinus

 Step 1: Demultiplexing fastq data to Samples

 [####################] 100%  sorting reads         | 0:00:06
 [####################] 100%  writing/compressing   | 0:00:00

The demultiplexed data are now located in the directory <project_dir>/<assembly_name>_fastqs/, which in this case is ./demux1_fastqs/. The Assembly demux1 knows the location of the data, and so from here you can proceed in either of two ways: (1) you simply continue on to step 2 using this Assembly object (demux1), or (2) you create a new ‘branch’ of this Assembly, which will start by reading in the sorted_fastq_data. The latter is sometimes clearer in that you keep the demultiplexing steps separate from the assembly steps. It does not make a difference in this example, where we have only one library and one lane of data, but as you will see in the examples below, it is sometimes easier to create multiple separate demux Assemblies that are then merged into a single object for assembly.

## option 1: continue to assemble this data set
ipyrad -p params-demux1 -s 234567
## OR, option 2: create a new Assembly and enter path to the demux data
ipyrad -n New

## enter path to the 'sorted_fastq_data' in params
sed -i '/\[4] /c\./demux1_fastqs/*.gz  ## [4] ' params-New.txt

## assemble this data set
ipyrad -p params-New.txt -s 1234567

2. One Library Multiple Lanes of Sequencing

There are two options for how to join multiple lanes of sequence data that are from the same library (i.e., there is only one barcodes file). (1) The simplest way is to put the multiple raw fastq data files into the same directory and select them all when entering the raw_fastq_path using a wildcard selector (e.g., “*.fastq.gz”). (2) The second way is to create two separate demux Assemblies and then merge them, which I demonstrate below. Because the two demultiplexed lanes each use the same barcodes file the Samples will have identical names. ipyrad will recognize this during merging and read both input files for each Sample in step 2.

## create demux Assembly object for lane 1
ipyrad -n lane1raws
New file 'params-lane1raws.txt' created in /home/deren/Documents/ipyrad/tests
## create demux Assembly object for lane 2
ipyrad -n lane2raws
New file 'params-lane2raws.txt' created in /home/deren/Documents/ipyrad/tests
## edit the params file for lane1 to enter its raw_fastq_path and barcodes file
sed -i '/\[2] /c\ipsimdata/rad_example_R1_.fastq.gz  ## [2] ' params-lane1raws.txt
sed -i '/\[3] /c\ipsimdata/rad_example_barcodes.txt  ## [3] ' params-lane1raws.txt

## edit the params file for lane2 to enter its raw_fastq_path and barcodes file
sed -i '/\[2] /c\ipsimdata/rad_example_R1_.fastq.gz  ## [2] ' params-lane2raws.txt
sed -i '/\[3] /c\ipsimdata/rad_example_barcodes.txt  ## [3] ' params-lane2raws.txt
## demultiplex lane1
ipyrad -p params-lane1raws.txt -s 1
-------------------------------------------------------------
 ipyrad [v.0.5.15]
 Interactive assembly and analysis of RAD-seq data
-------------------------------------------------------------
 New Assembly: lane1raws
 host compute node: [40 cores] on tinus

 Step 1: Demultiplexing fastq data to Samples

 [####################] 100%  sorting reads         | 0:00:06
 [####################] 100%  writing/compressing   | 0:00:01
## demultiplex lane2
ipyrad -p params-lane2raws.txt -s 1
-------------------------------------------------------------
 ipyrad [v.0.5.15]
 Interactive assembly and analysis of RAD-seq data
-------------------------------------------------------------
 New Assembly: lane2raws
 host compute node: [40 cores] on tinus

 Step 1: Demultiplexing fastq data to Samples

 [####################] 100%  sorting reads         | 0:00:06
 [####################] 100%  writing/compressing   | 0:00:00
## merge the two lanes into one Assembly named both
ipyrad -m both params-lane1raws.txt params-lane2raws.txt
-------------------------------------------------------------
 ipyrad [v.0.5.15]
 Interactive assembly and analysis of RAD-seq data
-------------------------------------------------------------

 Merging assemblies: ['params-lane1raws.txt', 'params-lane2raws.txt']
 loading Assembly: lane1raws
 from saved path: ~/Documents/ipyrad/tests/lane1raws.json
 loading Assembly: lane2raws
 from saved path: ~/Documents/ipyrad/tests/lane2raws.json

 Merging succeeded. New params file for merged assembly:

   params-both.txt
## print merged stats of new Assembly
ipyrad -p params-both.txt -r
Summary stats of Assembly both
------------------------------------------------
      state  reads_raw
1A_0      1      39724
1B_0      1      40086
1C_0      1      40272
1D_0      1      39932
2E_0      1      40034
2F_0      1      39866
2G_0      1      40060
2H_0      1      40398
3I_0      1      39770
3J_0      1      39644
3K_0      1      39930
3L_0      1      40016


Full stats files
------------------------------------------------
step 1: ./lane1raws_fastqs/s1_demultiplex_stats.txt
step 2: None
step 3: None
step 4: None
step 5: None
step 6: None
step 7: None
## run remaining steps on the merged assembly
ipyrad -p params-both.txt -s 234567

3. Multiple Libraries Multiple Lanes of Sequencing

The recommended way to combine multiple lanes of data is the same as we just demonstrated above; however, in this case the Samples in each Assembly come from different libraries and so will have different names. Imagine that each lane of sequencing contains a library with 48 Samples in it. In the example above (one library, multiple lanes) the Samples would be combined so that you have 48 Samples, each with data from two fastq files. In contrast, the merging in this example combines two libraries that contain different Samples into a single data set of 96 Samples, where each Sample has one lane of data. A minimal CLI sketch is shown below.
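As a minimal sketch (the library names and params files below are placeholders, not part of the example data), the CLI workflow mirrors the one above: demultiplex each lane with its own barcodes file, then merge the resulting Assemblies.

## demultiplex each library/lane using its own raw data and barcodes file
ipyrad -p params-libA.txt -s 1
ipyrad -p params-libB.txt -s 1

## merge the two libraries into a single assembly and run the remaining steps
ipyrad -m bothlibs params-libA.txt params-libB.txt
ipyrad -p params-bothlibs.txt -s 234567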

4. Separate Multiple Libraries from One Lane of Sequencing

## create new Assembly named lib1
ipyrad -n lib1

## enter raw_fastq_path and barcodes_path into params
sed -i '/\[2] /c\ipsimdata/rad_example_R1_.fastq.gz  ## [2] ' params-lib1.txt
sed -i '/\[3] /c\ipsimdata/rad_example_barcodes.txt  ## [3] ' params-lib1.txt

## demultiplex the lane of data
ipyrad -p params-lib1.txt -s 1

## create a new branch with only the Samples for project 1
ipyrad -p params-lib1.txt -b project1 1A_0 1B_0 1C_0 1D_0

## create another branch with only the Samples for project 2
ipyrad -p params-lib1.txt -b project2 2E_0 2F_0 2G_0 2H_0
## assemble project 1
ipyrad -p params-project1.txt -s 234567
## assemble project 2
ipyrad -p params-project2.txt -s 234567

5. Alternative: Using the ipyrad API to do these things

Using the ipyrad API is an alternative to using the command-line interface (CLI) shown above. As you can see below, writing code with the Python API can be much simpler and more elegant. We recommend using the API inside a Jupyter notebook.

## import ipyrad
import ipyrad as ip
## one lane one library
data1 = ip.Assembly("data1")
data1.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")
data1.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")
data1.run("1234567")
## one library multiple lanes
lib1lane1 = ip.Assembly("lib1lane1")
lib1lane1.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")
lib1lane1.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")
lib1lane1.run("1")

lib1lane2 = ip.Assembly("lib1lane2")
lib1lane2.set_params("raw_fastq_path", "ipsimdata/rad_example_R1_.fastq.gz")
lib1lane2.set_params("barcodes_path", "ipsimdata/rad_example_barcodes.txt")
lib1lane2.run("1")

merged = ip.merge("lib1-2lanes", [lib1lane1, lib1lane2])
merged.run("234567")
## multiple libraries multiple lanes
lib1lane1 = ip.Assembly("lib1lane1")
lib1lane1.set_params("raw_fastq_path", "ipsimdata/lib1_lane1_R1_.fastq.gz")
lib1lane1.set_params("barcodes_path", "ipsimdata/lib1_barcodes.txt")
lib1lane1.run("1")

lib1lane2 = ip.Assembly("lib1lane2")
lib1lane2.set_params("raw_fastq_path", "ipsimdata/lib1_lane2.fastq.gz")
lib1lane2.set_params("barcodes_path", "ipsimdata/lib1_barcodes.txt")
lib1lane2.run("1")

lib2lane1 = ip.Assembly("lib2lane1")
lib2lane1.set_params("raw_fastq_path", "ipsimdata/lib2_lane1.fastq.gz")
lib2lane1.set_params("barcodes_path", "ipsimdata/lib2_barcodes.txt")
lib2lane1.run("1")

lib2lane2 = ip.Assembly("lib2lane2")
lib2lane2.set_params("raw_fastq_path", "ipsimdata/lib2_lane2_.fastq.gz")
lib2lane2.set_params("barcodes_path", "ipsimdata/lib2_barcodes.txt")
lib2lane2.run("1")

fulldata = ip.merge("fulldata", [lib1lane1, lib1lane2, lib2lane1, lib2lane2])
fulldata.run("234567")
## splitting a library into different projects
project1 = ["sample1", "sample2", "sample3"]
project2 = ["sample4", "sample5", "sample6"]

proj1 = fulldata.branch("proj1", subsamples=project1)
proj2 = fulldata.branch("proj2", subsamples=project2)

proj1.run("234567", force=True)
proj2.run("234567", force=True)
## print stats of project 1
print(proj1.stats)

For advanced examples see the CookBook section.

Assembly: Tutorials

The ipyrad command line interface (CLI) is accessed through a terminal. Use the --help (-h) flag to see a description of the main arguments to the CLI. Detailed instructions are available through the tutorials below.

>>> ipyrad -h

Introductory tutorials

Start here to learn the basics. We run through an example simulated single-end RAD-seq data set and give detailed descriptions of files and statistics produced by each step of an assembly. Next, try some advanced methods, like using branching to assemble data sets under a range of parameter settings, and assemble data with respect to a reference genome.

Empirical examples

The following tutorials demonstrate assemblies of publicly available empirical data sets representing different data types. The first analysis (Eaton and Ree, 2013) can be assembled very quickly, and is re-used in our analysis cookbook recipes, below. The others include tips for optimizing ipyrad for use with that data type.

API: ipyrad assembly workflow

Eaton & Ree (2013) single-end RAD data set

Here we demonstrate a denovo assembly for an empirical RAD data set using the ipyrad Python API. This example was run on a workstation with 20 cores and takes about 10 minutes to assemble, but you should be able to run it on a 4-core laptop in ~30-60 minutes.

For our example data we will use the 13 taxa Pedicularis data set from Eaton and Ree (2013) (Open Access). This data set is composed of single-end 75bp reads from a RAD-seq library prepared with the PstI enzyme. The data set also serves as an example for several of our analysis cookbooks that demonstrate methods for analyzing RAD-seq data files. At the end of this notebook there are also several examples of how to use the ipyrad analysis tools to run downstream analyses in parallel.

The figure below shows the ingroup taxa from this study and their sampling locations. The study includes all species within a small monophyletic clade of Pedicularis, including multiple individuals from 5 species and several subspecies, as well as an outgroup species. The sampling essentially spans from population-level variation where species boundaries are unclear, to higher-level divergence where species boundaries are quite distinct. This is a common scale at which RAD-seq data are often very useful.
Setup (software and data files)

If you haven’t done so yet, start by installing ipyrad using conda (see the ipyrad installation instructions) as well as the packages in the cell below. This is easiest to do in a terminal. Then open a jupyter-notebook, like this one, and follow along with the tutorial by copying and executing the code in the cells, adding your own documentation between them using markdown. Feel free to modify parameters to see their effects on the downstream results.

[4]:
## conda install ipyrad -c ipyrad
## conda install toytree -c eaton-lab
## conda install sra-tools -c bioconda
[5]:
## imports
import ipyrad as ip
import ipyrad.analysis as ipa
import ipyparallel as ipp

In contrast to the ipyrad CLI, the ipyrad API gives users much more fine-scale control over the parallelization of their analysis, but this also requires learning a little bit about the library that we use to do this, called ipyparallel. This library is designed for use with jupyter-notebooks to allow massive-scale multi-processing while working interactively.

Understanding the nuts and bolts of it might take a little while, but it is fairly easy to get started, especially in the way it is integrated with ipyrad. To start a parallel client you must run the command-line program ‘ipcluster’. This will start a number of independent Python processes (kernels) to which we can then send bits of work. The cluster can be stopped and restarted independently of this notebook, which is convenient for working on a cluster where connecting to many cores is not always immediately available.

[6]:
# Open a terminal and type the following command to start
# an ipcluster instance with 40 engines:
# ipcluster start -n 40 --cluster-id="ipyrad" --daemonize

# After the cluster is running you can attach to it with ipyparallel
ipyclient = ipp.Client(cluster_id="ipyrad")
Download the data set (Pedicularis)

These data are archived on the NCBI sequence read archive (SRA) under accession id SRP021469. As part of the ipyrad analysis tools we have a wrapper around the SRAtools software that can be used to query NCBI and download sequence data based on accession IDs. Run the code below to download the fastq data files associated with this study. The data will be saved to the specified directory, which will be created if it does not already exist. The compressed file size of the data is a little over 1GB. If you pass your ipyclient to the .run() command below then the download will be parallelized.

[ ]:
## download the Pedicularis data set from NCBI
sra = ipa.sratools(accessions="SRP021469", workdir="fastqs-Ped")
sra.run(force=True, ipyclient=ipyclient)
[####################] 100%  Downloading fastq files | 0:01:19 |
13 fastq files downloaded to /home/deren/Documents/ipyrad/tests/fastqs-Ped
Create an Assembly object

This object stores the parameters of the assembly and the organization of data files.

[6]:
## you must provide a name for the Assembly
data = ip.Assembly("pedicularis")
New Assembly: pedicularis

Set parameters for the Assembly. This will raise an error if a parameter value is of the wrong type or outside the allowed range.

[10]:
## set parameters
data.set_params("project_dir", "analysis-ipyrad")
data.set_params("sorted_fastq_path", "fastqs-Ped/*.fastq.gz")
data.set_params("clust_threshold", "0.90")
data.set_params("filter_adapters", "2")
data.set_params("max_Hs_consens", (5, 5))
data.set_params("trim_loci", (0, 5, 0, 0))
data.set_params("output_formats", "psvnkua")

## see/print all parameters
data.get_params()
0   assembly_name               pedicularis
1   project_dir                 ./analysis-ipyrad
2   raw_fastq_path
3   barcodes_path
4   sorted_fastq_path           ./example_empirical_rad/*.fastq.gz
5   assembly_method             denovo
6   reference_sequence
7   datatype                    rad
8   restriction_overhang        ('TGCAG', '')
9   max_low_qual_bases          5
10  phred_Qscore_offset         33
11  mindepth_statistical        6
12  mindepth_majrule            6
13  maxdepth                    10000
14  clust_threshold             0.9
15  max_barcode_mismatch        0
16  filter_adapters             2
17  filter_min_trim_len         35
18  max_alleles_consens         2
19  max_Ns_consens              (5, 5)
20  max_Hs_consens              (5, 5)
21  min_samples_locus           4
22  max_SNPs_locus              (20, 20)
23  max_Indels_locus            (8, 8)
24  max_shared_Hs_locus         0.5
25  trim_reads                  (0, 0, 0, 0)
26  trim_loci                   (0, 5, 0, 0)
27  output_formats              ('p', 's', 'v', 'n', 'k', 'u')
28  pop_assign_file
Assemble the data set
[24]:
## run steps 1 & 2 of the assembly
data.run("12")
Assembly: pedicularis
[####################] 100%  loading reads         | 0:00:04 | s1 |
[####################] 100%  processing reads      | 0:01:14 | s2 |
[25]:
## access the stats of the assembly (so far) from the .stats attribute
data.stats
[25]:
state reads_raw reads_passed_filter
29154_superba 2 696994 689996
30556_thamno 2 1452316 1440314
30686_cyathophylla 2 1253109 1206947
32082_przewalskii 2 964244 955480
33413_thamno 2 636625 626084
33588_przewalskii 2 1002923 993873
35236_rex 2 1803858 1787366
35855_rex 2 1409843 1397068
38362_rex 2 1391175 1379626
39618_rex 2 822263 813990
40578_rex 2 1707942 1695523
41478_cyathophylloides 2 2199740 2185364
41954_cyathophylloides 2 2199613 2176210
[26]:
## run steps 3-6 of the assembly
data.run("3456")
Assembly: pedicularis
[####################] 100%  dereplicating         | 0:00:07 | s3 |
[####################] 100%  clustering            | 0:05:08 | s3 |
[####################] 100%  building clusters     | 0:00:26 | s3 |
[####################] 100%  chunking              | 0:00:05 | s3 |
[####################] 100%  aligning              | 0:03:06 | s3 |
[####################] 100%  concatenating         | 0:00:15 | s3 |
[####################] 100%  inferring [H, E]      | 0:01:10 | s4 |
[####################] 100%  calculating depths    | 0:00:06 | s5 |
[####################] 100%  chunking clusters     | 0:00:06 | s5 |
[####################] 100%  consens calling       | 0:01:56 | s5 |
[####################] 100%  concat/shuffle input  | 0:00:05 | s6 |
[####################] 100%  clustering across     | 0:03:47 | s6 |
[####################] 100%  building clusters     | 0:00:06 | s6 |
[####################] 100%  aligning clusters     | 0:00:29 | s6 |
[####################] 100%  database indels       | 0:00:14 | s6 |
[####################] 100%  indexing clusters     | 0:00:03 | s6 |
[####################] 100%  building database     | 0:00:29 | s6 |
Branch to create several final data sets with different parameter settings
[11]:
## create a branch for outputs with min_samples = 4 (lots of missing data)
min4 = data.branch("min4")
min4.set_params("min_samples_locus", 4)
min4.run("7")

## create a branch for outputs with min_samples = 13 (no missing data)
min13 = data.branch("min13")
min13.set_params("min_samples_locus", 13)
min13.run("7")

## create a branch with no missing data for ingroups, but allow
## missing data in the outgroups by setting population assignments.
## The population min-sample values overrule the min-samples-locus param
pops = data.branch("min11-pops")
pops.populations = {
    "ingroup": (11, [i for i in pops.samples if "prz" not in i]),
    "outgroup" : (0, [i for i in pops.samples if "prz" in i]),
    }
pops.run("7")

## create a branch with no missing data and with outgroups removed
nouts = data.branch("nouts_min11", subsamples=[i for i in pops.samples if "prz" not in i])
nouts.set_params("min_samples_locus", 11)
nouts.run("7")
Assembly: min4
[####################] 100%  filtering loci        | 0:00:08 | s7 |
[####################] 100%  building loci/stats   | 0:00:01 | s7 |
[####################] 100%  building vcf file     | 0:00:09 | s7 |
[####################] 100%  writing vcf file      | 0:00:00 | s7 |
[####################] 100%  building arrays       | 0:00:04 | s7 |
[####################] 100%  writing outfiles      | 0:00:05 | s7 |
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/min4_outfiles

Assembly: min13
[####################] 100%  filtering loci        | 0:00:02 | s7 |
[####################] 100%  building loci/stats   | 0:00:01 | s7 |
[####################] 100%  building vcf file     | 0:00:03 | s7 |
[####################] 100%  writing vcf file      | 0:00:00 | s7 |
[####################] 100%  building arrays       | 0:00:04 | s7 |
[####################] 100%  writing outfiles      | 0:00:00 | s7 |
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/min13_outfiles

Assembly: min11-pops
[####################] 100%  filtering loci        | 0:00:02 | s7 |
[####################] 100%  building loci/stats   | 0:00:01 | s7 |
[####################] 100%  building vcf file     | 0:00:03 | s7 |
[####################] 100%  writing vcf file      | 0:00:00 | s7 |
[####################] 100%  building arrays       | 0:00:04 | s7 |
[####################] 100%  writing outfiles      | 0:00:00 | s7 |
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/min11-pops_outfiles

Assembly: nouts_min11
[####################] 100%  filtering loci        | 0:00:03 | s7 |
[####################] 100%  building loci/stats   | 0:00:02 | s7 |
[####################] 100%  building vcf file     | 0:00:02 | s7 |
[####################] 100%  writing vcf file      | 0:00:00 | s7 |
[####################] 100%  building arrays       | 0:00:04 | s7 |
[####################] 100%  writing outfiles      | 0:00:00 | s7 |
Outfiles written to: ~/Documents/ipyrad/tests/analysis-ipyrad/nouts_min11_outfiles

View final stats

The .stats attribute shows a stats summary for each sample, and a number of stats dataframes can be accessed for each step from the .stats_dfs attribute of the Assembly.

[9]:
## we can access the stats summary as a pandas dataframes.
min4.stats
[9]:
state reads_raw reads_passed_filter clusters_total clusters_hidepth hetero_est error_est reads_consens
29154_superba 6 696994 689996 134897 35143 0.010 0.002 34628
30556_thamno 6 1452316 1440314 212960 52776 0.011 0.003 51701
30686_cyathophylla 6 1253109 1206947 240824 54282 0.010 0.002 53347
32082_przewalskii 6 964244 955480 151762 42487 0.013 0.002 41841
33413_thamno 6 636625 626084 172709 31072 0.012 0.002 30674
33588_przewalskii 6 1002923 993873 158933 46555 0.013 0.002 45892
35236_rex 6 1803858 1787366 417632 54709 0.010 0.001 54064
35855_rex 6 1409843 1397068 184661 56595 0.013 0.003 55371
38362_rex 6 1391175 1379626 133541 53156 0.008 0.002 52496
39618_rex 6 822263 813990 148107 44016 0.010 0.002 43409
40578_rex 6 1707942 1695523 222328 56653 0.011 0.002 55937
41478_cyathophylloides 6 2199740 2185364 173188 55208 0.008 0.001 54482
41954_cyathophylloides 6 2199613 2176210 301343 76537 0.008 0.003 75167
[25]:
## or print the full stats file
cat $min4.stats_files.s7


## The number of loci caught by each filter.
## ipyrad API location: [assembly].stats_dfs.s7_filters

                            total_filters  applied_order  retained_loci
total_prefiltered_loci              96562              0          96562
filtered_by_rm_duplicates            2939           2939          93623
filtered_by_max_indels                189            189          93434
filtered_by_max_snps                  679             18          93416
filtered_by_max_shared_het            865            708          92708
filtered_by_min_sample              47958          46646          46062
filtered_by_max_alleles             10694           4565          41497
total_filtered_loci                 41497              0          41497


## The number of loci recovered for each Sample.
## ipyrad API location: [assembly].stats_dfs.s7_samples

                        sample_coverage
29154_superba                     21054
30556_thamno                      31790
30686_cyathophylla                26648
32082_przewalskii                 13201
33413_thamno                      18747
33588_przewalskii                 15461
35236_rex                         33228
35855_rex                         33403
38362_rex                         33748
39618_rex                         28135
40578_rex                         34107
41478_cyathophylloides            30826
41954_cyathophylloides            28260


## The number of loci for which N taxa have data.
## ipyrad API location: [assembly].stats_dfs.s7_loci

    locus_coverage  sum_coverage
1                0             0
2                0             0
3                0             0
4             5764          5764
5             4002          9766
6             3650         13416
7             3139         16555
8             3220         19775
9             4153         23928
10            4904         28832
11            5321         34153
12            4511         38664
13            2833         41497


## The distribution of SNPs (var and pis) per locus.
## var = Number of loci with n variable sites (pis + autapomorphies)
## pis = Number of loci with n parsimony informative site (minor allele in >1 sample)
## ipyrad API location: [assembly].stats_dfs.s7_snps

     var  sum_var    pis  sum_pis
0   2348        0  11768        0
1   4482     4482  10822    10822
2   5870    16222   7699    26220
3   6314    35164   4914    40962
4   5708    57996   2967    52830
5   4857    82281   1679    61225
6   3885   105591    874    66469
7   2932   126115    478    69815
8   1955   141755    188    71319
9   1255   153050     72    71967
10   827   161320     19    72157
11   506   166886     10    72267
12   262   170030      4    72315
13   141   171863      3    72354
14    78   172955      0    72354
15    39   173540      0    72354
16    19   173844      0    72354
17    10   174014      0    72354
18     6   174122      0    72354
19     2   174160      0    72354
20     1   174180      0    72354


## Final Sample stats summary

                        state  reads_raw  reads_passed_filter  clusters_total  clusters_hidepth  hetero_est  error_est  reads_consens  loci_in_assembly
29154_superba               7     696994               689996          134897             35143       0.010      0.002          34628             21054
30556_thamno                7    1452316              1440314          212960             52776       0.011      0.003          51701             31790
30686_cyathophylla          7    1253109              1206947          240824             54282       0.010      0.002          53347             26648
32082_przewalskii           7     964244               955480          151762             42487       0.013      0.002          41841             13201
33413_thamno                7     636625               626084          172709             31072       0.012      0.002          30674             18747
33588_przewalskii           7    1002923               993873          158933             46555       0.013      0.002          45892             15461
35236_rex                   7    1803858              1787366          417632             54709       0.010      0.001          54064             33228
35855_rex                   7    1409843              1397068          184661             56595       0.013      0.003          55371             33403
38362_rex                   7    1391175              1379626          133541             53156       0.008      0.002          52496             33748
39618_rex                   7     822263               813990          148107             44016       0.010      0.002          43409             28135
40578_rex                   7    1707942              1695523          222328             56653       0.011      0.002          55937             34107
41478_cyathophylloides      7    2199740              2185364          173188             55208       0.008      0.001          54482             30826
41954_cyathophylloides      7    2199613              2176210          301343             76537       0.008      0.003          75167             28260
[13]:
## and we can access parts of the full stats outputs as dataframes
min4.stats_dfs.s7_samples
[13]:
sample_coverage
29154_superba 21054
30556_thamno 31790
30686_cyathophylla 26648
32082_przewalskii 13201
33413_thamno 18747
33588_przewalskii 15461
35236_rex 33228
35855_rex 33403
38362_rex 33748
39618_rex 28135
40578_rex 34107
41478_cyathophylloides 30826
41954_cyathophylloides 28260
[14]:
## compare this to the one above, coverage is more equal
min13.stats_dfs.s7_samples
[14]:
sample_coverage
29154_superba 2833
30556_thamno 2833
30686_cyathophylla 2833
32082_przewalskii 2833
33413_thamno 2833
33588_przewalskii 2833
35236_rex 2833
35855_rex 2833
38362_rex 2833
39618_rex 2833
40578_rex 2833
41478_cyathophylloides 2833
41954_cyathophylloides 2833
[15]:
## similarly, coverage is equal here among ingroups, but allows missing in outgroups
pops.stats_dfs.s7_samples
[15]:
sample_coverage
29154_superba 5796
30556_thamno 5796
30686_cyathophylla 5796
32082_przewalskii 3266
33413_thamno 5796
33588_przewalskii 3747
35236_rex 5796
35855_rex 5796
38362_rex 5796
39618_rex 5796
40578_rex 5796
41478_cyathophylloides 5796
41954_cyathophylloides 5796
Analysis tools

We have a lot more information about the analysis tools in the ipyrad documentation, but here I’ll show just a quick example of how you can easily access the data files for these assemblies and use them in downstream analysis software. The ipyrad analysis tools include convenient wrappers that make it easier to parallelize analyses of RAD-seq data. Please see the full ipyrad.analysis documentation for more details.

[6]:
import ipyrad as ip
import ipyrad.analysis as ipa
[7]:
## you can re-load assemblies at a later time from their JSON file
min4 = ip.load_json("analysis-ipyrad/min4.json")
min13 = ip.load_json("analysis-ipyrad/min13.json")
nouts = ip.load_json("analysis-ipyrad/nouts_min11.json")
loading Assembly: min4
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/min4.json
loading Assembly: min13
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/min13.json
loading Assembly: nouts_min11
from saved path: ~/Documents/ipyrad/tests/analysis-ipyrad/nouts_min11.json
RAxML – ML concatenation tree inference
[17]:
## conda install raxml -c bioconda
## conda install toytree -c eaton-lab
[12]:
## create a raxml analysis object for the min13 data sets
rax = ipa.raxml(
    name=min13.name,
    data=min13.outfiles.phy,
    workdir="analysis-raxml",
    T=20,
    N=100,
    o=[i for i in min13.samples if "prz" in i],
    )
[19]:
## print the raxml command and call it
print(rax.command)
rax.run(force=True)
raxmlHPC-PTHREADS-SSE3 -f a -T 20 -m GTRGAMMA -N 100 -x 12345 -p 54321 -n min13 -w /home/deren/Documents/ipyrad/tests/analysis-raxml -s /home/deren/Documents/ipyrad/tests/analysis-ipyrad/min13_outfiles/min13.phy -o 33588_przewalskii,32082_przewalskii
job min13 finished successfully
[13]:
## access the resulting tree files
rax.trees
[13]:
bestTree                   ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bestTree.min13
bipartitions               ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bipartitions.min13
bipartitionsBranchLabels   ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bipartitionsBranchLabels.min13
bootstrap                  ~/Documents/ipyrad/tests/analysis-raxml/RAxML_bootstrap.min13
info                       ~/Documents/ipyrad/tests/analysis-raxml/RAxML_info.min13
[15]:
## plot a tree in the notebook with toytree
import toytree
tre = toytree.tree(rax.trees.bipartitions)
tre.draw(
    width=350,
    height=400,
    node_labels=tre.get_node_values("support"),
    );
[toytree drawing of the RAxML ML tree (min13), rooted on the przewalskii outgroup, with bootstrap support values shown at the nodes]
tetrad – quartet tree inference
[16]:
## create a tetrad analysis object
tet = ipa.tetrad(
    name=min4.name,
    seqfile=min4.outfiles.snpsphy,
    mapfile=min4.outfiles.snpsmap,
    nboots=100,
    )
loading seq array [13 taxa x 174180 bp]
max unlinked SNPs per quartet (nloci): 39149
[18]:
## run tree inference
tet.run(ipyclient)
host compute node: [20 cores] on tinus
inferring 715 induced quartet trees
[####################] 100%  initial tree | 0:00:05 |
[####################] 100%  boot 100     | 0:02:23 |
[19]:
## access tree files
tet.trees
[19]:
boots   ~/Documents/ipyrad/tests/analysis-tetrad/min4.boots
cons    ~/Documents/ipyrad/tests/analysis-tetrad/min4.cons
nhx     ~/Documents/ipyrad/tests/analysis-tetrad/min4.nhx
tree    ~/Documents/ipyrad/tests/analysis-tetrad/min4.tree
[22]:
## plot results (just like above, but unrooted by default)
## the consensus tree here differs from the ML tree above.
import toytree
qtre = toytree.tree(tet.trees.nhx)
qtre.root(wildcard="prz")
qtre.draw(
    width=350,
    height=400,
    node_labels=qtre.get_node_values("support"),
    );
[toytree drawing of the tetrad consensus tree (min4), rooted on the przewalskii samples, with bootstrap support values shown at the nodes]
STRUCTURE – population cluster inference
[ ]:
## conda install structure clumpp -c ipyrad
[25]:
## create a structure analysis object for the no-outgroup data set
## NB: As of v0.9.64 you can use instead: data=data.outfiles.snps_database
struct = ipa.structure(
    name=nouts.name,
    data="analysis-ipyrad/nouts_min11_outfiles/nouts_min11.snps.hdf5",
)

## set params for analysis (should be longer in real analyses)
struct.mainparams.burnin=1000
struct.mainparams.numreps=8000
[96]:
## run structure across 10 random replicates of sampled unlinked SNPs
for kpop in [2, 3, 4, 5, 6]:
    struct.run(kpop=kpop, nreps=10, ipyclient=ipyclient)
submitted 10 structure jobs [nouts_min11-K-2]
submitted 10 structure jobs [nouts_min11-K-3]
submitted 10 structure jobs [nouts_min11-K-4]
submitted 10 structure jobs [nouts_min11-K-5]
submitted 10 structure jobs [nouts_min11-K-6]
[115]:
## wait for all of these jobs to finish
ipyclient.wait()
[115]:
True
[26]:
## collect results
tables = {}
for kpop in [2, 3, 4, 5, 6]:
    tables[kpop] = struct.get_clumpp_table(kpop)
mean scores across 10 replicates.
mean scores across 10 replicates.
mean scores across 10 replicates.
mean scores across 10 replicates.
mean scores across 10 replicates.
[27]:
## custom sorting order
myorder = [
    "41478_cyathophylloides",
    "41954_cyathophylloides",
    "29154_superba",
    "30686_cyathophylla",
    "33413_thamno",
    "30556_thamno",
    "35236_rex",
    "40578_rex",
    "35855_rex",
    "39618_rex",
    "38362_rex",
]
[33]:
## import toyplot (packaged with toytree)
import toyplot

## plot bars for each K-value (mean of 10 reps)
for kpop in [2, 3, 4, 5, 6]:
    table = tables[kpop]
    table = table.loc[myorder]

    ## plot barplot w/ hover
    canvas, axes, mark = toyplot.bars(
                            table,
                            title=[[i] for i in table.index.tolist()],
                            width=400,
                            height=200,
                            yshow=False,
                            style={"stroke": toyplot.color.near_black},
                            )
[toyplot barplots of mean STRUCTURE ancestry proportions across 10 replicates for K=2 through K=6, with samples ordered as in myorder]
TREEMIX – ML tree & admixture co-inference
[15]:
## conda install treemix -c ipyrad
[39]:
## group taxa into 'populations'
imap = {
    "prz": ["32082_przewalskii", "33588_przewalskii"],
    "cys": ["41478_cyathophylloides", "41954_cyathophylloides"],
    "cya": ["30686_cyathophylla"],
    "sup": ["29154_superba"],
    "cup": ["33413_thamno"],
    "tha": ["30556_thamno"],
    "rck": ["35236_rex"],
    "rex": ["35855_rex", "40578_rex"],
    "lip": ["39618_rex", "38362_rex"],
}

## optional: loci will be filtered if they do not have data for at
## least N samples in each species. Minimums cannot be <1.
minmap = {
    "prz": 2,
    "cys": 2,
    "cya": 1,
    "sup": 1,
    "cup": 1,
    "tha": 1,
    "rck": 1,
    "rex": 2,
    "lip": 2,
    }

## sets a random number seed
import numpy
numpy.random.seed(12349876)

## create a treemix analysis object
tmix = ipa.treemix(
    name=min13.name,
    data=min13.outfiles.snpsphy,
    mapfile=min13.outfiles.snpsmap,
    imap=imap,
    minmap=minmap,
    )

## you can set additional parameter args here
tmix.params.root = "prz"
tmix.params.global_ = 1

## print the full params
tmix.params
[39]:
binary      treemix
bootstrap   0
climb       0
cormig      0
g           (None, None)
global_     1
k           0
m           0
noss        0
root        prz
se          0
seed        737548365
[40]:
## a dictionary for storing treemix objects
tdict = {}

## iterate over values of m
for rep in range(4):
    for mig in range(4):

        ## create new treemix object copy
        name = "mig-{}-rep-{}".format(mig, rep)
        tmp = tmix.copy(name)

        ## set params on new object
        tmp.params.m = mig

        ## run treemix analysis
        tmp.run()

        ## store the treemix object
        tdict[name] = tmp
[43]:
import toyplot

## select a single result
tmp = tdict["mig-1-rep-1"]

## draw the tree similar to the Treemix plotting R code
## this code is rather new and will be expanded in the future.
canvas = toyplot.Canvas(width=350, height=350)
axes = canvas.cartesian(padding=25, margin=75)
axes = tmp.draw(axes)
[toyplot drawing of the treemix tree for mig-1-rep-1, with one admixture edge (weight ~0.17) and branch lengths in drift units]
[44]:
import toyplot
import numpy as np

## plot many results
canvas = toyplot.Canvas(width=800, height=1200)
idx = 0
for mig in range(4):
    for rep in range(4):
        tmp = tdict["mig-{}-rep-{}".format(mig, rep)]
        ax = canvas.cartesian(grid=(4, 4, idx), padding=25, margin=(25, 50, 100, 25))
        ax = tmp.draw(ax)
        idx += 1
[4x4 grid of treemix trees: m = 0-3 admixture edges (rows) by 4 replicates (columns), with admixture edge weights labeled and branch lengths in drift units]
ABBA-BABA admixture inference
[47]:
bb = ipa.baba(
    data=min4.outfiles.loci,
    newick="analysis-raxml/RAxML_bestTree.min13",
)
[55]:
## check params
bb.params
[55]:
database   None
mincov     1
nboots     1000
quiet      False
[53]:
## generate all tests from the tree where 32082 is p4
bb.generate_tests_from_tree(
    constraint_dict={
        "p4": ["32082_przewalskii"],
        "p3": ["30556_thamno"],
        }
    )
36 tests generated from tree
[54]:
## run the tests in parallel
bb.run(ipyclient=ipyclient)
[####################] 100%  calculating D-stats  | 0:00:37 |
[59]:
bb.results_table.sort_values(by="Z", ascending=False).head()
[59]:
dstat bootmean bootstd Z ABBA BABA nloci
12 0.198 0.199 0.027 7.291 593.156 396.844 10629
27 -0.208 -0.207 0.031 6.648 370.500 565.000 10365
20 0.208 0.208 0.031 6.625 565.000 370.500 10365
16 0.186 0.186 0.032 5.830 555.250 381.250 10243
26 -0.186 -0.186 0.033 5.708 381.250 555.250 10243
[63]:
## most significant result (more ABBA than BABA)
bb.tests[12]
[63]:
{'p1': ['35236_rex'],
 'p2': ['35855_rex', '40578_rex'],
 'p3': ['30556_thamno'],
 'p4': ['32082_przewalskii']}
[64]:
## the next most signif (more BABA than ABBA)
bb.tests[27]
[64]:
{'p1': ['40578_rex'],
 'p2': ['35236_rex'],
 'p3': ['30556_thamno'],
 'p4': ['32082_przewalskii']}
BPP – species tree inference/delim
[68]:
## a dictionary mapping sample names to 'species' names
imap = {
    "prz": ["32082_przewalskii", "33588_przewalskii"],
    "cys": ["41478_cyathophylloides", "41954_cyathophylloides"],
    "cya": ["30686_cyathophylla"],
    "sup": ["29154_superba"],
    "cup": ["33413_thamno"],
    "tha": ["30556_thamno"],
    "rck": ["35236_rex"],
    "rex": ["35855_rex", "40578_rex"],
    "lip": ["39618_rex", "38362_rex"],
    }

## optional: loci will be filtered if they do not have data for at
## least N samples/individuals in each species.
minmap = {
    "prz": 2,
    "cys": 2,
    "cya": 1,
    "sup": 1,
    "cup": 1,
    "tha": 1,
    "rck": 1,
    "rex": 2,
    "lip": 2,
    }

## a tree hypothesis (guidetree) (here based on tetrad results)
## for the 'species' we've collapsed samples into.
newick = "((((((rex, lip), rck), tha), cup), (cys, (cya, sup))), prz);"
[69]:
## initiate a bpp object
b = ipa.bpp(
    name=min4.name,
    data=min4.outfiles.alleles,
    imap=imap,
    minmap=minmap,
    guidetree=newick,
    )
[70]:
## set some optional params, leaving others at their defaults
## you should definitely run these longer for real analyses
b.params.burnin = 1000
b.params.nsample = 2000
b.params.sampfreq = 20

## print params
b.params
[70]:
burnin          1000
cleandata       0
delimit_alg     (0, 5)
finetune        (0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01)
infer_delimit   0
infer_sptree    0
nsample         2000
sampfreq        20
seed            12345
tauprior        (2, 2000, 1)
thetaprior      (2, 2000)
usedata         1
[71]:
## set some optional filters leaving others at their defaults
b.filters.maxloci=100
b.filters.minsnps=4

## print filters
b.filters
[71]:
maxloci   100
minmap    {'cys': 4, 'rex': 4, 'cup': 2, 'rck': 2, 'cya': 2, 'lip': 4, 'sup': 2, 'tha': 2, 'prz': 4}
minsnps   4
run BPP

You can either call ‘write_bpp_files()’ to write input files for this data set and then run BPP on those files yourself, or you can use the ‘.run()’ command to run the analyses directly, in parallel on the cluster. If you specify multiple reps then a different random sample of loci will be selected for each replicate, and a different random seed applied.

[75]:
b.write_bpp_files()
input files created for job min4 (100 loci)
[76]:
b.run()
submitted 2 bpp jobs [min4] (100 loci)
[77]:
## wait for all ipyclient jobs to finish
ipyclient.wait()
[77]:
True
[90]:
## check results
## parse the mcmc table with pandas library
import pandas as pd
btable = pd.read_csv(b.files.mcmcfiles[0], sep="\t", index_col=0)
btable.describe().T
[90]:
count mean std min 25% 50% 75% max
theta_1cup 2000.0 0.002 4.720e-04 1.110e-03 1.949e-03 0.002 0.003 0.005
theta_2cya 2000.0 0.001 4.587e-04 1.944e-04 9.056e-04 0.001 0.002 0.003
theta_3cys 2000.0 0.001 2.569e-04 7.718e-04 1.091e-03 0.001 0.001 0.002
theta_4lip 2000.0 0.002 3.075e-04 9.518e-04 1.619e-03 0.002 0.002 0.003
theta_5prz 2000.0 0.007 7.235e-04 4.556e-03 6.150e-03 0.007 0.007 0.009
theta_6rck 2000.0 0.002 4.692e-04 1.090e-03 1.860e-03 0.002 0.002 0.004
... ... ... ... ... ... ... ... ...
tau_13rexliprcktha 2000.0 0.002 2.376e-04 1.470e-03 1.854e-03 0.002 0.002 0.003
tau_14rexliprck 2000.0 0.002 2.367e-04 1.191e-03 1.781e-03 0.002 0.002 0.003
tau_15rexlip 2000.0 0.002 2.466e-04 9.497e-04 1.717e-03 0.002 0.002 0.003
tau_16cyscyasup 2000.0 0.005 4.718e-04 3.503e-03 4.881e-03 0.005 0.006 0.007
tau_17cyasup 2000.0 0.002 7.494e-04 3.230e-04 1.270e-03 0.002 0.002 0.005
lnL 2000.0 -13806.667 2.362e+01 -1.389e+04 -1.382e+04 -13806.152 -13790.543 -13733.031

26 rows × 8 columns

Success – you’re finished

Now that you have an idea of the myriad ways you can assemble your data, and of some of the downstream analysis tools, you are ready to explore RAD-seq data.

Demultiplex on i7 outer barcodes

Outer barcodes/indices (e.g., i7 barcodes) can be used to label samples for demultiplexing or, more often, to label different libraries. For example, you and a friend might each prepare a separate library for different organisms but use the same set of 48 internal barcodes. You then each add a unique outer barcode to your samples so that the libraries can be pooled for sequencing. The resulting data would look like the read below, where the i7 index for the library appears in the read header and the internal barcode at the start of the sequence.


@NB551405:60:H7T2GAFXY:1:11101:6611:1038 1:N:0: CGAACTGT+NNNNNCNA
CCGAATTCCTATCGGAAACGGATCAATAGCCCAATTGAAAATCAAGATAAGTTGAGGAAGACCAAGTCTGAAGAATTATCAAAT
+
AAAAA#EEEAEEEEE/EEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEE6EEEEA/EEEAAEEEEEEEAEEEEAEEEEAEE

i7 indices

Even if there are few samples to be demultiplexed (e.g., here we have just two, one for your library and one for your friend’s) you still generally need to use multiple barcodes to create sufficient base diversity for Illumina sequencing. In this case we have 12 i7 barcodes to indicate two different libraries, so our barcodes file for i7 demultiplexing looks like the following:

[2]:
cat Amaranthus_COL_3RAD_run2_i7_barcodes.txt
[2]:
library1    CATCATCAT
library1    CCAGTGATA
library2    TGGCCTAGT
library2    GCCAGTATA
demultiplex on i7 indices
[ ]:
# demux outer i7s to plate1 and plate2
outer1 = ip.Assembly("demux_run2_i7s")
outer1.params.project_dir = "/moto/eaton/projects/PAG3"
outer1.params.raw_fastq_path = "/moto/eaton/projects/RAW_DATA/Amaranthus/Amaranthus_COL_3RAD_run2_R*.gz"
outer1.params.barcodes_path = "/moto/eaton/projects/RAW_DATA/Amaranthus/Amaranthus_COL_3RAD_run2_i7_barcodes.txt"
outer1.params.datatype = 'pairddrad'

# important: set hackers params to demux on i7
outer1.hackersonly.demultiplex_on_i7_tags = True
outer1.hackersonly.merge_technical_replicates = True

# run step 1 to demux
outer1.run("1", ipyclient=ipyclient, force=True)

API: ipyrad analysis tools

The ipyrad-analysis toolkit is a Python interface for taking the output files produced in an ipyrad assembly and running a suite of evolutionary analysis tools, with convenient features for filtering missing data, grouping individuals into populations, dropping samples, and more.

All of these tools share a common syntax, making them easy to use without having to create different input files or learn new file formats. They are designed for use within Jupyter notebooks, a tool for reproducible science. See the examples below.

# the analysis tools are a subpackage of ipyrad
import ipyrad as ipa

# a large suite of tools are available
tool = ipa.structure(data="./outfiles/data.snps.hdf5")

# all tools share a common syntax for setting params
# and distributing work in parallel.
tool.run()

ipyrad-analysis toolkit: vcf_to_hdf5

View as notebook

Many variant calling tools write SNP calls in the VCF format (variant call format). This is a plain text file that stores variant calls relative to a reference genome in tabular format. It includes a lot of additional information about the quality of SNP calls, etc., but is not very easy to read or efficient to parse. To make analyses run a bit faster ipyrad uses a simplified format to store this information in the form of an HDF5 database. You can easily convert any VCF file to this HDF5 format using the ipa.vcf_to_hdf5() tool.

This tool has the added benefit of allowing you to enter an (optional) ld_block_size argument when creating the file, which stores information that can be used downstream by many other tools to subsample SNPs and perform bootstrap resampling in a way that reduces the effects of linkage among SNPs. If your data are assembled RAD data then the ld_block_size is not required, since we can simply use RAD loci as the linkage blocks. But if you want to combine reference-mapped RAD loci located nearby in the genome into the same linkage block then you can enter a value such as 50,000 to create 50Kb linkage blocks that join many RAD loci together and sample only one SNP per block in each bootstrap replicate. If your data are not RAD data, e.g., whole genome data, then the ld_block_size argument is required in order to encode linkage information as discrete blocks in your database. A minimal sketch is shown below.
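As a minimal sketch (the file name data.vcf.gz and the database name mydata_LD50K are placeholders), a conversion using 50Kb linkage blocks would look like the cell below; a full worked example with real data follows later on this page.

# convert a bgzipped VCF to the ipyrad HDF5 format, encoding 50Kb linkage blocks
import ipyrad.analysis as ipa

converter = ipa.vcf_to_hdf5(
    name="mydata_LD50K",       # placeholder name for the output database
    data="./data.vcf.gz",      # placeholder path to your (cleaned) VCF file
    ld_block_size=50000,
)
converter.run()
# writes ./analysis-vcf2hdf5/mydata_LD50K.snps.hdf5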

Required software

If you are converting a VCF file produced by some other tool (e.g., GATK, freebayes, etc.) then you will need to install the htslib and bcftools software and use them as described below.

[1]:
# conda install ipyrad -c bioconda
# conda install htslib -c bioconda
# conda install bcftools -c bioconda
[2]:
import ipyrad.analysis as ipa
import pandas as pd
Pre-filter data from other programs (e.g., FreeBayes, GATK)

You can use the program bcftools to pre-filter your data to exclude indels and low quality SNPs. If you ran the conda install commands above then you will have all of the required tools installed. To achieve the format that ipyrad expects you will need to exclude indel-containing variants (this may change in the future). Further quality filtering is optional.

The example below reduced the size of a VCF data file from 29Gb to 80Mb! VCF contains a lot of information that you do not need to retain through all of your analyses. We will keep only the final genotype calls.

Note that the code below is a bash script. You can run it from a terminal, or in a jupyter notebook by adding the %%bash header as shown below.

[ ]:
%%bash

# compress the VCF file if not already done (creates .vcf.gz)
bgzip data.vcf

# tabix index the compressed VCF (creates .vcf.gz.tbi)
tabix data.vcf.gz

# remove multi-allelic SNPs and INDELs and PIPE to next command
bcftools view -m2 -M2 -i'CIGAR="1X" & QUAL>30' data.vcf.gz -Ou |

    # remove extra annotations/formatting info and save to new .vcf
    bcftools annotate -x FORMAT,INFO  > data.cleaned.vcf

# recompress the final file (create .vcf.gz)
bgzip data.cleaned.vcf
A peek at the cleaned VCF file
[3]:
# load the VCF as a pandas dataframe in chunks
dfchunks = pd.read_csv(
    "/home/deren/Documents/ipyrad/sandbox/Macaque-Chr1.clean.vcf.gz",
    sep="\t",
    skiprows=1000,
    chunksize=1000,
)

# show first few rows of first dataframe chunk
next(dfchunks).head()
[3]:
NC_018152.2 51273 . G A 280.482 ..1 ..2 GT 0/0 ... 0/0.9 0/0.10 0/0.11 0/0.12 0/0.13 0/0.14 0/0.15 0/0.16 0/0.17 0/1.1
0 NC_018152.2 51292 . A G 16750.300 . . GT 1/1 ... 1/1 . 1/1 1/1 1/1 1/1 0/0 1/1 1/1 1/1
1 NC_018152.2 51349 . A G 628.563 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
2 NC_018152.2 51351 . C T 943.353 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
3 NC_018152.2 51352 . G A 607.681 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
4 NC_018152.2 51398 . C T 510.120 . . GT 0/0 ... 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0

5 rows × 29 columns

Converting clean VCF to HDF5

Here I am using a VCF file of whole genome data for 20 monkeys from an unpublished study (in progress). It contains >6M SNPs, all from chromosome 1. Because many SNPs are close together and thus tightly linked, we will likely wish to take linkage into account in our downstream analyses.

The ipyrad analysis tools can do this by encoding linkage block information into the HDF5 file. Here we use an ld_block_size of 20 Kb, which breaks the single scaffold (chromosome) into about 10K linkage blocks. See the example below of this information being used in an ipyrad PCA analysis.

[4]:
# init a conversion tool
converter = ipa.vcf_to_hdf5(
    name="Macaque_LD20K",
    data="/home/deren/Documents/ipyrad/sandbox/Macaque-Chr1.clean.vcf.gz",
    ld_block_size=20000,
)

# run the converter
converter.run()
Indexing VCF to HDF5 database file
VCF: 6094152 SNPs; 1 scaffolds
[####################] 100% 0:02:22 | converting VCF to HDF5
HDF5: 6094152 SNPs; 10845 linkage group
SNP database written to ./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5
Downstream analyses

The data file now contains 6M SNPs across 20 samples in about 10K linkage blocks. By default the PCA tool subsamples a single SNP per linkage block. To explore variation over multiple random subsamplings we can use the nreplicates argument.

[5]:
# init a PCA tool and filter to allow no missing data
pca = ipa.pca(
    data="./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5",
    mincov=1.0,
)
Samples: 20
Sites before filtering: 6094152
Filtered (indels): 0
Filtered (bi-allel): 0
Filtered (mincov): 794597
Filtered (minmap): 0
Filtered (combined): 794597
Sites after filtering: 5299555
Sites containing missing values: 0 (0.00%)
Missing values in SNP matrix: 0 (0.00%)
Run a single PCA analysis from subsampled unlinked SNPs
[6]:
pca.run_and_plot_2D(0, 1, seed=123);
Subsampling SNPs: 10841/5299555
[toyplot PCA scatterplot of the 20 samples: PC0 (25.4% variance explained) vs PC1 (14.4% variance explained)]
Run multiple PCAs over replicates of subsampled SNPs

Here you can see the results when a different set of ~10K unlinked SNPs is sampled in each replicate iteration. If the signal in the data is robust then we should expect the points to cluster in a similar place across replicates. Internally ipyrad will rotate axes to ensure the replicate plots align despite axis swapping (which is arbitrary in PCA space). You can see this provides a better view of the uncertainty in our estimates than the plot above (and it looks cool!)

[7]:
pca.run_and_plot_2D(0, 1, seed=123, nreplicates=25);
Subsampling SNPs: 10841/5299555
[toyplot PCA scatterplot of the 20 samples showing point clouds from 25 replicate SNP subsamples: PC0 vs PC1]

More details on running PCAs, toggling options, and styling plots can be found in our ipyrad.analysis PCA tutorial

ipyrad-analysis toolkit: treemix

View as notebook

The program TreeMix by Pickrell & Pritchard (2012) is used to infer population splits and admixture from allele frequency data. From the TreeMix documentation: “In the underlying model, the modern-day populations in a species are related to a common ancestor via a graph of ancestral populations. We use the allele frequencies in the modern populations to infer the structure of this graph.”

Required software

A minor detail of the treemix conda installation is that it installs a bunch of junk alongside it, including R and openblas, into your conda environment. It’s a ton of bloat that in my experience has a high chance of causing incompatibilities with other packages in your conda installation eventually. For this reason I recommend installing it into a separate conda environment.

The installation instructions below can be used to set up a new environment called treemix that you can use for this analysis. This will prevent the installation from conflicting with any of your other software. When you are done you can switch back to your default environment.

[ ]:
# conda init
# conda create -n treemix
# conda activate treemix
# conda install treemix -c bioconda -c conda-forge
# conda install ipyrad -c bioconda -c conda-forge
# conda install jupyter -c conda-forge
# conda install toytree -c eaton-lab -c conda-forge
# jupyter-notebook
[2]:
import ipyrad.analysis as ipa
import toytree
import toyplot
[3]:
print('ipyrad', ipa.__version__)
print('toytree', toytree.__version__)
! treemix --version | grep 'TreeMix v. '
ipyrad 0.9.15
toytree 0.2.3
TreeMix v. 1.13
Required input data files

Your input data should be a .snps.hdf5 database file produced by ipyrad. If you do not have this you can generate it from any VCF file following the vcf2hdf5 tool tutorial. The database file contains the genotype calls as well as linkage information that is used for subsampling unlinked SNPs and bootstrap resampling.

[4]:
# the path to your HDF5 formatted snps file
data = "/home/deren/Downloads/ref_pop2.snps.hdf5"
Short Tutorial:

If you entered population information during data assembly then you may have already produced a .treemix.gz output file that can be used as input to the treemix command line program. Alternatively, you can run treemix using the ipyrad tool shown here, which offers some additional flexibility for filtering SNP data and for running treemix programmatically over many parameter settings.

The key features offered by ipa.treemix include:

  1. Filter unlinked SNPs (1 per locus) many times for replicate analyses.
  2. Filter by sample or population coverage.
  3. Plotting functions.
  4. Easy to write for-loops over parameter settings.
[5]:
# group individuals into populations
imap = {
    "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
    "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
    "sagr": ["CUVN10", "CUCA4", "CUSV6", "CUMM5"],
    "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
}

# minimum % of samples that must be present in each SNP from each group
minmap = {i: 0.5 for i in imap}
[6]:
# init a treemix analysis object with some param arguments
tmx = ipa.treemix(
    data=data,
    imap=imap,
    minmap=minmap,
    seed=123456,
    root="bran,fusi",
    m=2,
)
Samples: 28
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13077
Filtered (mincov): 0
Filtered (minmap): 109383
Filtered (combined): 117718
Sites after filtering: 232196
Sites containing missing values: 221834 (95.54%)
Missing values in SNP matrix: 822320 (12.65%)
subsampled 29923 unlinked SNPs
[7]:
# print the command string that will be called and run it
print(tmx.command)
tmx.run()
treemix -i /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-treemix/test.treemix.in.gz -o /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-treemix/test -m 2 -seed 123456 -root bran,fusi
[8]:
# draw the resulting tree
tmx.draw_tree();
[toyplot drawing of the treemix tree for the seven populations, with branch lengths in drift units]
[9]:
# draw the covariance matrix
tmx.draw_cov();
[toyplot heatmap of the treemix covariance matrix among the seven populations]
1. Finding the best value for m

As with STRUCTURE plots there is no true best value, but you can use model selection methods to decide whether one model is a statistically better fit to your data than another. Adding additional admixture edges will always improve the likelihood score, but with diminishing returns as you add edges that explain little variation in the data. You can look at the log-likelihood score of each model fit by running a for-loop like the one below. You may want to run this within another for-loop that iterates over different subsampled SNPs.

[10]:
# init a treemix analysis object with some param arguments
tmx = ipa.treemix(
    data=data,
    imap=imap,
    minmap=minmap,
    seed=1234,
    root="bran,fusi",
)
Samples: 28
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13077
Filtered (mincov): 0
Filtered (minmap): 109383
Filtered (combined): 117718
Sites after filtering: 232196
Sites containing missing values: 221834 (95.54%)
Missing values in SNP matrix: 822320 (12.65%)
subsampled 29923 unlinked SNPs
[11]:
tests = {}
nadmix = [0, 1, 2, 3, 4, 5]

# iterate over n admixture edges and store results in a dictionary
for adm in nadmix:
    tmx.params.m = adm
    tmx.run()
    tests[adm] = tmx.results.llik
[12]:
# plot the likelihood for different values of m
toyplot.plot(
    nadmix,
    [tests[i] for i in nadmix],
    width=350,
    height=275,
    stroke_width=3,
    xlabel="n admixture edges",
    ylabel="ln(likelihood)",
);
[figure: ln(likelihood) vs. n admixture edges]
2. Iterate over different subsamples of SNPs

The treemix tool randomly subsamples 1 SNP per locus to reduce the effect of linkage on the results. However, depending on the size of your data set, and the strength of the signal, subsampling may yield slightly different results in different iterations. You can check over different subsampled iterations by re-initing the treemix tool with a different (or no) random seed. Below I plot the results of 9 iterations for m=2. I also use the global_=True option here which performs a more thorough (but slower) search.

[13]:
# a gridded canvas to plot trees on
canvas = toyplot.Canvas(width=600, height=700)

# iterate over multiple sets of SNPs
for i in range(9):

    # init a treemix analysis object with a random (no) seed
    tmx = ipa.treemix(
        data=data,
        imap=imap,
        minmap=minmap,
        root="bran,fusi",
        global_=True,
        m=2,
        quiet=True
    )

    # run model fit
    tmx.run()

    # select a plot grid axis and add tree to axes
    axes = canvas.cartesian(grid=(3, 3, i))
    tmx.draw_tree(axes)
[figure: 3x3 grid of treemix trees, m=2]
[14]:
# a gridded canvas to plot trees on
canvas = toyplot.Canvas(width=600, height=700)

# iterate over multiple sets of SNPs
for i in range(9):

    # init a treemix analysis object with a random (no) seed
    tmx = ipa.treemix(
        data=data,
        imap=imap,
        minmap=minmap,
        root="bran,fusi",
        global_=True,
        m=3,
        quiet=True
    )

    # run model fit
    tmx.run()

    # create a grid axis and add tree to axes
    axes = canvas.cartesian(grid=(3, 3, i))
    tmx.draw_tree(axes)
[figure: 3x3 grid of treemix trees, m=3]
3. Save the plot to pdf
[15]:
import toyplot.pdf
toyplot.pdf.render(canvas, "treemix-m3.pdf")

ipyrad-analysis toolkit: PCA and other dimensionality reduction

View as notebook

Principal component analysis is a dimensionality reduction method used to transform and project data points onto fewer orthogonal axes that explain the greatest amount of variance in the data. While there are many tools available to implement PCA, the ipyrad tool includes a number of options specifically for dealing with missing data, to which PCA analyses are very sensitive. The ipyrad.pca tool makes it easy to perform PCA on RAD-seq data by filtering and/or imputing missing data, and by allowing for easy subsampling of individuals to include in analyses.

Required software
[1]:
# conda install ipyrad -c bioconda
# conda install scikit-learn -c bioconda
# conda install toyplot -c eaton-lab
[2]:
import ipyrad.analysis as ipa
import pandas as pd
import toyplot
Required input data files

Your input data should be a .snps.hdf5 database file produced by ipyrad. If you do not have this you can generate it from any VCF file following the vcf2hdf5 tool tutorial. The database file contains the genotype call information as well as the linkage information used for subsampling unlinked SNPs and for bootstrap resampling.

[3]:
# the path to your .snps.hdf5 database file
data = "/home/deren/Downloads/ref_pop2.snps.hdf5"
#data = "/home/deren/Downloads/denovo-min50.snps.hdf5"
Input data file and population assignments

Population assignments (imap dictionary) are optional, but can be used in a number of ways by the pca tool. First, you can filter your data to require at least N coverage in each population. Second, you can use the frequency of genotypes within populations to impute missing data for other samples. Finally, population assignments can be used to color points when plotting your results. You can assign individual samples to populations using an imap dictionary like below. We also create a minmap dictionary stating that we want to require 50% coverage in each population.

[4]:
# group individuals into populations
imap = {
    "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
    "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
    "sagr": ["CUVN10", "CUCA4", "CUSV6"],
    "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
}

# require that 50% of samples have data in each group
minmap = {i: 0.5 for i in imap}
[5]:
# ipa.snps_extracter(data).names
Enter data file and params

The pca analysis object takes input data as the .snps.hdf5 file produced by ipyrad. All other parameters are optional. The imap dictionary groups individuals into populations and minmap can be used to filter SNPs to only include those that have data for at least some proportion of samples in every group. The mincov option works similarly: it filters SNPs that are present in less than some proportion of all samples (in contrast to minmap, this does not use the imap groupings).

When you init the object it will load the data and apply filtering. The printed output tells you how many SNPs were removed by each filter and the remaining amount of missing data after filtering. These remaining missing values are the ones that will be filled by imputation. The options for imputing data are listed further down in this tutorial. Here we are using the “sample” method, which I generally recommend.

[6]:
# init pca object with input data and (optional) parameter options
pca = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.75,
    impute_method="sample",
)
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 110150
Filtered (minmap): 112898
Filtered (combined): 138697
Sites after filtering: 211217
Sites containing missing values: 183722 (86.98%)
Missing values in SNP matrix: 501031 (8.79%)
Imputation: 'sampled'; (0, 1, 2) = 77.1%, 10.7%, 12.2%
Run PCA

Call .run() to generate the PC axes and the variance explained by each axis. The results are stored in your analysis object as dictionaries under the attributes .pcaxes and .variances. Feel free to take these data and plot them using any method you prefer. The code cell below shows how to save the data to a CSV file, and also how to view the PC data as a table.

[6]:
# run the PCA analysis
pca.run()
Subsampling SNPs: 28369/211217
[7]:
# store the PC axes as a dataframe
df = pd.DataFrame(pca.pcaxes[0], index=pca.names)

# write the PC axes to a CSV file
df.to_csv("pca_analysis.csv")

# show the first ten samples and the first 10 PC axes
df.iloc[:10, :10].round(2)
[7]:
0 1 2 3 4 5 6 7 8 9
BJSB3 45.26 52.07 -10.97 -35.64 2.39 -0.39 3.12 0.14 2.59 0.07
BJSL25 43.05 48.69 -9.43 -30.46 1.44 0.72 2.19 -0.17 1.07 0.63
BJVL19 43.01 48.83 -10.85 -31.74 2.47 0.47 2.57 -0.45 2.36 -0.62
BZBB1 39.00 -48.77 4.83 0.08 10.37 -22.45 2.96 -2.56 3.47 -0.63
CRL0030 39.69 -49.19 3.03 -0.23 8.56 -13.14 0.13 1.22 2.31 0.07
CUCA4 12.60 -34.85 -3.61 -4.60 -21.33 40.63 -0.54 15.26 0.97 -4.93
CUSV6 8.41 -33.68 -3.85 -5.30 -21.44 42.09 3.06 -18.19 0.33 8.01
CUVN10 13.45 -35.30 -1.01 -3.65 -14.59 27.80 1.94 1.44 -3.39 -7.37
FLAB109 -31.42 -0.51 -20.02 -3.05 -25.75 -16.48 -2.72 -3.17 1.67 -1.13
FLBA140 -30.57 2.87 25.80 -8.74 3.68 0.65 -0.59 3.00 -0.39 0.15
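
Similarly, the proportion of variance explained by each axis is available from the .variances attribute; a minimal sketch, assuming it is a dictionary keyed by replicate index like .pcaxes:

[ ]:
# variance explained for the first ten PC axes of replicate 0
# (assumes .variances is keyed like .pcaxes, as described above)
pd.Series(pca.variances[0][:10]).round(3)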
Run PCA and plot results.

When you call .run() a PCA model is fit to the data and two results are generated: (1) the sample weightings on the component axes; (2) the proportion of variance explained by each axis. For convenience we have developed a plotting function that can be called as .draw() to plot these results (generated with toyplot: https://toyplot.rtfd.io). The first two arguments to this function are the two axes to be plotted. By default this plotting function will use the imap information to color points and create a legend.

[8]:
# plot PC axes 0 and 2
pca.draw(0, 2);
[figure: PCA scatterplot, PC0 (15.0% explained) vs. PC2 (6.7% explained), points colored by population]
Subsampling SNPs

By default run() will randomly subsample one SNP per RAD locus to reduce the effect of linkage on your results. This can be turned off by setting subsample=False, as in the plot below. When using subsampling you can set the random seed to make your results repeatable. The run above subsampled ~28K of the ~211K total SNPs (see the printed output), yet the result is quite similar to the unsubsampled analysis below.

[9]:
# plot PC axes 0 and 2 with no subsampling
pca.run(subsample=False)
pca.draw(0, 2);
[figure: PCA scatterplot, PC0 (13.8% explained) vs. PC2 (6.6% explained), no subsampling]
Subsampling with replication

Subsampling unlinked SNPs is generally a good idea for PCA analyses since you want to remove the effects of linkage from your data. It also presents a convenient way to explore the confidence in your results. By using the option nreplicates you can run many replicate analyses that subsample a different random set of unlinked SNPs each time. The replicate results are drawn with a lower opacity and the centroid of all the points for each sample is plotted as a black point. You can hover over the points with your cursor to see the sample names pop up.

[10]:
# plot PC axes 0 and 2 with many replicate subsamples
pca.run(nreplicates=25, seed=12345)
pca.draw(0, 2);
Subsampling SNPs: 28369/211217
[figure: PCA scatterplot with 25 replicate subsamples, PC0 (14.8% explained) vs. PC2 (6.7% explained)]

Advanced: Imputation algorithms:

We offer three algorithms for imputing missing data:

  1. sample: Randomly sample genotypes based on the frequency of alleles within (user-defined) populations (imap).
  2. kmeans: Randomly sample genotypes based on the frequency of alleles in (kmeans cluster-generated) populations.
  3. None: All missing values are imputed with zeros (ancestral allele).
No imputation (None)

The None option will almost always be a bad choice when there is any reasonable amount of missing data. Missing values will all be filled as zeros (ancestral allele) – this is what many other PCA tools do as well. I show it here for comparison to the imputed results, which are better. The two points near the top of the plot are samples with the most missing data that are erroneously grouped together. The rest of the samples also form much less clear clusters than in the other examples where we use imputation or stricter filtering options.

[11]:
# init pca object with input data and (optional) parameter options
pca1 = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.25,
    impute_method=None,
)

# run and draw results for impute_method=None
pca1.run(nreplicates=25, seed=123)
pca1.draw(0, 2);
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 9517
Filtered (minmap): 112898
Filtered (combined): 121048
Sites after filtering: 228866
Sites containing missing values: 201371 (87.99%)
Missing values in SNP matrix: 640419 (10.36%)
Imputation (null; sets to 0): 100.0%, 0.0%, 0.0%
Subsampling SNPs: 29695/228866
[figure: PCA scatterplot with no imputation, PC0 (12.0% explained) vs. PC2 (5.1% explained)]
No imputation but stricter filtering (mincov)

Here I do not allow for any missing data (mincov=1.0). You can see that this reduces the total number of SNPs from ~350K to ~27K (see the printed output below). The final result is not too different from our first example, but seems a little less smooth. In most data sets it is probably better to include more data by imputing some values, though. Many data sets may not have as many SNPs without missing data as this one.

[12]:
# init pca object with input data and (optional) parameter options
pca2 = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=1.0,
    impute_method=None,
)

# run and draw results for impute_method=None and mincov=1.0
pca2.run(nreplicates=25, seed=123)
pca2.draw(0, 2);
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 321628
Filtered (minmap): 112898
Filtered (combined): 322419
Sites after filtering: 27495
Sites containing missing values: 0 (0.00%)
Missing values in SNP matrix: 0 (0.00%)
Subsampling SNPs: 6675/27495
[figure: PCA scatterplot, PC0 vs. PC2]
Kmeans imputation (integer)

The kmeans clustering method allows imputing values based on population allele frequencies (like the sample method) but without having to a priori assign individuals to populations. In other words, it is meant to reduce the bias introduced by assigning individuals yourself. Instead, this method uses kmeans clustering to group individuals into “populations” and then imputes values based on those population assignments. This is accomplished through iterative clustering, starting by using only SNPs that are present across 90% of all samples (this can be changed with the topcov param) and then allowing more missing data in each iteration until it reaches the mincov parameter value.

This method works especially well if you have a lot of missing data and fear that user-defined population assignments will bias your results. Here it gives very similar results to our first plots using the “sample” impute method, suggesting that our population assignments are not greatly biasing the results. To use K=7 clusters you simply enter impute_method=7.

[13]:
# kmeans imputation
pca3 = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.5,
    impute_method=7,
)

# run and draw results for kmeans clustering into 7 groups
pca3.run(nreplicates=25, seed=123)
pca3.draw(0, 2);
Kmeans clustering: iter=0, K=7, mincov=0.9, minmap={'global': 0.5}
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 222081
Filtered (minmap): 29740
Filtered (combined): 225958
Sites after filtering: 123956
Sites containing missing values: 96461 (77.82%)
Missing values in SNP matrix: 142937 (4.27%)
Imputation: 'sampled'; (0, 1, 2) = 76.7%, 15.0%, 8.3%
{0: ['FLCK216', 'FLSA185'], 1: ['BJSB3', 'BJSL25', 'BJVL19'], 2: ['FLAB109', 'FLCK18', 'FLMO62', 'FLSF47', 'FLSF54', 'FLWO6'], 3: ['BZBB1', 'CRL0030', 'CUVN10', 'HNDA09', 'MXSA3017'], 4: ['MXED8', 'MXGT4', 'TXGR3', 'TXMD3'], 5: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 6: ['CUCA4', 'CUSV6']}

Kmeans clustering: iter=1, K=7, mincov=0.8, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 131220
Filtered (minmap): 111129
Filtered (combined): 150798
Sites after filtering: 199116
Sites containing missing values: 171621 (86.19%)
Missing values in SNP matrix: 427659 (7.95%)
Imputation: 'sampled'; (0, 1, 2) = 77.6%, 10.0%, 12.5%
{0: ['FLAB109', 'FLCK18', 'FLMO62', 'FLSF47', 'FLSF54', 'FLWO6'], 1: ['BZBB1', 'CRL0030', 'CUVN10', 'HNDA09', 'MXSA3017'], 2: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 3: ['MXED8', 'MXGT4', 'TXGR3', 'TXMD3'], 4: ['BJSB3', 'BJSL25', 'BJVL19'], 5: ['FLCK216', 'FLSA185'], 6: ['CUCA4', 'CUSV6']}

Kmeans clustering: iter=2, K=7, mincov=0.7, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 76675
Filtered (minmap): 111129
Filtered (combined): 124159
Sites after filtering: 225755
Sites containing missing values: 198260 (87.82%)
Missing values in SNP matrix: 606805 (9.96%)
Imputation: 'sampled'; (0, 1, 2) = 77.4%, 10.1%, 12.5%
{0: ['FLCK216', 'FLSA185'], 1: ['BJSB3', 'BJSL25', 'BJVL19'], 2: ['FLAB109', 'FLCK18', 'FLMO62', 'FLSF47', 'FLSF54', 'FLWO6'], 3: ['BZBB1', 'CRL0030', 'CUVN10', 'HNDA09', 'MXSA3017'], 4: ['MXED8', 'MXGT4', 'TXGR3', 'TXMD3'], 5: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 6: ['CUCA4', 'CUSV6']}

Kmeans clustering: iter=3, K=7, mincov=0.6, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 52105
Filtered (minmap): 111129
Filtered (combined): 119932
Sites after filtering: 229982
Sites containing missing values: 202487 (88.04%)
Missing values in SNP matrix: 646076 (10.40%)
Imputation: 'sampled'; (0, 1, 2) = 77.3%, 10.1%, 12.5%
{0: ['FLBA140', 'FLSF33', 'LALC2', 'SCCU3', 'TXWV2'], 1: ['BJSB3', 'BJSL25', 'BJVL19'], 2: ['BZBB1', 'CRL0030', 'HNDA09', 'MXSA3017'], 3: ['FLCK216', 'FLSA185'], 4: ['FLAB109', 'FLCK18', 'FLMO62', 'FLSF47', 'FLSF54', 'FLWO6'], 5: ['MXED8', 'MXGT4', 'TXGR3', 'TXMD3'], 6: ['CUCA4', 'CUSV6', 'CUVN10']}

Kmeans clustering: iter=4, K=7, mincov=0.5, minmap={0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 29740
Filtered (minmap): 115494
Filtered (combined): 123595
Sites after filtering: 226319
Sites containing missing values: 198824 (87.85%)
Missing values in SNP matrix: 627039 (10.26%)
Imputation: 'sampled'; (0, 1, 2) = 77.2%, 10.4%, 12.4%
Subsampling SNPs: 29415/226319
[figure: PCA scatterplot with kmeans imputation, PC0 (14.8% explained) vs. PC2 (6.9% explained)]
Save plot to PDF

You can save the figure as a PDF or SVG automatically by passing an outfile argument to the .draw() function.

[14]:
# The outfile must end in either `.pdf` or `.svg`
pca.draw(outfile="mypca.pdf")
Advanced: Missing data per sample

You can view the proportion of missing data per sample by accessing the .missing data table from your pca analysis object. You can see that most samples in this data set had 10% missing data or less, but a few had 20-50% missing data. You can hover your cursor over the plot above to see the sample names. It seems pretty clear that samples with large amounts of missing data do not stand out as outliers in these plots like they did in the no-imputation plot. Which is great!

[15]:
# .missing is a pandas DataFrame
pca3.missing.sort_values(by="missing")
[15]:
missing
BJSL25 0.03
BJVL19 0.03
FLBA140 0.03
CRL0030 0.04
LALC2 0.04
FLSF54 0.04
CUVN10 0.06
FLAB109 0.06
MXGT4 0.07
MXED8 0.08
CUSV6 0.08
HNDA09 0.08
FLSF33 0.08
BJSB3 0.09
FLSF47 0.09
MXSA3017 0.09
FLMO62 0.10
TXMD3 0.10
FLWO6 0.11
FLCK18 0.11
TXGR3 0.11
BZBB1 0.11
FLCK216 0.11
FLSA185 0.13
CUCA4 0.14
SCCU3 0.23
TXWV2 0.55
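
If you prefer a visual summary, here is a minimal sketch that draws the same values as a toyplot barplot, assuming the .missing attribute is the pandas DataFrame shown above:

[ ]:
# barplot of per-sample missingness using the .missing table
miss = pca3.missing.sort_values(by="missing")
canvas = toyplot.Canvas(width=500, height=250)
axes = canvas.cartesian(bounds=("10%", "90%", "10%", "80%"))
axes.bars(miss.missing.values)

# label the x-axis with sample names
axes.x.ticks.locator = toyplot.locator.Explicit(labels=miss.index.tolist())
axes.x.ticks.labels.angle = -60
axes.x.ticks.show = True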
Advanced: TSNE and other dimensionality reduction methods

While PCA plots are very informative, it is sometimes difficult to visualize just how well separated all of your samples are since the results span many dimensions. A popular tool to further examine the separation of samples is t-distributed stochastic neighbor embedding (TSNE). We’ve implemented this in the pca tool as well: it first decomposes the data using PCA and then runs TSNE on the PC axes. The results will vary depending on the parameters and random seed, so you cannot plot replicate runs using this method, and it is important to explore parameter values to find settings that work well.

[16]:
pca.run_tsne(subsample=True, perplexity=4.0, n_iter=100000, seed=123)
Subsampling SNPs: 28369/211217
[17]:
pca.draw();
[figure: TSNE embedding, component 1 vs. component 2]
Advanced: UMAP dimensionality reduction

From the UMAP docs: “low values of n_neighbors will force UMAP to concentrate on very local structure (potentially to the detriment of the big picture), while large values will push UMAP to look at larger neighborhoods of each point when estimating the manifold structure of the data”

The min_dist parameter controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low dimensional representation.

[33]:
pca.run_umap(subsample=False, n_neighbors=12, min_dist=0.1)
[34]:
pca.draw();
[figure: UMAP embedding, component 1 vs. component 2]

ipyrad-analysis toolkit: RAxML

RAxML is the most popular tool for inferring phylogenetic trees using maximum likelihood. It is fast even for very large data sets. The documentation for raxml is huge and there are many options, but I tend to use the same small number of options very frequently, which motivated me to write the ipa.raxml() tool to automate the process of generating raxml command line strings, running them, and accessing the resulting tree files. The simplicity of this tool makes it easy to incorporate into other more complex tools, for example, to infer trees in sliding windows along the genome using the ipa.treeslider tool.

Required software
[1]:
# conda install ipyrad -c bioconda
# conda install raxml -c bioconda
# conda install toytree -c eaton-lab
[2]:
import ipyrad.analysis as ipa
import toytree
Short Tutorial:

The raxml tool takes a phylip formatted file as input. In addition you can set a number of analysis options either when you init the tool, or afterwards by accessing the .params dictionary. You can view the raxml command string that is generated from the input arguments and you can call .run() to start the tree inference. This example takes about 3-5 minutes to run on my laptop for a data set with 13 samples and ~1.2M SNPs and about 14% missing data.

[6]:
# the path to your phylip formatted output file
phyfile = "../min10_outfiles/min10.phy"
[7]:
# init raxml object with input data and (optional) parameter options
rax = ipa.raxml(data=phyfile, T=4, N=10)

# print the raxml command string for posterity
print(rax.command)

# run the command, (options: block until finishes; overwrite existing)
rax.run(block=True, force=True)
raxmlHPC-PTHREADS-SSE3 -f a -T 4 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml -s /home/deren/Documents/ipyrad/tests/pedicularis/data10_outfiles/data10.phy -p 54321 -N 10 -x 12345
job test finished successfully
Draw the inferred tree

After inferring a tree you can then visualize it in a notebook using toytree.

[8]:
# (optional) draw your tree in the notebook
import toytree

# load from the .trees attribute of the raxml object, or from the saved tree file
tre = toytree.tree(rax.trees.bipartitions)

# draw the tree
rtre = tre.root(wildcard="prz")
rtre.draw(tip_labels_align=True, node_labels="support");
[figure: RAxML tree with bootstrap support values]
Setting parameters

By default several parameters are pre-set in the raxml object. To remove those parameters from the command string you can set them to None. Additionally, you can build complex raxml command line strings by adding almost any parameter to the raxml object init, like below. You probably can’t do everything in raxml using this tool; it’s only meant as a convenience. You can always of course just write the raxml command line string by hand instead.

[23]:
# init raxml object
rax = ipa.raxml(data=phyfile, T=4, N=10)

# parameter dictionary for a raxml object
rax.params
[23]:
N        10
T        4
binary   raxmlHPC-PTHREADS-SSE3
f        a
m        GTRGAMMA
n        test
p        54321
s        ~/Documents/ipyrad/tests/pedicularis/data10_outfiles/data10.phy
w        ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml
x        12345
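
As a sketch of modifying this object, you can change or remove parameters after init by assigning to the .params attribute (setting a value to None removes it from the command string, as described above):

[ ]:
# increase the number of searches and drop the rapid-bootstrap seed (-x)
rax.params.N = 100
rax.params.x = None
print(rax.command)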
[24]:
# paths to output files produced by raxml inference
rax.trees
[24]:
bestTree                   ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bestTree.test
bipartitions               ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bipartitions.test
bipartitionsBranchLabels   ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bipartitionsBranchLabels.test
bootstrap                  ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_bootstrap.test
info                       ~/Documents/ipyrad/newdocs/cookbook/analysis-raxml/RAxML_info.test
Cookbook

Most frequently used: perform 100 rapid bootstrap analyses followed by 10 rapid hill-climbing ML searches from random starting trees under the GTRGAMMA substitution model.

[9]:
rax = ipa.raxml(
    data=phyfile,
    name="test",
    workdir="analysis-raxml",
    m="GTRGAMMA",
    T=20,
    f="a",
    N=100,
)
print(rax.command)
raxmlHPC-PTHREADS-SSE3 -f a -T 20 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml -s /home/deren/Documents/ipyrad/tests/pedicularis/data10_outfiles/data10.phy -p 54321 -N 100 -x 12345

Another common option: Perform N rapid hill-climbing ML analyses from random starting trees, with no bootstrap replicates.

[10]:
rax = ipa.raxml(
    data=phyfile,
    name="test",
    workdir="analysis-raxml",
    m="GTRGAMMA",
    T=20,
    f="d",
    N=10,
    x=None,
)
print(rax.command)
raxmlHPC-PTHREADS-SSE3 -f d -T 20 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/cookbook/analysis-raxml -s /home/deren/Documents/ipyrad/tests/pedicularis/data10_outfiles/data10.phy -p 54321 -N 10
Generating new phylip alignments

See the ipyrad-analysis window_extracter tool for creating new alignments by applying filtering or consensus calling to include/exclude taxa or loci/sites from analyses, with options to minimize the amount of missing data in alignments.

What’s next?

If you have reference mapped data then you should see the .treeslider() tool to infer trees in sliding windows along scaffolds; or the .window_extracter() tool to extract, filter, and concatenate RAD loci within a given window (e.g., near some known gene).

ipyrad-analysis toolkit: mrBayes

View as notebook

Bayesian phylogenetic inference can provide several advantages over ML approaches, particularly with regards to inferring dated trees (separating rate and time), and because they assess support differently, using a posterior sample of trees inferred from the full data set as opposed to bootstrap resampling of the data set. This may be particularly relevant to RAD-seq data that contains missing data.

Bayesian inference programs often contain waaay too many options, which makes them difficult to use. But this complexity is kind of necessary in order to make informed decisions when setting priors on parameters, as opposed to treating the analysis like a black box. That being said, I’ve written a sort of blackbox tool for running mrbayes analyses. Once you’ve established the set of parameters that is best for your analysis, this tool is useful for automating it across many loci (e.g., see ipa.treeslider(): https://ipyrad.readthedocs.io/en/latest/API-analysis/cookbook-treeslider.html).

Required software
[1]:
# conda install ipyrad -c bioconda
# conda install toytree -c eaton-lab
# conda install mrbayes -c bioconda -c conda-forge
[2]:
import ipyrad.analysis as ipa
import toytree
[3]:
# print current versions
print("toytree", toytree.__version__)
! mb -v
toytree 1.0.0
MrBayes, Bayesian Analysis of Phylogeny

Version:   3.2.7a
Features:  SSE AVX FMA Beagle readline
Host type: x86_64-unknown-linux-gnu (CPU: x86_64)
Compiler:  gnu 7.3.0
Setting mb param settings

The ipyrad wrapper tool for mrbayes is merely a convenience. You should still take care to understand the priors and parameters that you are selecting by reading the mrbayes documentation. We only support two relatively simple models, a clock and a non-clock model, with a number of parameters listed in the .params attribute of the analysis object. Here I demonstrate the clock model with uncorrelated lognormally distributed rate variation and a birth-death tree prior. This run, which samples 5M iterations, takes about 20 hours on 4 cores.

[4]:
# the path to your NEXUS formatted file
nexfile = "/home/deren/Documents/ipyrad/sandbox/pedicularis/analysis-ipyrad/min10_outfiles/min10.nex"

[5]:
# init mrbayes object with input data and (optional) parameter options
mb = ipa.mrbayes(
    name="pedicularis-min10-5M",
    data=nexfile,
    clock=True,
    ngen=5000000,
    samplefreq=5000,
    nruns=1,
)
[6]:
# modify a param and show all params
mb.params.samplefreq = 10000
mb.params
[6]:
brlenspr       clock:birthdeath
clockratepr    lognorm(-7,0.6)
clockvarpr     tk02
extinctionpr   beta(2, 200)
nchains        4
ngen           5000000
nruns          1
samplefreq     10000
sampleprob     0.1
samplestrat    diversity
speciationpr   exp(10)
tk02varpr      exp(1.0)
treeagepr      offsetexp(1, 5)
[7]:
# print the mb nexus string; this can be modified by changing .params
print(mb.nexus_string)
#NEXUS
execute /home/deren/Documents/ipyrad/sandbox/pedicularis/analysis-ipyrad/min10_outfiles/min10.nex;

begin mrbayes;
set autoclose=yes nowarn=yes;

lset nst=6 rates=gamma;

prset clockratepr=lognorm(-7,0.6);
prset clockvarpr=tk02;
prset tk02varpr=exp(1.0);
prset brlenspr=clock:birthdeath;
prset samplestrat=diversity;
prset sampleprob=0.1;
prset speciationpr=exp(10);
prset extinctionpr=beta(2, 200);
prset treeagepr=offsetexp(1,5);

mcmcp ngen=5000000 nrun=1 nchains=4;
mcmcp relburnin=yes burninfrac=0.25;
mcmcp samplefreq=10000;
mcmcp printfreq=10000 diagnfr=5000;
mcmcp filename=/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex;
mcmc;

sump filename=/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex;
sumt filename=/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex;
end;

Run mrbayes

This takes about 20 minutes to run on my laptop for a data set with 13 samples and ~1.2M SNPs and about 14% missing data. For a publication-quality analysis you will likely want to run it quite a bit longer and to test for convergence of your runs using tools like tracer and treeannotator.

[8]:
# run the command, (options: block until finishes; overwrite existing)
mb.run(block=True, force=True)
job pedicularis-min10-5M finished successfully
Accessing results

You can access the results files of the mrbayes analysis from the .trees attribute of the mb analysis object, or you can write out the full path to the result files yourself. They are located in your working directory, which can be set as a parameter option and otherwise defaults to "./analysis-mb/".

[9]:
# you can access the tree results from the mb analysis object
mb.trees
[9]:
constre   ~/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex.con.tre
info      ~/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex.lstat
posttre   ~/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex.t
[10]:
# for example to select the consensus tree path
mb.trees.constre
[10]:
'/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-mb/pedicularis-min10-5M.nex.con.tre'
Check the parameter values and convergence

On this very large data set (in terms of number of sites) we can see that the posterior has not explored the parameter space very well by the end of this run. We would hope to see the ESS score for all parameters above 100. Here the TH, TL, net_speciation, and clockrate parameters have low convergence scores.

[12]:
# show the convergence statistics (from the .pstat mb output file)
mb.convergence_stats.round(3)
[12]:
Mean Variance Lower Upper Median ESS
Parameter
TH 0.055 0.000 0.045 0.067 0.054 4.014
TL 0.443 0.005 0.328 0.564 0.438 4.031
r(A<->C) 0.102 0.000 0.100 0.105 0.102 325.813
r(A<->G) 0.286 0.000 0.283 0.289 0.286 267.213
r(A<->T) 0.112 0.000 0.109 0.114 0.112 341.615
r(C<->G) 0.102 0.000 0.100 0.105 0.102 376.000
r(C<->T) 0.296 0.000 0.293 0.299 0.296 376.000
r(G<->T) 0.102 0.000 0.100 0.104 0.102 376.000
pi(A) 0.297 0.000 0.296 0.298 0.297 238.441
pi(C) 0.207 0.000 0.206 0.207 0.207 331.586
pi(G) 0.208 0.000 0.207 0.209 0.208 288.870
pi(T) 0.288 0.000 0.288 0.289 0.288 239.621
alpha 0.018 0.000 0.000 0.038 0.018 182.125
net_speciation 0.406 0.016 0.189 0.621 0.400 10.388
relative_extinction 0.010 0.000 0.000 0.024 0.009 367.141
tk02var 13.475 7.837 8.495 18.632 13.420 220.202
clockrate 0.003 0.000 0.002 0.005 0.003 14.291
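
A quick programmatic check for poorly converged parameters; a minimal sketch, assuming .convergence_stats is the pandas DataFrame shown above:

[ ]:
# list parameters with an effective sample size below 100
stats = mb.convergence_stats
print(stats[stats.ESS < 100].index.tolist())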
Plotting mb consensus tree

You can plot either the consensus tree (.nex.con.tre) or the posterior distribution of trees (.nex.t) produced by a mrbayes run by loading the single tree or the list of trees with toytree, as demonstrated below.

[16]:
# load from the .trees attribute or from the saved tree file
tre = toytree.tree(mb.trees.constre, tree_format=10)

# draw the tree with styling
tre.draw(
    tip_labels_align=True,
    scalebar=True,
    #node_labels="support",
);
[figure: mrbayes consensus tree]
Plotting mb posterior distribution of trees
[27]:
# load from the .trees attribute or from the saved tree file
mtre = toytree.mtree(mb.trees.posttre)

# remove the burnin manually
mtre.treelist = mtre.treelist[10:]

# draw a few trees on a shared axis
mtre.draw_tree_grid(
    nrows=1, ncols=4, start=20, shared_axis=True,
    height=400,
);
[figure: four trees from the posterior distribution drawn on a shared axis]

The command below plots a cloud tree diagram across the ~90 trees in the posterior distribution. Because we have no fossil calibrations for these data, the root age is not of great interest, so I use the .mod function below to scale the root height of each tree to 1.0. This highlights variation among trees in relative node heights. You can see that in the posterior set of trees there is some variation in node heights near the middle of the tree, but there does not appear to be any variation in the topology recovered, which is not uncommon for large concatenated data sets.

[38]:
# set root height of all trees to 1.0
mtre.treelist = [i.mod.node_scale_root_height(1.0) for i in mtre.treelist]
[39]:
# draw the posterior of trees overlapping
mtre.draw_cloud_tree(
    width=400, height=300,
    html=True,
    edge_colors="red",
    edge_style={"stroke-opacity": 0.02},
    tip_labels={i: i.rsplit("_", 1)[0] for i in mtre.treelist[0].get_tip_labels()},
);
[figure: cloud tree of the posterior distribution]
Save figures
[14]:
import toyplot.pdf
canvas, axes = mtre.draw_cloud_tree(
    width=400, height=300,
    html=True,
    edge_colors="red",
    tip_labels={i: i.rsplit("_", 1)[0] for i in mtre.treelist[0].get_tip_labels()}
);
toyplot.pdf.render(canvas, "./pedicularis-min10-5M-mb-tree.pdf")
What else?

If you have reference mapped data then you should see the .treeslider() tool to infer trees in sliding windows along scaffolds; or the .window_extracter() tool to extract, filter, and concatenate RAD loci within a given window (e.g., near some known gene).

ipyrad-analysis toolkit: tetrad

tetrad is a species tree inference tool based on the SVDQuartets algorithm of Chifman and Kubatko. It uses the theory of phylogenetic invariants to resolve quartet trees from SNPs for all sets of quartets in a larger tree, and then joins the quartets together into a supertree using QMC. Here I demonstrate how to call tetrad from the ipyrad-analysis tools, which is convenient for keeping all of your analyses in jupyter notebooks. Alternatively, you could also call tetrad as a command-line tool (see the tetrad docs).

The following features of tetrad are highlighted:

Parallelization: tetrad can be massively parallelized on an HPC cluster. It approximately scales linearly with the number of available cores. Using the flexible ipyparallel backend, you can even parallelize over multiple nodes using MPI ([more details coming soon]).

Bootstrap sampling: In contrast to SVDquartets, tetrad is designed particularly to work well with RAD-seq data, though it can be applied to any SNP data set. Bootstrap replicates resample loci with replacement up to the same size as the original data set.

SNP subsampling: The underlying model assumes that the examined SNPs are unlinked, and tetrad uses a very efficient method to maximize the number of unlinked SNPs used in the analysis. A very crude way to achieve unlinked SNPs would be to sample one SNP per locus before beginning the analysis. However, a site that is variable in the context of all your samples is not necessarily informative for any given quartet of samples. Instead, tetrad subsamples a single SNP from every locus separately for every quartet set in the analysis, and repeats this independently in every bootstrap replicate. This way the maximum amount of unlinked SNP information is used in every quartet inference.

Required software
[1]:
# conda install ipyrad -c bioconda
# conda install tetrad -c eaton-lab -c conda-forge
[2]:
import ipyrad.analysis as ipa
import toytree
Input data

The input data file should be the .snps.hdf5 file produced by ipyrad (v.0.9.13 or newer). If you did not assemble your data in ipyrad then you can convert your SNP data to this format from any VCF using the vcf_to_hdf5 tool.

[3]:
# the path to your sequence data in HDF5 format
data = "/home/deren/Documents/virentes-reference/analysis-ipyrad/ref_min4_outfiles/ref_min4.snps.hdf5"
Initialize the analysis object

Here you can enter parameters for the analysis. By default only a subsample of the total quartets will be inferred. To instead infer all quartets simply enter a very large number for the nquartets parameter and it will use the maximum. When you initialize the object it will print the size of your dataset, the number of loci, and the number of quartets. The number of loci is of interest because this is the maximum number of SNPs that will be used in an analysis.

[7]:
# init analysis object with input data and (optional) parameter options
tet = ipa.tetrad(
    name="virentes-min4",
    data=data,
    nquartets=1e6,
    nboots=16,
)
loading snps array [37 taxa x 1182005 snps]
max unlinked SNPs per quartet [nloci]: 88938
quartet sampler [full]: 66045 / 66045
Call run
[8]:
tet.run(auto=True)
Parallel connection | d9118d19223a: 40 cores
initializing quartet sets database
[####################] 100% 0:01:48 | inferring full tree * | mean SNPs/quartet: 78381
[####################] 100% 0:01:32 | bootstrap inference 1 | mean SNPs/quartet: 77664
[####################] 100% 0:01:33 | bootstrap inference 2 | mean SNPs/quartet: 78121
[####################] 100% 0:01:33 | bootstrap inference 3 | mean SNPs/quartet: 78356
[####################] 100% 0:01:33 | bootstrap inference 4 | mean SNPs/quartet: 78399
[####################] 100% 0:01:32 | bootstrap inference 5 | mean SNPs/quartet: 77797
[####################] 100% 0:01:33 | bootstrap inference 6 | mean SNPs/quartet: 78836
[####################] 100% 0:01:32 | bootstrap inference 7 | mean SNPs/quartet: 78550
[####################] 100% 0:01:32 | bootstrap inference 8 | mean SNPs/quartet: 77943
[####################] 100% 0:01:32 | bootstrap inference 9 | mean SNPs/quartet: 77753
[####################] 100% 0:01:33 | bootstrap inference 10 | mean SNPs/quartet: 78646
[####################] 100% 0:01:33 | bootstrap inference 11 | mean SNPs/quartet: 78200
[####################] 100% 0:01:32 | bootstrap inference 12 | mean SNPs/quartet: 78073
[####################] 100% 0:01:32 | bootstrap inference 13 | mean SNPs/quartet: 78056
[####################] 100% 0:01:33 | bootstrap inference 14 | mean SNPs/quartet: 78491
[####################] 100% 0:01:32 | bootstrap inference 15 | mean SNPs/quartet: 78709
[####################] 100% 0:01:32 | bootstrap inference 16 | mean SNPs/quartet: 78777
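
The auto=True argument starts and uses a local ipyparallel cluster automatically. If you prefer to manage the cluster yourself (for example over multiple nodes with MPI, as mentioned above), a minimal sketch is to connect an ipyparallel client and pass it to .run(); the ipyclient argument here follows the pattern used by the other ipyrad-analysis tools and assumes an ipcluster instance was started separately.

[ ]:
import ipyparallel as ipp

# connect to a running ipcluster instance (started separately)
ipyclient = ipp.Client()

# distribute the quartet inference jobs over the connected engines
tet.run(ipyclient=ipyclient)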
Show the full tree with bootstrap supports
[22]:
tre = toytree.tree(tet.trees.tree).root(["HE", "NI"])
tre.draw(node_labels="support", use_edge_lengths=False);
[figure: tetrad full tree with bootstrap supports]
Show the majority-rule consensus tree with bootstrap supports
[23]:
tre = toytree.tree(tet.trees.cons).root(["HE", "NI"])
tre.draw(node_labels="support", use_edge_lengths=False);
[figure: majority-rule consensus tree with bootstrap supports]
Show variation over the bootstrap replicates
[21]:
mtre = toytree.mtree(tet.trees.boots)
mtre.treelist = [i.root(["HE", "NI"]) for i in mtre.treelist]
mtre.draw_cloud_tree(
    height=600,
    width=400,
    use_edge_lengths=False,
    html=True,
);
[figure: cloud tree over the bootstrap replicate trees]
Continuing from a checkpoint

If you want to run more bootstrap replicates you can simply re-init an analysis object with the same name, set the number of bootstrap replicates to a higher value, and call .run() again. If you call .run() after all of the requested replicates have already been completed then it will simply print that it is finished. If you want it to overwrite existing results with the same name then you can use the force arg in run.

[24]:
# analysis is finished so it will not run
tet.run()
16 bootstrap result trees already exist for virentes-min4.

Here I set the number of requested bootstrap replicates to 20 and call .run() again. You can see that the analysis continues from 17, since we already completed 16 bootstrap replicates earlier, and will go until it completes 20 bootstraps.

[26]:
# increase nboots and continue from existing analysis object
tet.params.nboots = 20
tet.run(auto=True)
Parallel connection | d9118d19223a: 40 cores
[####################] 100% 0:01:49 | bootstrap inference 17 | mean SNPs/quartet: 78350
[####################] 100% 0:01:34 | bootstrap inference 18 | mean SNPs/quartet: 77653
[####################] 100% 0:01:32 | bootstrap inference 19 | mean SNPs/quartet: 78560
[####################] 100% 0:01:32 | bootstrap inference 20 | mean SNPs/quartet: 78380

Alternatively, maybe you are returning to this analysis after a while and decide you want to do more bootstraps. You can re-load the analysis object by entering the same name and workdir as in the original analysis, and adding the load=True argument. I set the number of bootstraps to 25 now. This will load the results from before and add new results when you call .run().

[29]:
# # re-init analysis object (will load existing results at this name)
# tet = ipa.tetrad(
#     name="virentes-min4",
#     data=data,
#     nquartets=1e6,
#     nboots=25,
#     load=True,
# )
# tet.run(auto=True)

ipyrad-analysis toolkit: STRUCTURE

Structure v.2.3.4 is a standard tool for examining population genetic structure based on allele frequencies within and among populations. Although many new implementations of the structure algorithm have been developed in recent years offering improvements to speed, the classic tool offers a number of useful options that keep it relevant to this day.

Required software
[1]:
# conda install ipyrad -c bioconda
# conda install -c bioconda -c ipyrad structure clumpp
# conda install toyplot -c eaton-lab
[2]:
import ipyrad.analysis as ipa
import toyplot
Required input data files

Your input data should be a .snps.hdf5 database file produced by ipyrad. If you do not have this you can generate it from any VCF file following the vcf2hdf5 tool tutorial. The database file contains the genotype call information as well as the linkage information used for subsampling unlinked SNPs and for bootstrap resampling.

[3]:
# the path to your .snps.hdf5 database file
data = "/home/deren/Downloads/ref_pop2.snps.hdf5"
Note: missing data in STRUCTURE analyses:

Structure infers the values of missing data while it runs the MCMC chain. No imputation is required, but it will perform more accurately if there is less missing data and when base calls are more accurate. I recommend not imputing data and simply filtering fairly stringently.

Approximate run times

This example data set should probably be run for a longer burnin and number of reps if it were to be used in a publication. For reference, this data set takes about 2.5 hours to run 12 jobs on a 4 core laptop for a data set with 27 samples and ~125K SNPs. If your data set has more samples or SNPs then it will take longer. If you have 2X as many cores then it will run 2X faster.

Input data file and population assignments

If you are using the “sample” input method then population assignments (imap dictionary) are used for filtering, color coding plots, and for imputation. If you are using the “kmeans” imputing method then population assignments are only used for filtering and color coding plots.

[4]:
# group individuals into populations
imap = {
    "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "sagr": ["CUVN10", "CUCA4", "CUSV6"],
    "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
    "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
}

# require that 50% of samples have data in each group
minmap = {i: 0.5 for i in imap}
Enter data file and params

The struct analysis object takes input data as the .snps.hdf5 file produced by ipyrad. All other parameters are optional. The imap dictionary groups individuals into populations and minmap can be used to filter SNPs to only include those that have data for at least some proportion of samples in every group. The mincov option works similarly: it filters SNPs that are present in less than some proportion of all samples (in contrast to minmap, this does not use the imap groupings).

When you init the object it will load the data and apply filtering. The printed output tells you how many SNPs were removed by each filter and the remaining amount of missing data after filtering. These remaining missing values are the ones that will be filled with imputation.

[6]:
# init analysis object with input data and (optional) parameter options
struct = ipa.structure(
    name="test",
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.9,
)
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 222081
Filtered (minmap): 112898
Filtered (combined): 226418
Sites after filtering: 123496
Sites containing missing values: 96001 (77.74%)
Missing values in SNP matrix: 142017 (4.26%)
Run STRUCTURE and plot results.

The burnin and numreps parameters determine the length of the run. For analyses with many samples and with larger values of K you should use much larger values than these.

[7]:
struct.mainparams.burnin = 5000
struct.mainparams.numreps = 10000
[8]:
struct.run(nreps=3, kpop=[2, 3, 4, 5], auto=True)
Parallel connection | oud: 4 cores
[####################] 100% 2:26:57 | running 12 structure jobs
Analyze results: Choosing K
[83]:
etable = struct.get_evanno_table([2, 3, 4, 5])
etable
[83]:
Nreps lnPK lnPPK deltaK estLnProbMean estLnProbStdev
2 3 0.000000 0.000000 0.000000 -254535.766667 2023.420259
3 3 25229.900000 35261.166667 30.892635 -229305.866667 1141.410147
4 3 -10031.266667 1451.800000 1.675614 -239337.133333 866.428568
5 3 -8579.466667 0.000000 0.000000 -247916.600000 8537.460208
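
To pick the K value with the largest deltaK programmatically, a minimal sketch, assuming the Evanno table above is a pandas DataFrame indexed by K:

[ ]:
# K value with the largest deltaK (Evanno method)
print(etable.deltaK.idxmax())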
[90]:
# get canvas object and set size
canvas = toyplot.Canvas(width=400, height=300)

# plot the mean log probability of the models in red
axes = canvas.cartesian(ylabel="estLnProbMean")
axes.plot(etable.estLnProbMean * -1, color="darkred", marker="o")
axes.y.spine.style = {"stroke": "darkred"}

# plot delta K with its own scale bar of left side and in blue
axes = axes.share("x", ylabel="deltaK", ymax=etable.deltaK.max() + etable.deltaK.max() * .25)
axes.plot(etable.deltaK, color="steelblue", marker="o");
axes.y.spine.style = {"stroke": "steelblue"}

# set x labels
axes.x.ticks.locator = toyplot.locator.Explicit(range(len(etable.index)), etable.index)
axes.x.label.text = "K (N ancestral populations)"
[figure: estLnProbMean (red) and deltaK (blue) vs. K (N ancestral populations)]
Analyze results: Barplots
[7]:
k = 3
table = struct.get_clumpp_table(k)

[K3] 3/3 results permuted across replicates (max_var=0).
[8]:
# sort list by columns
table.sort_values(by=list(range(k)), inplace=True)

# or, sort by a list of names (here taken from imap)
import itertools
onames = list(itertools.chain(*imap.values()))
table = table.loc[onames]
[9]:
# build barplot
canvas = toyplot.Canvas(width=500, height=250)
axes = canvas.cartesian(bounds=("10%", "90%", "10%", "45%"))
axes.bars(table)

# add labels to x-axis
ticklabels = [i for i in table.index.tolist()]
axes.x.ticks.locator = toyplot.locator.Explicit(labels=ticklabels)
axes.x.ticks.labels.angle = -60
axes.x.ticks.show = True
axes.x.ticks.labels.offset = 10
axes.x.ticks.labels.style = {"font-size": "12px"}
[figure: STRUCTURE barplot of ancestry proportions, K=3]

Cookbook

Advanced: Load existing results

Results files can be loaded by providing the name and workdir combination that leads to the path where your previous results were stored.

[12]:
rerun = ipa.structure(
    data=data,
    name="test",
    workdir="analysis-structure",
    imap=imap,
    load_only=True,
)
12 previous results loaded for run [test]
[13]:
rerun.get_clumpp_table(3)
[K3] 3/3 results permuted across replicates (max_var=0).
[13]:
0 1 2
BJSB3 0.0000 1.0000 0.0000
BJSL25 0.0000 1.0000 0.0000
BJVL19 0.0000 1.0000 0.0000
BZBB1 0.0000 0.0000 1.0000
CRL0030 0.0000 0.0000 1.0000
CUCA4 0.3450 0.0000 0.6550
CUSV6 0.4098 0.0000 0.5902
CUVN10 0.3408 0.0000 0.6592
FLAB109 1.0000 0.0000 0.0000
FLBA140 1.0000 0.0000 0.0000
FLCK18 1.0000 0.0000 0.0000
FLCK216 1.0000 0.0000 0.0000
FLMO62 0.9987 0.0010 0.0003
FLSA185 1.0000 0.0000 0.0000
FLSF33 1.0000 0.0000 0.0000
FLSF47 1.0000 0.0000 0.0000
FLSF54 1.0000 0.0000 0.0000
FLWO6 1.0000 0.0000 0.0000
HNDA09 0.0000 0.0000 1.0000
LALC2 1.0000 0.0000 0.0000
MXED8 0.1760 0.6953 0.1287
MXGT4 0.1531 0.8093 0.0377
MXSA3017 0.0477 0.0013 0.9510
SCCU3 1.0000 0.0000 0.0000
TXGR3 0.3649 0.6267 0.0083
TXMD3 0.3987 0.6010 0.0003
TXWV2 1.0000 0.0000 0.0000

Advanced: Add replicates or additional K values

You can continue an analysis with the same name and workdir by setting additional replicates or values of K and calling .run() again. Here I will increase the number of replicates per K value from 3 to 5, and run one additional K value. Be sure to use all of the same parameter and filtering values that you used in the previous run or you might cause unexpected problems.

Here because we already finished 3 replicates for K=2,3,4,5 it will run 2 more for each of those, and it will run 5 replicates for K=6 since we do not have any finished replicates of those yet. You can see which result files exist for a named analysis object by accessing the .result_files attribute, or by looking in the working directory. To overwrite existing files instead of adding more replicates you can use force=True in the run command. You could also simply create a new object with a different name.
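
For example, a minimal sketch using the .result_files attribute mentioned above to see which result files already exist for this named analysis object:

[ ]:
# list the existing result files for the named analysis object
struct.result_files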

[15]:
# init analysis object with same params as previously
struct = ipa.structure(
    name="test",
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.9,
)

# use the same params as before
struct.mainparams.burnin = 5000
struct.mainparams.numreps = 10000

# call run for all K values you want to have 5 finished replicates
struct.run(nreps=5, kpop=[2, 3, 4, 5, 6], auto=True)
12 previous results loaded for run [test]
Samples: 27
Sites before filtering: 349914
Filtered (indels): 0
Filtered (bi-allel): 13001
Filtered (mincov): 222081
Filtered (minmap): 112898
Filtered (combined): 226418
Sites after filtering: 123496
Sites containing missing values: 96001 (77.74%)
Missing values in SNP matrix: 142017 (4.26%)
Parallel connection | oud: 4 cores
[####################] 100% 3:39:43 | running 13 structure jobs
[16]:
struct.get_evanno_table([2, 3, 4, 5, 6])
[16]:
Nreps lnPK lnPPK deltaK estLnProbMean estLnProbStdev
2 5 0.00 0.00 0.000000 -254950.12 1675.280397
3 5 25807.62 38146.76 39.701878 -229142.50 960.830118
4 5 -12339.14 7180.72 2.018760 -241481.64 3556.995749
5 5 -5158.42 8885.00 1.413531 -246640.06 6285.676647
6 5 3726.58 0.00 0.000000 -242913.48 2164.870641

ipyrad-analysis toolkit: sratools

For reproducibility purposes, it is nice to be able to download the raw data for your analysis from an online repository like NCBI with a simple script at the top of your notebook. We’ve written a simple wrapper for the sratools command line program (which is notoriously difficult to use and poorly documented) to try to make this easier to do.

Required software
[1]:
# conda install ipyrad -c bioconda
# conda install sra-tools -c bioconda
[2]:
import ipyrad.analysis as ipa
Fetch info for a published data set by its accession ID

You can find the study ID or individual sample IDs from published papers or by searching NCBI or related databases. ipyrad can take as input one or more accession IDs for individual Runs or Studies (SRR or SRP, and similarly ERR or ERP, etc.).

[3]:
# init sratools object with an accessions argument
sra = ipa.sratools(accessions="SRP065788")
[4]:
# fetch info for all samples from this study, save as a dataframe
stable = sra.fetch_runinfo()

Fetching project data...
[5]:
# the dataframe has all information about this study
stable.head()
[5]:
Run ReleaseDate LoadDate spots bases spots_with_mates avgLength size_MB AssemblyName download_path ... SRAStudy BioProject Study_Pubmed_id ProjectID Sample BioSample SampleType TaxID ScientificName SampleName
0 SRR2895732 2015-11-04 15:50:01 2015-11-04 17:19:15 2009174 182834834 0 91 116 NaN https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 PRJNA299402 NaN 299402 SRS1146158 SAMN04202163 simple 224736 Viburnum betulifolium Lib1_betulifolium
1 SRR2895743 2015-11-04 15:50:01 2015-11-04 17:18:35 2452970 223220270 0 91 140 NaN https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 PRJNA299402 NaN 299402 SRS1146171 SAMN04202164 simple 1220044 Viburnum bitchiuense Lib1_bitchiuense_combined
2 SRR2895755 2015-11-04 15:50:01 2015-11-04 17:18:46 4640732 422306612 0 91 264 NaN https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 PRJNA299402 NaN 299402 SRS1146182 SAMN04202165 simple 237927 Viburnum carlesii Lib1_carlesii_D1_BP_001
3 SRR2895756 2015-11-04 15:50:01 2015-11-04 17:20:18 3719383 338463853 0 91 214 NaN https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 PRJNA299402 NaN 299402 SRS1146183 SAMN04202166 simple 237928 Viburnum cinnamomifolium Lib1_cinnamomifolium_PWS2105X
4 SRR2895757 2015-11-04 15:50:01 2015-11-04 17:20:06 3745852 340872532 0 91 213 NaN https://sra-download.ncbi.nlm.nih.gov/sos/sra-... ... SRP065788 PRJNA299402 NaN 299402 SRS1146181 SAMN04202167 simple 237929 Viburnum clemensae Lib1_clemensiae_DRY6_PWS_2135

5 rows × 30 columns

File names

You can select columns by their index number to use for file names. See below.

[8]:
stable.iloc[:5, [0, 28, 29]]
[8]:
Run ScientificName SampleName
0 SRR2895732 Viburnum betulifolium Lib1_betulifolium
1 SRR2895743 Viburnum bitchiuense Lib1_bitchiuense_combined
2 SRR2895755 Viburnum carlesii Lib1_carlesii_D1_BP_001
3 SRR2895756 Viburnum cinnamomifolium Lib1_cinnamomifolium_PWS2105X
4 SRR2895757 Viburnum clemensae Lib1_clemensiae_DRY6_PWS_2135
Download the data

From an sratools object you can fetch just the info, or you can download the files as well. Here we call .run() to download the data into a designated workdir. The name_fields argument controls how the downloaded files are named using fields from the fetch_runinfo table. The accessions argument here is a list of the first five SRR sample IDs from the table above.

[10]:
# select first 5 samples
list_of_srrs = stable.Run[:5]
list_of_srrs
[10]:
0    SRR2895732
1    SRR2895743
2    SRR2895755
3    SRR2895756
4    SRR2895757
Name: Run, dtype: object
[11]:
# new sra object
sra2 = ipa.sratools(accessions=list_of_srrs, workdir="downloaded")

# call download (run) function
sra2.run(auto=True, name_fields=(1,30))
Parallel connection | oud: 4 cores
[####################] 100% 0:02:07 | downloading/extracting fastq data

5 fastq files downloaded to /home/deren/Documents/ipyrad/newdocs/cookbook/downloaded
Check the data files

You can see that the files were named according to the Run ID and sample name fields from the table. The intermediate .sra files were removed and only the fastq files were saved.

[12]:
! ls -l downloaded
total 6174784
-rw-rw-r-- 1 deren deren 1372440058 Aug 17 16:36 SRR2895732_Lib1_betulifolium.fastq
-rw-rw-r-- 1 deren deren 1422226640 Aug 17 16:36 SRR2895743_Lib1_bitchiuense_combined.fastq
-rw-rw-r-- 1 deren deren  759216310 Aug 17 16:37 SRR2895755_Lib1_carlesii_D1_BP_001.fastq
-rw-rw-r-- 1 deren deren 1812215534 Aug 17 16:36 SRR2895756_Lib1_cinnamomifolium_PWS2105X.fastq
-rw-rw-r-- 1 deren deren  956848184 Aug 17 16:36 SRR2895757_Lib1_clemensiae_DRY6_PWS_2135.fastq

ipyrad-analysis toolkit: abba-baba

The baba tool can be used to measure abba-baba statistics across many different hypotheses on a tree, to easily group individuals into populations for measuring abba-baba using allele frequencies, and to summarize or plot the results of many analyses.

Load packages
[4]:
import ipyrad.analysis as ipa
import ipyparallel as ipp
import toytree
import toyplot
[2]:
print(ipa.__version__)
print(toyplot.__version__)
print(toytree.__version__)
0.9.51
0.18.0
1.1.2
Set up and connect to the ipyparallel cluster

Depending on the number of tests, abba-baba analysis can be computationally intensive, so we will first set up a clustering backend and attach to it.

[14]:
# In a terminal on your computer you must launch the ipcluster instance by hand, like this:
# `ipcluster start -n 40 --cluster-id="baba" --daemonize`

# Now you can create a client for the running ipcluster
ipyclient = ipp.Client(cluster_id="baba")

# How many cores are you attached to?
len(ipyclient)
[14]:
40
A tree-based hypothesis

abba-baba tests are explicitly tree-based, and so ipyrad requires that you enter a tree hypothesis in the form of a newick file. This is used by the baba tool to auto-generate hypotheses.

Load in your .loci data file and a tree hypothesis

We are going to use the shape of our tree topology hypothesis to generate 4-taxon tests to perform, therefore we’ll start by looking at our tree and making sure it is properly rooted.

[5]:
## ipyrad and raxml output files
locifile = "./analysis-ipyrad/pedic_outfiles/pedic.loci"
newick = "./analysis-raxml/RAxML_bipartitions.pedic"
[6]:
## parse the newick tree, re-root it, and plot it.
rtre = toytree.tree(newick).root(wildcard="prz")
rtre.draw(
    height=350,
    width=400,
    node_labels=rtre.get_node_values("support")
    )

## store rooted tree back into a newick string.
newick = rtre.write()
[toytree drawing: the rooted tree with bootstrap support values shown at the nodes]
Short tutorial: calculating abba-baba statistics

To give a sense of what this code can do, here is a quick tutorial version; each step is explained in greater detail below. We first create a 'baba' analysis object that is linked to our data file; in this example we name the variable bb. Then we tell it which tests to perform, here by automatically generating a number of tests using the generate_tests_from_tree() function. And finally, we calculate the results and plot them.

[7]:
## create a baba object linked to a data file and newick tree
bb = ipa.baba(data=locifile, newick=newick)
[ ]:
## generate all possible abba-baba tests meeting a set of constraints
bb.generate_tests_from_tree(
    constraint_dict={
        "p4": ["32082_przewalskii", "33588_przewalskii"],
        "p3": ["33413_thamno"],
    })
[8]:
## show the first 3 tests
bb.tests[:3]
[8]:
[{'p1': ['41478_cyathophylloides'],
  'p2': ['29154_superba', '30686_cyathophylla'],
  'p3': ['33413_thamno'],
  'p4': ['32082_przewalskii', '33588_przewalskii']},
 {'p1': ['41954_cyathophylloides'],
  'p2': ['29154_superba', '30686_cyathophylla'],
  'p3': ['33413_thamno'],
  'p4': ['32082_przewalskii', '33588_przewalskii']},
 {'p1': ['41478_cyathophylloides'],
  'p2': ['29154_superba'],
  'p3': ['33413_thamno'],
  'p4': ['32082_przewalskii', '33588_przewalskii']}]
[9]:
## run all tests linked to bb
bb.run(ipyclient)
[####################] 100%  calculating D-stats  | 0:02:58 |
[10]:
## show first 5 results
bb.results_table.head()
[10]:
dstat bootmean bootstd Z ABBA BABA nloci
0 0.089 0.089 0.036 2.485 436.125 365.000 8721
1 0.096 0.098 0.037 2.640 425.062 350.250 8267
2 0.101 0.102 0.044 2.301 329.938 269.375 6573
3 0.114 0.114 0.043 2.623 319.312 254.125 6255
4 0.124 0.124 0.039 3.188 400.250 312.188 8026
Look at the results

By default we do not attach the names of the samples that were included in each test to the results table, since that makes the table much harder to read and we wanted it to look very clean. However, this information is readily available in the .tests attribute of the baba object, as shown below. We have also made plotting functions that display this information clearly.

[11]:
## save the full results table to a tab-separated file
bb.results_table.to_csv("bb.abba-baba.csv", sep="\t")

## show the results table sorted by index score (Z)
sorted_results = bb.results_table.sort_values(by="Z", ascending=False)
sorted_results.head()
[11]:
dstat bootmean bootstd Z ABBA BABA nloci
16 0.290 0.290 0.030 9.531 606.312 333.812 8937
15 0.239 0.238 0.028 8.492 608.281 373.365 9266
17 0.199 0.199 0.032 6.311 550.062 367.312 9033
19 0.204 0.205 0.033 6.120 545.375 360.938 8925
20 0.160 0.161 0.030 5.383 499.766 362.047 9351
[12]:
## get taxon names in the sorted results order
sorted_taxa = bb.taxon_table.iloc[sorted_results.index]

## show taxon names in the first few sorted tests
sorted_taxa.head()
[12]:
p1 p2 p3 p4
16 [35236_rex] [30556_thamno] [33413_thamno] [32082_przewalskii, 33588_przewalskii]
15 [35236_rex, 39618_rex, 38362_rex] [30556_thamno] [33413_thamno] [32082_przewalskii, 33588_przewalskii]
17 [39618_rex, 38362_rex] [30556_thamno] [33413_thamno] [32082_przewalskii, 33588_przewalskii]
19 [38362_rex] [30556_thamno] [33413_thamno] [32082_przewalskii, 33588_przewalskii]
20 [35236_rex] [40578_rex, 35855_rex] [33413_thamno] [32082_przewalskii, 33588_przewalskii]
Plotting and interpreting results

Interpreting the results of D-statistic tests is actually very complicated. You cannot treat every test as if it were independent because introgression between one pair of species may cause one or both of those species to appear as if they have also introgressed with other taxa in your data set. This problem is described in great detail in this paper (Eaton et al. 2015). A good place to start, then, is to perform many tests and focus on those which have the strongest signal of admixture. Then, perform additional tests, such as partitioned D-statistics (described further below) to tease apart whether a single or multiple introgression events are likely to have occurred.

In the example plot below we find evidence of admixture between the sample 33413_thamno (black) with several other samples, but the signal is strongest with respect to 30556_thamno (tests 12-19). It also appears that admixture is consistently detected with samples of (40578_rex & 35855_rex) when contrasted against 35236_rex (tests 20, 24, 28, 34, and 35). Take note, the tests are indexed starting at 0.

[13]:
## plot results on the tree
bb.plot(height=850, width=700, pct_tree_y=0.2, pct_tree_x=0.5, alpha=4.0);
[baba plot: tests 0–43 arranged below the tree, with panels showing the Z-score and D-statistic for each test]
generating tests

Because tests are generated based on a tree file, it will only generate tests that fit the topology of the tree. For example, the entries below generate zero possible tests because the two samples entered for P3 (the two thamnophila subspecies) are paraphyletic on the tree topology, and therefore cannot form a clade together.

[14]:
## this is expected to generate zero tests
aa = bb.copy()
aa.generate_tests_from_tree(
    constraint_dict={
        "p4": ["32082_przewalskii", "33588_przewalskii"],
        "p3": ["33413_thamno", "30556_thamno"],
    })
0 tests generated from tree

If you want to get results for a test that does not fit on your tree you can always write the test out by hand instead of auto-generating it from the tree. Doing it this way is fine when you have few tests to run, but becomes burdensome when writing many tests.

[15]:
## writing tests by hand for a new object
aa = bb.copy()
aa.tests = [
    {"p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["33413_thamno", "30556_thamno"],
     "p2": ["40578_rex", "35855_rex"],
     "p1": ["39618_rex", "38362_rex"]},
    {"p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["33413_thamno", "30556_thamno"],
     "p2": ["40578_rex", "35855_rex"],
     "p1": ["35236_rex"]},
    ]
## run the tests
aa.run(ipyclient)
aa.results_table
[####################] 100%  calculating D-stats  | 0:00:23 |
[15]:
dstat bootmean bootstd Z ABBA BABA nloci
0 0.050 0.050 0.022 2.291 939.172 850.500 15820
1 0.163 0.163 0.021 7.900 984.953 708.797 15576
Further investigating results with 5-part tests

You can also perform partitioned D-statistic tests like below. Here we are testing the direction of introgression. If the two thamnophila subspecies are in fact sister species then they would be expected to share derived alleles that arose in their ancestor, and which would be introduced together if either one of them introgressed into a P. rex taxon. As you can see, test 0 shows no evidence of introgression, whereas test 1 shows that the two thamno subspecies share introgressed alleles that are present in two samples of rex relative to sample “35236_rex”.

More on this further below in this notebook.

[16]:
## further investigate with a 5-part test
cc = bb.copy()
cc.tests = [
    {"p5": ["32082_przewalskii", "33588_przewalskii"],
     "p4": ["33413_thamno"],
     "p3": ["30556_thamno"],
     "p2": ["40578_rex", "35855_rex"],
     "p1": ["39618_rex", "38362_rex"]},
    {"p5": ["32082_przewalskii", "33588_przewalskii"],
     "p4": ["33413_thamno"],
     "p3": ["30556_thamno"],
     "p2": ["40578_rex", "35855_rex"],
     "p1": ["35236_rex"]},
    ]
cc.run(ipyclient)
[####################] 100%  calculating D-stats  | 0:00:23 |
[17]:
## the partitioned D results for two tests
cc.results_table
[17]:
Dstat bootmean bootstd Z ABxxA BAxxA nloci
0 p3 -0.037 -0.035 0.041 0.885 230.852 248.352 8933
p4 0.044 0.044 0.053 0.840 160.125 146.531 8933
shared 0.020 0.020 0.025 0.801 449.754 431.895 8933
1 p3 0.176 0.178 0.046 3.862 252.953 177.109 8840
p4 0.135 0.134 0.052 2.612 159.172 121.266 8840
shared 0.177 0.177 0.025 7.060 514.859 359.703 8840
[17]:
## and view the 5-part test taxon table
cc.taxon_table
[17]:
p1 p2 p3 p4 p5
0 [39618_rex, 38362_rex] [40578_rex, 35855_rex] [30556_thamno] [33413_thamno] [32082_przewalskii, 33588_przewalskii]
1 [35236_rex] [40578_rex, 35855_rex] [30556_thamno] [33413_thamno] [32082_przewalskii, 33588_przewalskii]
Full Tutorial
Creating a baba object

The fundamental object for running abba-baba tests is the ipa.baba() object. This stores all of the information about the data, tests, and results of your analysis, and is used to generate plots. If you only have one data file that you want to run many tests on then you will only need to enter the path to your data once. The data file must be a '.loci' file from an ipyrad analysis. In general, you will probably want to use the largest data file possible for these tests (min_samples_locus=4), to maximize the amount of data available for any test. Once an initial baba object is created you can create different copies of that object that will inherit its parameter settings, and which you can use to perform different tests on, like below.

[19]:
## create an initial object linked to your data in 'locifile'
aa = ipa.baba(data=locifile)

## create two other copies
bb = aa.copy()
cc = aa.copy()

## print these objects
print(aa)
print(bb)
print(cc)
<ipyrad.analysis.baba.Baba object at 0x7fc55634a8d0>
<ipyrad.analysis.baba.Baba object at 0x7fc55634ab50>
<ipyrad.analysis.baba.Baba object at 0x7fc55634a110>
Linking tests to the baba object

The next thing we need to do is to link a 'test' to each of these objects, or a list of tests. In the Short tutorial above we auto-generated a list of tests from an input tree, but to be more explicit about how things work we will write out each test by hand here. A test is described by a Python dictionary that tells it which samples (individuals) should represent the ‘p1’, ‘p2’, ‘p3’, and ‘p4’ taxa in the ABBA-BABA test. You can see in the example below that we set two samples to represent the outgroup taxon (p4). This means that the SNP frequency for those two samples combined will represent the p4 taxon. For the baba object named 'cc' below we enter two tests using a list to show how multiple tests can be linked to a single baba object.

[20]:
aa.tests = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["29154_superba"],
    "p2": ["33413_thamno"],
    "p1": ["40578_rex"],
}

bb.tests = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["30686_cyathophylla"],
    "p2": ["33413_thamno"],
    "p1": ["40578_rex"],
}

cc.tests = [
    {
     "p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["41954_cyathophylloides"],
     "p2": ["33413_thamno"],
     "p1": ["40578_rex"],
    },
    {
     "p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["41478_cyathophylloides"],
     "p2": ["33413_thamno"],
     "p1": ["40578_rex"],
    },
]
Other parameters

Each baba object has a set of parameters associated with it that are used to filter the loci that will be used in the test and to set some other optional settings. If the 'mincov' parameter is set to 1 (the default) then loci in the data set will only be used in a test if there is at least one sample from every tip of the tree that has data for that locus. For example, in the tests above where we entered two samples to represent “p4” only one of those two samples needs to be present for the locus to be included in our analysis. If you want to require that both samples have data at the locus in order for it to be included in the analysis then you could set mincov=2. However, for the test above setting mincov=2 would filter out all of the data, since it is impossible to have a coverage of 2 for ‘p3’, ‘p2’, and ‘p1’, since they each have only one sample. Therefore, you can also enter the mincov parameter as a dictionary setting a different minimum for each tip taxon, which we demonstrate below for the baba object 'bb'.

[21]:
## print params for object aa
aa.params
[21]:
database   None
mincov     1
nboots     1000
quiet      False
[22]:
## set the mincov value as a dictionary for object bb
bb.params.mincov = {"p4":2, "p3":1, "p2":1, "p1":1}
bb.params
[22]:
database   None
mincov     {'p2': 1, 'p3': 1, 'p1': 1, 'p4': 2}
nboots     1000
quiet      False
Running the tests

When you execute the 'run()' command all of the tests for the object will be distributed to run in parallel on your cluster (or the cores available on your machine) as connected to your ipyclient object. The results of the tests will be stored in your baba object under the attributes 'results_table' and 'results_boots'.

[23]:
## run tests for each of our objects
aa.run(ipyclient)
bb.run(ipyclient)
cc.run(ipyclient)
[####################] 100%  calculating D-stats  | 0:00:07 |
[####################] 100%  calculating D-stats  | 0:00:06 |
[####################] 100%  calculating D-stats  | 0:00:10 |
The results table

The results of the tests are stored as a data frame (pandas.DataFrame) in results_table, which can be easily accessed and manipulated. The tests are listed in order and can be referenced by their 'index' (the number in the left-most column). For example, below we see the results for object 'cc' tests 0 and 1. You can see which taxa were used in each test by accessing them from the .tests attribute as a dictionary, or as .taxon_table which returns it as a dataframe. An even better way to see which individuals were involved in each test, however, is to use our plotting functions, which we describe further below.

[31]:
## you can sort the results by Z-score
cc.results_table.sort_values(by="Z", ascending=False)

## save the table to a file
cc.results_table.to_csv("cc.abba-baba.csv")

## show the results in notebook
cc.results_table
[31]:
dstat bootmean bootstd Z ABBA BABA nloci
0 -0.007 -0.009 0.044 0.152 238.688 241.875 8313
1 -0.008 -0.008 0.041 0.193 248.250 252.250 8822
Auto-generating tests

Entering all of the tests by hand can be a pain, which is why we wrote functions to auto-generate tests given an input rooted tree and a number of constraints on the tests to generate from that tree. It is important to add constraints on the tests, otherwise the number that can be produced becomes very large very quickly. Calculating results runs pretty fast, but summarizing and interpreting thousands of results is pretty much impossible, so it is generally better to limit the tests to those which make some intuitive sense to run. You can see in this example that implementing a few constraints reduces the number of tests from 2006 to just 14.

[32]:
## create a new 'copy' of your baba object and attach a treefile
dd = bb.copy()
dd.newick = newick

## generate all possible tests
dd.generate_tests_from_tree()

## a dict of constraints
constraint_dict={
        "p4": ["32082_przewalskii", "33588_przewalskii"],
        "p3": ["40578_rex", "35855_rex"],
    }

## generate tests with contraints
dd.generate_tests_from_tree(
    constraint_dict=constraint_dict,
    constraint_exact=False,
)

## 'exact' constraints are even more restrictive
dd.generate_tests_from_tree(
    constraint_dict=constraint_dict,
    constraint_exact=True,
)
2006 tests generated from tree
126 tests generated from tree
14 tests generated from tree
Running the tests

The .run() command will run the tests linked to your analysis object. An ipyclient object is required to distribute the jobs in parallel. The .plot() function can then optionally be used to visualize the results on a tree. Or, you can simply look at the results in the .results_table attribute.

[33]:
## run the dd tests
dd.run(ipyclient)
dd.plot(height=500, pct_tree_y=0.2, alpha=4);
dd.results_table
[####################] 100%  calculating D-stats  | 0:01:00 |
[33]:
dstat bootmean bootstd Z ABBA BABA nloci
0 0.071 0.071 0.034 2.082 415.266 360.406 9133
1 0.120 0.121 0.035 3.400 421.000 330.484 8611
2 0.085 0.088 0.041 2.044 327.828 276.609 6849
3 0.129 0.129 0.044 2.967 326.953 252.047 6505
4 0.096 0.097 0.037 2.558 376.078 310.266 8413
5 0.135 0.135 0.038 3.519 380.672 290.359 7939
6 -0.092 -0.090 0.040 2.299 278.641 335.234 6863
7 -0.109 -0.109 0.037 2.916 310.672 386.297 8439
8 -0.085 -0.083 0.044 1.948 276.609 327.828 6849
9 -0.096 -0.096 0.038 2.506 310.266 376.078 8413
10 -0.129 -0.130 0.043 3.009 252.047 326.953 6505
11 -0.135 -0.134 0.038 3.556 290.359 380.672 7939
12 -0.023 -0.023 0.032 0.714 435.562 455.750 8208
13 -0.013 -0.014 0.030 0.434 509.906 523.438 9513
[baba plot: tests 0–13 arranged below the tree, with panels showing the Z-score and D-statistic for each test]
More about input file paths (i/o)

The default (required) input data file is the .loci file produced by ipyrad. When performing D-statistic calculations this file will be parsed to retain the maximal amount of information useful for each test.

An additional (optional) file to provide is a newick tree file. While you do not need a tree in order to run ABBA-BABA tests, you do at least need a hypothesis for how your samples are related in order to set up meaningful tests. By loading in a tree for your data set we can use it to easily set up hypotheses to test, and to plot results on the tree.

[20]:
## path to a locifile created by ipyrad
locifile = "./analysis-ipyrad/pedicularis_outfiles/pedicularis.loci"

## path to an unrooted tree inferred with tetrad
newick = "./analysis-tetrad/tutorial.tree"
(optional): root the tree

For abba-baba tests you will pretty much always want your tree to be rooted, since the test relies on an assumption about which alleles are ancestral. You can use our simple tree plotting library toytree to root your tree. This library uses Toyplot as its plotting backend, and ete3 as its tree manipulation backend.

Below I load in a newick string and root the tree on the two P. przewalskii samples using the root() function. You can either enter the names of the outgroup samples explicitly or enter a wildcard to select them. We show the rooted tree from a tetrad analysis below. The newick string of the rooted tree can be saved or accessed by the .newick attribute, like below.

[39]:
## load in the tree
tre = toytree.tree(newick)

## set the outgroup either as a list or using a wildcard selector
tre.root(names=["32082_przewalskii", "33588_przewalskii"])
tre.root(wildcard="prz")

## draw the tree
tre.draw(width=400)

## save the rooted newick string back to a variable and print
newick = tre.newick
[toytree drawing: the tree re-rooted on the two przewalskii samples]
Interpreting results

You can see in the results_table below that the D-statistics range from about 0.0 to 0.15 in these tests. These values are not terribly informative on their own, and so we instead generally focus on the Z-score, which represents how far the distribution of D-statistic values across bootstrap replicates deviates from its expected value of zero. The default number of bootstrap replicates to perform per test is 1000. Each replicate resamples nloci with replacement.

In these tests ABBA and BABA occurred with pretty equal frequency. The values are calculated using SNP frequencies, which is why they are floats instead of integers, and this is also why we were able to combine multiple samples to represent a single tip in the tree (e.g., see the tests we set up above).

[41]:
## show the results table
print(dd.results_table)
    dstat  bootmean  bootstd      Z     ABBA     BABA  nloci
0   0.071     0.071    0.034  2.082  415.266  360.406   9133
1   0.120     0.121    0.035  3.400  421.000  330.484   8611
2   0.085     0.088    0.041  2.044  327.828  276.609   6849
3   0.129     0.129    0.044  2.967  326.953  252.047   6505
4   0.096     0.097    0.037  2.558  376.078  310.266   8413
5   0.135     0.135    0.038  3.519  380.672  290.359   7939
6  -0.092    -0.090    0.040  2.299  278.641  335.234   6863
7  -0.109    -0.109    0.037  2.916  310.672  386.297   8439
8  -0.085    -0.083    0.044  1.948  276.609  327.828   6849
9  -0.096    -0.096    0.038  2.506  310.266  376.078   8413
10 -0.129    -0.130    0.043  3.009  252.047  326.953   6505
11 -0.135    -0.134    0.038  3.556  290.359  380.672   7939
12 -0.023    -0.023    0.032  0.714  435.562  455.750   8208
13 -0.013    -0.014    0.030  0.434  509.906  523.438   9513
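As a rough illustration (this is not a built-in baba function, and it assumes scipy is installed), the Z column can be converted to two-tailed p-values under a normal approximation:

## two-tailed p-values from the Z-scores in the results table (hedged sketch)
from scipy import stats
pvals = 2 * (1 - stats.norm.cdf(dd.results_table.Z.abs()))
print(pvals)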
Running 5-taxon (partitioned) D-statistics

Performing partitioned D-statistic tests is no harder than running the standard four-taxon D-statistic tests. You simply enter your tests with 5 taxa in them, listed as p1-p5. We have not developed a function to generate 5-taxon tests from a phylogeny, as this test is more appropriately applied to a smaller number of tests to further tease apart the meaning of significant 4-taxon results. See the example above in the short tutorial. A simulation example will be added here soon…

[ ]:

ipyrad-analysis toolkit: BUCKy

This notebook uses the Pedicularis example data set from the first empirical ipyrad tutorial. Here I show how to run BUCKy on a large set of loci parsed from the output file with the .alleles.loci ending. All code in this notebook is Python. You can simply follow along and execute this same code in a Jupyter notebook of your own.

Software requirements for this notebook

All required software can be installed through conda by running the commented out code below in a terminal.

[5]:
## conda install -c BioBuilds mrbayes
## conda install -c ipyrad ipyrad
## conda install -c ipyrad bucky
[6]:
## import Python libraries
import ipyrad.analysis as ipa
import ipyparallel as ipp
Cluster setup

To execute code in parallel we will use the ipyparallel Python library. A quick guide to starting a parallel cluster locally can be found here, and instructions for setting up a remote cluster on an HPC system are available here. In either case, this notebook assumes you have started an ipcluster instance that this notebook can find, which the cell below will test.

[7]:
## look for running ipcluster instance, and create load-balancer
ipyclient = ipp.Client()
print("{} engines found".format(len(ipyclient)))
6 engines found
Create a bucky analysis object

The two required arguments are the name and data arguments. The data argument should be a .loci file or a .alleles.loci file. The name will be used to name output files, which will be written to {workdir}/{name}/{number}.nexus. BUCKy doesn’t deal well with missing data, so loci will only be included if they contain data for all samples in the analysis. By default, all samples found in the loci file will be used, unless you enter a list of names (the samples argument) to subsample taxa, which we do here. It is best to select one individual per species or subspecies. You can set a number of additional parameters in the .params dictionary. Here I use the maxloci argument to limit the total number of loci so that the example analysis will finish faster. But in practice, BUCKy runs quite fast and I would typically just use all of the loci in a real analysis.

[8]:
## make a list of sample names you wish to include in your BUCKy analysis
samples = [
    "29154_superba",
    "30686_cyathophylla",
    "41478_cyathophylloides",
    "33413_thamno",
    "30556_thamno",
    "35236_rex",
    "40578_rex",
    "38362_rex",
    "33588_przewalskii",
]
[14]:
## initiate a bucky object
c = ipa.bucky(
    name="buckytest",
    data="analysis-ipyrad/pedic_outfiles/pedic.alleles.loci",
    workdir="analysis-bucky",
    samples=samples,
    minsnps=0,
    maxloci=100,
)
[15]:
## print the params dictionary
c.params
[15]:
bucky_alpha           [0.1, 1.0, 10.0]
bucky_nchains         4
bucky_niter           1000000
bucky_nreps           4
maxloci               100
mb_mcmc_burnin        100000
mb_mcmc_ngen          1000000
mb_mcmc_sample_freq   1000
minsnps               0
seed                  224443248
Write data to nexus files

As you will see below, one step of this analysis is to convert the data into nexus files with a mrbayes code block. Let’s run that step quickly here just to see what the converted files look like.

[16]:
## This will write nexus files to {workdir}/{name}/[number].nex
c.write_nexus_files(force=True)
wrote 100 nexus files to ~/Documents/ipyrad/tests/analysis-bucky/buckytest
An example nexus file
[17]:
## print an example nexus file
! cat analysis-bucky/buckytest/1.nex
#NEXUS
begin data;
dimensions ntax=9 nchar=64;
format datatype=dna interleave=yes gap=- missing=N;
matrix
30556_thamno            TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG
40578_rex               TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG
38362_rex               TCCTCGGCAGCCATTAAACCGGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG
29154_superba           TCCTCGGCAGCCATTAGACCGGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTCG
30686_cyathophylla      TCCTCGGCAGCCATTAGACCGGTGGAATATGCACCATGTACCGATCCTGGATAATCAAAACTCG
33413_thamno            TCCTCGGCAGCCATTAAACCGGTGGAGTATGCACCATGTACTGATCCTGGATAATCAAAACTTG
41478_cyathophylloides  TCCTCGGCAGCCATTAGACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTCG
33588_przewalskii       TCCTCGGCAGCCATTAGACCGGTGGAGTGTGCACCATGCACCGATCCCGGATAATCAAAACTCG
35236_rex               TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG

    ;

begin mrbayes;
set autoclose=yes nowarn=yes;
lset nst=6 rates=gamma;
mcmc ngen=1000000 samplefreq=1000 printfreq=1000000;
sump burnin=100000;
sumt burnin=100000;
end;
Complete a BUCKy analysis

There are four parts to a full BUCKy analysis. The first is converting the data into nexus files; the remaining steps are .run_mrbayes(), then .run_mbsum(), and finally .run_bucky(). Each uses the files produced by the previous function in order. You can use the force flag to overwrite existing files. An ipyclient should be provided to distribute the jobs in parallel. The parallelization is especially important for the mrbayes analyses, where more cores will lead to approximately linear speed improvements. An ipyrad.bucky analysis object will run all four steps sequentially by simply calling the .run() command. See the end of the notebook for results.

[9]:
## run the complete analysis
c.run(force=True, ipyclient=ipyclient)
wrote 100 nexus files to ~/Documents/ipyrad/tests/analysis-bucky/buckytest
[####################] 100% [mb] infer gene-tree posteriors | 0:41:56 |
[####################] 100% [mbsum] sum replicate runs      | 0:00:02 |
[####################] 100% [bucky] infer CF posteriors     | 0:25:06 |
Alternatively, you can run each step separately
[18]:
## (1) This will write nexus files to {workdir}/{name}/[number].nex
c.write_nexus_files(force=True)
wrote 100 nexus files to ~/Documents/ipyrad/tests/analysis-bucky/buckytest
[19]:
## (2) distributes mrbayes jobs across the parallel client
c.run_mrbayes(force=True, ipyclient=ipyclient)
[####################] 100% [mb] infer gene-tree posteriors | 0:47:06 |
[10]:
## (3) this step is fast, simply summing the gene-tree posteriors
c.run_mbsum(force=True, ipyclient=ipyclient)
[####################] 100% [mbsum] sum replicate runs      | 0:00:07 |
[11]:
## (4) infer concordance factors with BUCKy. This will run in parallel
## for however many alpha values are in c.params.bucky_alpha list
c.run_bucky(force=True, ipyclient=ipyclient)

[####################] 100% [bucky] infer CF posteriors     | 1:35:16 |
Convenient access to results

View the results in the file [workdir]/[name]/CF-{alpha-value}.concordance. We haven’t yet developed any further ipyrad tools for parsing these results, but hope to do so in the future. The main results you are typically interested in are the Primary Concordance Tree and the Splits in the Primary Concordance Tree.

[17]:
## print first 50 lines of a results files
! head -n 50 analysis-bucky/buckytest/CF-a1.0.concordance
translate
 1 30686_cyathophylla,
 2 33413_thamno,
 3 33588_przewalskii,
 4 29154_superba,
 5 41478_cyathophylloides,
 6 40578_rex,
 7 30556_thamno,
 8 38362_rex,
 9 35236_rex;

Population Tree:
(((((((1,4),5),3),2),8),6),7,9);

Primary Concordance Tree Topology:
(((((1,4),5),3),7),((2,8),6),9);

Population Tree, With Branch Lengths In Estimated Coalescent Units:
(((((((1:10.000,4:10.000):0.250,5:10.000):0.763,3:10.000):1.798,2:10.000):0.116,8:10.000):0.044,6:10.000):0.000,7:10.000,9:10.000);

Primary Concordance Tree with Sample Concordance Factors:
(((((1:1.000,4:1.000):0.465,5:1.000):0.578,3:1.000):0.754,7:1.000):0.255,((2:1.000,8:1.000):0.231,6:1.000):0.210,9:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1; 4|2,3,6,7,8,9; 5}   0.481, 0.250,
{1,4; 5|2,6,7,8,9; 3}   0.689, 0.763,
{1,4,5; 3|2; 6,7,8,9}   0.890, 1.798,
{1,2,3,4,5,8; 6|7; 9}   0.327, 0.000,
{1,3,4,5; 2|6,7,9; 8}   0.406, 0.116,
{1,2,3,4,5; 8|6; 7,9}   0.362, 0.044,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,3,4,5|2,6,7,8,9} 0.754(0.630,0.830) 0.746(0.596,0.862)       0.041
{1,4,5|2,3,6,7,8,9} 0.578(0.470,0.690) 0.572(0.425,0.718)       0.032
{1,4|2,3,5,6,7,8,9} 0.465(0.310,0.600) 0.461(0.276,0.631)       0.059
{1,3,4,5,7|2,6,8,9} 0.255(0.110,0.400) 0.252(0.092,0.423)       0.056
{1,3,4,5,6,7,9|2,8} 0.231(0.040,0.340) 0.229(0.032,0.381)       0.069
{1,3,4,5,7,9|2,6,8} 0.210(0.070,0.350) 0.208(0.059,0.374)       0.030

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,5|2,3,4,6,7,8,9} 0.306(0.140,0.470) 0.304(0.122,0.503)       0.074
{1,2,3,4,5|6,7,8,9} 0.250(0.150,0.390) 0.248(0.121,0.410)       0.030
{1,2,3,4,5,8|6,7,9} 0.194(0.050,0.360) 0.192(0.027,0.389)       0.095
{1,2,3,4,5,7,8|6,9} 0.174(0.000,0.390) 0.173(0.000,0.409)       0.117
{1,2,3,4,5,6|7,8,9} 0.160(0.000,0.420) 0.158(0.000,0.441)       0.138
{1,2,5,6,7,8,9|3,4} 0.146(0.020,0.280) 0.145(0.009,0.300)       0.032
{1,2,3,4,5,6,9|7,8} 0.140(0.040,0.340) 0.139(0.029,0.341)       0.040
{1,3,4,5,9|2,6,7,8} 0.139(0.060,0.260) 0.138(0.038,0.283)       0.051
{1,2,3,4,5,9|6,7,8} 0.133(0.070,0.230) 0.132(0.045,0.255)       0.018
{1,2,3,4,5,8,9|6,7} 0.130(0.000,0.450) 0.130(0.000,0.467)       0.093

ipyrad-analysis toolkit: window_extracter


Extract all sequence data within a genomic window, concatenate, and write to a phylip file. Useful for inferring the phylogeny near a specific gene/region of interest. Follow up with downstream phylogenetic analysis of the region.

Key features:

  1. Automatically concatenates ref-mapped RAD loci in sliding windows.
  2. Filter to remove sites by missing data.
  3. Optionally remove samples from alignments.
  4. Optionally use consensus seqs to represent clades of multiple samples.


Required software
[1]:
# conda install ipyrad -c bioconda
# conda install raxml -c bioconda
# conda install toytree -c eaton-lab
[2]:
import ipyrad.analysis as ipa
import toytree
Required input data files

Your input data should be a .seqs.hdf5 database file produced by ipyrad. This file contains the full sequence alignment for your samples as well as associated meta-data on the genomic positions of RAD loci relative to a reference genome.

[3]:
# the path to your HDF5 formatted seqs file
seqfile = "/home/deren/Downloads/ref_pop2.seqs.hdf5"
The scaffold table

The window_extracter() tool takes the .seqs.hdf5 database file from ipyrad as its input file. You select scaffolds by their index (integer) which can be found in the .scaffold_table. We can see from the table below that this genome has 12 large scaffolds (chromosome-scale linkage blocks) and many other smaller unplaced scaffolds. If you are working with a high quality reference genome then it will likely look similar to this, whereas many other reference genomes will be composed of many more scaffolds that are mostly smaller in size. Here I will focus just on the large chromosomes.

[4]:
# first load the data file with no other arguments to see scaffold table
ext = ipa.window_extracter(seqfile)

# the scaffold table shows scaffold names and lengths in length-order
ext.scaffold_table.head(15)
[4]:
scaffold_name scaffold_length
0 Qrob_Chr01 55068941
1 Qrob_Chr02 115639695
2 Qrob_Chr03 57474983
3 Qrob_Chr04 44977106
4 Qrob_Chr05 70629082
5 Qrob_Chr06 57352617
6 Qrob_Chr07 51661711
7 Qrob_Chr08 71345938
8 Qrob_Chr09 50221317
9 Qrob_Chr10 50368918
10 Qrob_Chr11 52130961
11 Qrob_Chr12 39860516
12 Qrob_H2.3_Sc0000024 2943817
13 Qrob_H2.3_Sc0000026 2906018
14 Qrob_H2.3_Sc0000030 2801502
Selecting scaffolds

The scaffold_idxs argument designates the scaffold to extract sequence data from. This is the index (row) of the named scaffold from the scaffold table (e.g., above). The window_extracter tool will select all RAD data within this window and exclude any sites that have no data (e.g., the space between RAD markers, or the space between paired reads) to create a clean, concise alignment.

The .stats attribute shows the information content of the selected window before and after filtering. The stats are returned as a dataframe, showing the size, information content, missingness, and number of samples in the alignment. You can see that the 55Mbp scaffold is reduced to a 450Kbp alignment that includes 13K SNPs and has 20% missing data across 30 samples (NB: this dataset already had some minimum sample filtering applied during assembly). The default filtering applied to sites only reduced the number of sites by a few thousand.

[5]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
)

# show stats of the window
ext.stats
[5]:
scaffold start end sites snps missing samples
prefilter Qrob_Chr01 0 55068941 456515 13133 0.20 30
postfilter Qrob_Chr01 0 55068941 449832 13131 0.19 30
Subsetting scaffold windows

You can use the start and end arguments to select subsets of scaffolds as smaller windows to be extracted. As with the example above, the selected window will be filtered to reduce missing data. If there is no data in the selected window the stats will show no sites and a warning will be printed. Examples with no data and with some data are both shown below.

[6]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
    start=0,
    end=10000,
)

# show stats of the window
ext.stats
No data in selected window.
[6]:
scaffold start end sites snps missing samples
prefilter Qrob_Chr01 0 10000 0 0 0 0
postfilter Qrob_Chr01 0 10000 0 0 0 0
[7]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
    start=500000,
    end=800000,
)

# show stats of the window
ext.stats
[7]:
scaffold start end sites snps missing samples
prefilter Qrob_Chr01 500000 800000 3431 103 0.14 30
postfilter Qrob_Chr01 500000 800000 3422 103 0.14 30
Filtering missing data with mincov

You can filter sites from the alignment by using mincov, which applies a filter to all sites in the alignment. For example, mincov=0.5 will require that 50% of samples contain a site that is not N or - for the site to be included in the alignment. This value can be a proportion like 0.5, or it can be a number, like 10.


[8]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
    start=500000,
    end=800000,
    mincov=0.8,
    rmincov=0.5,
)

# show stats of the window
ext.stats
[8]:
scaffold start end sites snps missing samples
prefilter Qrob_Chr01 500000 800000 3431 103 0.14 30
postfilter Qrob_Chr01 500000 800000 2720 94 0.07 29
Filtering missing data with imap and minmap

An imap dictionary can be used to group samples into populations/species, as in the example below. It takes key/value pairs where the key is the name of the group, and the value is a list of sample names. One way to use an imap is to apply a minmap filter. This acts just like the global mincov filter, but applies to each group separately. Only if a site meets the minimum coverage argument for each group will it be retained in the data set. In this case the imap sampling selected 28/30 samples, and requiring data for 75% of the samples in each group reduced the number of SNPs from 92 to 86.

[10]:
# assign samples to groups/taxa
imap = {
    "reference": ["reference"],
    "virg": ["TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140"],
    "mini": ["FLSF47", "FLMO62", "FLSA185", "FLCK216"],
    "gemi": ["FLCK18", "FLSF54", "FLWO6", "FLAB109"],
    "bran": ["BJSL25", "BJSB3", "BJVL19"],
    "fusi": ["MXED8", "MXGT4", "TXGR3", "TXMD3"],
    "sagr": ["CUVN10", "CUCA4", "CUSV6"],
    "oleo": ["CRL0030", "HNDA09", "BZBB1", "MXSA3017"],
}

# set a simple minmap requiring data for 75% of samples in each group
minmap = {name: 0.75 for name in imap}
[11]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
    start=500000,
    end=800000,
    mincov=0.8,
    imap=imap,
    minmap=minmap,
)

# show stats of the window
ext.stats
[11]:
scaffold start end sites snps missing samples
prefilter Qrob_Chr01 500000 800000 3431 92 0.12 28
postfilter Qrob_Chr01 500000 800000 2904 86 0.08 28
Subsample taxa with imap

You can also use an imap dictionary to select which samples to include/exclude from an analysis. This is an easy way to remove rogue taxa, hybrids, or technical replicates from phylogenetic analyses. Here I select a subset of taxa to include in the analysis and keep only sites that have 80% coverage on scaffold 2 (Qrob_Chr03).

[12]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=2,
    mincov=0.8,
    imap={
        "include": [
            "TXWV2", "LALC2", "SCCU3", "FLSF33", "FLBA140",
            "FLSF47", "FLMO62", "FLSA185", "FLCK216",
            "FLCK18", "FLSF54", "FLWO6", "FLAB109",
        ]
    },
)

# show stats of the window
ext.stats
[12]:
scaffold start end sites snps missing samples
prefilter Qrob_Chr03 0 57474983 419278 7283 0.22 13
postfilter Qrob_Chr03 0 57474983 302300 5441 0.10 13
Concatenate multiple scaffolds together

You can also concatenate multiple scaffolds together using window_extracter. This can be useful for creating genome-wide alignments, or smaller subsets of the genome. For example, you may want to combine multiple scaffolds from the same chromosome together, or, if you are working with denovo data, you could even combine a random sample of anonymous loci together as a sort of pseudo bootstrapping procedure. To select multiple scaffolds you simply provide a list or range of scaffold idxs.

[13]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=[0, 1, 2, 3, 4, 5],
    mincov=0.5,
)

# show stats of the window
ext.stats
[13]:
scaffold start end sites snps missing samples
0 concatenated 0 2835162 2835162 90452 0.152085 30
Consensus reduction with imap

You can further reduce missing data by condensing data from multiple samples into a single “consensus” representative using the consensus_reduce=True option. This uses the imap dictionary to group samples into groups and sample the most frequent allele. This can be particularly useful for analyses in which you want dense species-level coverage with little missing data, but it is not particularly important which individual represents the sampled allele for a species at a given locus. For example, if you want to construct many gene trees with one representative per species to use as input to a two-step species tree inference program like ASTRAL.


[22]:
# select a scaffold idx, start, and end positions
ext = ipa.window_extracter(
    data=seqfile,
    scaffold_idxs=0,
    start=200000,
    end=5000000,
    mincov=0.8,
    imap=imap,
    minmap=minmap,
    consensus_reduce=True,
)

# show stats of the window
ext.stats
[22]:
scaffold start end sites snps missing samples
prefilter Qrob_Chr01 200000 5000000 51288 1622 0.19 28
postfilter Qrob_Chr01 200000 5000000 48927 651 0.01 8
Write selected window to a file

Once you’ve chosen the final set of arguments to select the window of interest you can write the alignment to .phy format by calling the .run() command. If you want to write to nexus format you can simply add the argument nexus=True.

[24]:
ext.run(force=True)
Wrote data to /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-window_extracter/scaf0-200000-5000000.phy
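If you prefer nexus output, here is a minimal sketch based on the nexus argument described above:

# write the same alignment in nexus format instead of phylip
ext.run(force=True, nexus=True)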
Accessing the output files

The output files created by the .run() command will be written to the working directory (defaults to “./analysis-window_extracter”). You can either find the full path to that file or access it easily from the extracter object itself as an attribute like below.

[26]:
# path to the phylip file output
ext.outfile
[26]:
'/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-window_extracter/scaf0-200000-5000000.phy'

Advanced: Infer tree from phy output

You can pass in the file path that was created above to a number of inference programs. The ipyrad tools for raxml and mrbayes both accept phylip format (ipyrad converts it to nexus under the hood for mrbayes).

[19]:
# run raxml on the phylip file
rax = ipa.raxml(data=ext.outfile, name="test", N=50, T=4)

# show the raxml command
print(rax.command)
/home/deren/miniconda3/envs/py36/bin/raxmlHPC-PTHREADS-AVX2 -f a -T 4 -m GTRGAMMA -n test -w /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-raxml -s /home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-window_extracter/scaf0-500000-800000.phy -p 54321 -N 50 -x 12345
[20]:
# run job and wait to finish
rax.run(force=True)
job test finished successfully
[21]:
# plot the tree for this genome window
print(rax.trees.bipartitions)
tre = toytree.tree(rax.trees.bipartitions)
rtre = tre.root("reference").collapse_nodes(min_support=50)
rtre.draw(node_labels="support");
/home/deren/Documents/ipyrad/newdocs/API-analysis/analysis-raxml/RAxML_bipartitions.test
[toytree drawing: rooted tree of the consensus-reduced groups (reference, bran, fusi, oleo, mini, gemi, virg, sagr) with support values at the remaining nodes]

Advanced: Infer trees from many windows

Now that you’ve seen how easy it is to extract a single window from the genome, and to apply a number of filters to it and then infer a tree, you can imagine how easy it is to apply this framework to many hundreds or thousands of windows along the genome. Head over to the tree_slider tutorial next to see this in action.

Frequently Asked Questions

Some very general assembly guidelines offer insights into choosing parameters and interpreting assembly statistics.

Troubleshooting Procedures

Troubleshooting ipyparallel issues

Sometimes ipyrad can have trouble talking to the ipyparallel cluster on HPC systems. First we’ll get an interactive shell on an HPC compute node (YMMV with the qsub -I here; you might need to specify the queue and allocate specific resources).

qsub -I
ipcluster start -n 4 --daemonize

Then type ipython to open an ipython session.

import ipyparallel as ipp

rc = ipp.Client()
rc[:]

The result should look something like this:

Out[1]: <DirectView [0, 1, 2, 3]>
import ipyparallel as ipp

rc = ipp.Client(profile="default")
rc[:]
import ipyrad as ip

## location of your json file here
data = ip.load_json("dir/path.json")

print(data.ipcluster)
data = ip.Assembly('test')

data.set_params("raw_fastq_path", "path_to_data/\*.gz")
data.set_params("barcodes_path", "path_to_barcode.txt")

data.run('1')

print(data.stats)
print(data.ipcluster)
{'profile': 'default', 'engines': 'Local', 'quiet': 0, 'cluster_id': '', 'timeout': 120, 'cores': 48}
data.write_params('params-test.txt')

Don’t forget to stop the ipcluster when you are done.

ipcluster stop

“What do I do about all this missing data?”

Here’s what not to do: Don’t try to treat RADSeq data as if it were a really big Sanger dataset. This is a LOW-THROUGHPUT mindset. You have to be bigger than the missing data. Rise above it. The best thing is not to freak out and try to remove all missing data, but to perform the analysis in such a way as to take the most care with it. Embrace the uncertainty!

For example, look at what we do with missing data in the PCA analysis tutorial: PCA Imputing Missing Data. This is the only principled way to deal with missing data. Think about it this way: only retaining sites with high sample coverage is BIASING toward conserved regions, and this is going to be highly detrimental to downstream analysis.

Overfiltering on min_samples_locus is a crime against your data!

Running ipyrad on HPC that restricts write-access to /home on compute nodes

Some clusters forbid writing to /home on the compute nodes. This guarantees that users only write to scratch drives or high-performance, high-volume disks, and not to the user home directory (which is probably high latency/low volume). Users have write access on login, just not inside batch jobs. This manifests in weird ways and is hard to debug, but you can fix it by adding an export inside your batch script.

export HOME=/<path>/<to>/<some>/<writable>/<dir>

In this way, ipcluster and ipyrad will both look in $HOME for the .ipython directory.

ipyrad crashes during dereplication in step 3

ERROR sample [XYZ] failed in step [derep_concat_split]; error: EngineError(Engine '68e79bbc-0aae-4c91-83ec-97530e257387' died while running task u'fdef6e55-dcb9-47cb-b4e6-f0d2b591b4af')

If step 3 crashes during dereplication you may see an error like the one above. Step 3 can take quite a lot of memory if your data do not de-replicate very efficiently, meaning that the sample which failed may contain a lot of singleton reads.

You can take advantage of the following settings during step 2 to better filter your data so that it will be cleaner, and thus dereplicate more efficiently. This will in turn greatly speed up the step 3 clustering and aligning steps. A brief API sketch follows the list below.

  • Use the “filter_adapters” = 2 argument in ipyrad which will search for and remove Illumina adapters.
  • Consider trimming edges of the reads with the “trim_reads” option. An argument like (5, 75, 5, 75) would trim the first five bases of R1 and R2 reads, and trim all reads to a max length of 75bp. Trimming to a fixed length helps if your read qualities are variable, because the reads may be trimmed to variable lengths.
  • Try running on a computer with more memory, or requesting more memory if on a cluster.
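A minimal sketch of applying the first two suggestions through the ipyrad API (the JSON path is hypothetical; the same values can be set in your params file for the CLI):

import ipyrad as ip

## load an existing assembly from its JSON file (path is hypothetical)
data = ip.load_json("dir/path.json")

## stricter adapter filtering and fixed-length trimming, as suggested above
data.set_params("filter_adapters", 2)
data.set_params("trim_reads", (5, 75, 5, 75))

## re-run from step 2 so the new filters are applied before clustering
data.run("2", force=True)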

Collisions with other local python/conda installs

Failed at nopython (nopython frontend)
UntypedAttributeError: Unknown attribute "any" of type Module(<module 'numpy' from...

In some instances if you already have conda/python installed the local environment variable PYTHONPATH will be set, causing python to use versions of modules outside the miniconda path set during ipyrad installation. This error can be fixed by blanking the PYTHONPATH variable during execution (as below), or by adding the export to your ~/.bashrc file.

export PYTHONPATH=""; ipyrad -p params.txt -s 1

Why doesn’t ipyrad handle PE original RAD?

The paired-end RAD protocol is tricky to assemble de novo. Because of the sonication step, R2 doesn’t line up nicely. ipyrad makes strong assumptions about how R1 and R2 align, assumptions which are met by PE GBS and ddRAD, but which are not met by original RAD. This doesn’t matter (as much) if you have a reference genome, but if you don’t have a reference it’s a nightmare… dDocent has a PE-RAD mode, but I haven’t evaluated it. I know that people have also used stacks (because stacks treats R1 and R2 as independent loci). If people ask me how to assemble PE-RAD de novo in ipyrad I tell them to just assemble it as SE and ignore R2.

Why doesn’t ipyrad write out the .alleles format with phased alleles like pyrad used to?

We’re hoping to provide something similar eventually. The problem with the pyrad alleles file is that the alleles are only phased correctly when we enforce that reads must align almost completely, i.e., that they are not staggered in their overlap. The alleles are correct for RAD data, because the reads match up perfectly on their left side; however, staggered overlaps are common in other data sets that use very common cutters, like ezRAD and some GBS, and especially so when R1 and R2 reads merge. So we needed to change to an alternative way of coding the alleles so that we can store both phased and unphased alleles, and it’s just taking a while to do. For now we are only providing unphased alleles, although we do save the estimated number of alleles for each locus. This information is somewhat hidden under the hood at the moment, though.

Why is my assembly taking FOREVER to run?

There have been a few questions recently about long running jobs (e.g., >150 hours), which in my experience should be quite rare when many processors are being used. In general, I would guess that libraries which take this long to run are probably overloaded with singleton reads, meaning reads are not clustering well within or across samples. This can happen for two main reasons: (1) Your data set actually consists of a ton of singleton reads, which is often the case in libraries that use very common cutters like ezRAD; or (2) Your data needs to be filtered better, because low quality ends and adapter contamination are causing the reads to not cluster.

If you have a lot of quality issues or if your assembly is taking a long time to cluster, here are some ways to filter more aggressively, which should improve runtime and the quality of the assembly (a brief API sketch follows the list):

  • Set filter_adapters to 2 (stringent=trims Illumina adapters)
  • Set phred_Qscore_offset to 43 (more aggressive trimming of low quality bases from the 3’ end of reads)
  • Hard trim the first or last N bases from raw reads by setting e.g., trim_reads to (5, 5, 0, 0)
  • Add additional ‘adapter sequences’ to be filtered (any contaminant can be searched for, I have added long A-repeats in one library where this appeared common). This can be done easily in the API, but requires editing the JSON file for the CLI.
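
For API users, a minimal sketch of these settings. The JSON file name is a placeholder, and the hackersonly attribute used for extra adapters is an assumption based on the v0.9 API and may differ in your version:

import ipyrad as ip

# the JSON file name is a placeholder for your own assembly
data = ip.load_json("my_assembly.json")

# more aggressive adapter filtering and quality trimming
data.params.filter_adapters = 2
data.params.phred_Qscore_offset = 43

# extra contaminant sequences to screen for; the hackersonly attribute name
# below is an assumption based on the v0.9 API
data.hackersonly.p3_adapters_extra.append("AAAAAAAAAAAAAAAAAAAA")

# re-run filtering, then continue with clustering
data.run("23", force=True)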

I still don’t understand the max_alleles_consens parameter

In step 5, base calls are made with a diploid model using the parameters estimated in step 4. The only special case is when max_alleles_consens = 1, in which case the step 4 heterozygosity estimate is fixed to zero, the error rate soaks up all of the variation within sites, and the step 5 base calls are haploid. For all other values of max_alleles_consens, base calls are made with the diploid model using the H and E values estimated in step 4. After base calls are made, ipyrad counts the number of alleles in each cluster. This value is simply stored in step 5 and used later in step 7 to filter loci, under the assumption that if a locus has paralogs in one sample then it probably has them in other samples, even if there wasn’t enough variation to detect them.
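
For example, to force haploid base calls from the API (a minimal sketch; the JSON file name is a placeholder, and steps 4 and 5 must be re-run for the change to take effect):

import ipyrad as ip

data = ip.load_json("my_assembly.json")   # placeholder file name

# haploid base calling: H is fixed at 0 and E absorbs within-site variation
data.params.max_alleles_consens = 1

# re-run joint estimation and consensus calling
data.run("45", force=True)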

Why does it look like ipyrad is only using 1/2 the cores I assign, and what does the -t flag do?

Most steps of ipyrad parallelize by multiprocessing, meaning that jobs are split into smaller bits and distributed among all of the available cores. Some parts of the analysis also use multithreading, where a single function is spread across multiple cores. More complicated still, parts like step 3 run several multithreaded jobs in parallel using multiprocessing… you still with me? The -c argument is the total number of cores that are available, while the -t argument gives finer control over how the multithreaded functions are distributed among those cores. For example, the default with 40 cores and -t 2 would be to start 20 two-threaded vsearch jobs. Some parts of the code cannot proceed until other parts finish, so at times the code may run on fewer than the total number of cores available, which is likely what you are seeing in step 3: the aligning step will not start until all of the samples have finished clustering. It’s all fairly complicated, but we generally try to keep everything working as efficiently as possible. If you have just one or two samples that are much bigger (have more data) than the rest, and they are taking much longer to cluster, then you may see a speed improvement by increasing the threading argument (e.g., -t 4).
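
The API exposes the same knobs. Here is a sketch, assuming the Assembly object's ipcluster dictionary with 'cores' and 'threads' keys (check your version's API docs if this attribute differs; the JSON file name is a placeholder):

import ipyrad as ip

data = ip.load_json("my_assembly.json")   # placeholder file name

# equivalent of the CLI -c and -t flags: total cores, and threads per
# multithreaded job (e.g., 40 cores split into 10 four-threaded vsearch jobs)
data.ipcluster["cores"] = 40
data.ipcluster["threads"] = 4

data.run("3", force=True)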

How to fix the GLIBC error

If you ever see something that looks like /lib64/libc.so.6: version `GLIBC_2.14’ not found, it’s probably because you are on a cluster running an old version of GLIBC. To fix this you need to recompile whatever binary isn’t working on your crappy old machine. The easiest way to do this is a local conda build and install. Using bpp as the example:

git clone https://github.com/dereneaton/ipyrad.git
conda build ipyrad/conda.recipe/bpp/
conda install --use-local bpp

How do I interpret the distribution of SNPs (var and pis) per locus in the _stats.txt output file?

Here is an example of the first few lines of this block in the stats file:

     var  sum_var    pis  sum_pis
0    661        0  10090        0
1   1660     1660   5070     5070
2   2493     6646   1732     8534
3   2801    15049    483     9983
4   2683    25781    147    10571
5   2347    37516     59    10866
6   1740    47956     17    10968
7   1245    56671      7    11017

pis is exactly what you think it is: the count of loci with n parsimony informative sites. So row 0 is loci with no pis, row 1 is loci with 1 pis, and so on.

sum_pis keeps a running total of pis across all loci up to and including that row, which is why the sum looks weird, but I assure you it’s fine. For the row of loci with 3 pis, the count is 483, and 483 * 3 + 8534 = 9983.

var is a little trickier, and here’s where the docs are a little goofy. It counts the number of loci with n variable sites, including both autapomorphies and pis within each locus. So row 0 is all totally monomorphic loci; row 1 is all loci with either one pis or one autapomorphy; row 2 is all loci with either two pis, two autapomorphies, or one of each; and so on.

sum_var is calculated identically to sum_pis, so it also looks weird, but it’s right.

The reason the counts in, for example, row 1 do not appear to agree between var and pis is that the pis value counts all loci with exactly one pis regardless of how many autapomorphies they contain, whereas the var value counts only loci with exactly one variable site of either kind.
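
If it helps, here is a small pandas check (not part of ipyrad) that reproduces the sum_var and sum_pis columns from the counts in the block above:

import pandas as pd

# the var and pis counts from the stats block shown above
df = pd.DataFrame({
    "var": [661, 1660, 2493, 2801, 2683, 2347, 1740, 1245],
    "pis": [10090, 5070, 1732, 483, 147, 59, 17, 7],
})

# each sum_* column is the running total of (number of loci * n sites)
df["sum_var"] = (df["var"] * df.index).cumsum()
df["sum_pis"] = (df["pis"] * df.index).cumsum()
print(df)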

How to fix the “IOError(Unable to create file …)” error

This error is caused by your cluster filesystem being NFS (or some other network-based filesystem). You can disable HDF5 file locking by setting an environment variable: export HDF5_USE_FILE_LOCKING=FALSE. See here for more info:

http://hdf-forum.184993.n3.nabble.com/HDF5-files-on-NFS-td4029577.html
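
If you are using the Python API (e.g., in a Jupyter notebook), you can set the same variable from within Python, as long as it is set before any HDF5 files are opened; the JSON file name below is a placeholder:

import os

# must be set before h5py (via ipyrad) opens any HDF5 files
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import ipyrad as ip
data = ip.load_json("my_assembly.json")   # placeholder file name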

Why am I getting the ‘empty varcounts’ error during step 7?

Occasionally during step 7 you will see this error:

Exception: empty varcounts array. This could be because no samples
passed filtering, or it could be because you have overzealous filtering.
Check the values for `trim_loci` and make sure you are not trimming the
edge too far.

This can actually be caused by a couple of different problems that all result in the same behavior, namely that you are filtering out all loci.

trim_loci It’s true that if you set this parameter too aggressively all loci will be trimmed completely and thus there will be no data to output.

min_samples_locus Another way you can eliminate all data is by setting this parameter too high. Try dropping it way down, to something like 3, then rerun to get a better idea of what an appropriate value would be based on sample depths.

pop_assign_file A third way you can get this error is related to the previous one. The last line of the pop_assign_file specifies the minimum number of samples per population required to retain a locus. If you mis-specify the values for the pops in this line, it is possible to filter out all of your data and obtain the above error.
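
A quick way to diagnose over-filtering from the API is to branch the assembly and re-run step 7 with very relaxed filters (a sketch; the branch name, file name, and values are illustrative):

import ipyrad as ip

data = ip.load_json("my_assembly.json")   # placeholder file name

# branch so the original assembly and its settings stay untouched
test = data.branch("min3_test")

# relax the filters that most often remove every locus
test.params.min_samples_locus = 3
test.params.trim_loci = (0, 0, 0, 0)

# re-run only the final filtering/output step
test.run("7", force=True)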

How do I fix this error: “OSError: /lib64/libpthread.so.0: version `GLIBC_2.12’ not found”?

This error crops up if you are running ipyrad on a cluster that has an older version of GLIBC. The way to work around this is to install specific versions of some of the requirements that are compiled for the older version. Thanks to Edgardo M. Ortiz for this solution.

First clean up your current environment:

module unload python2
rm -rf miniconda2 .conda

bash Miniconda2-latest-Linux-x86_64.sh
source ~/.bashrc

then install the old version of llvmlite (and optionally the old versions of pyzmq and ipyparallel if necessary):

conda install llvmlite=0.22

conda install pyzmq=16
conda install ipyparallel=5.2

and finally reinstall ipyrad:

conda install -c conda-forge -c bioconda ipyrad
conda install toytree -c eaton-lab

optional:

conda clean --all

Why am I getting the ‘ERROR R1 and R2 files are not the same length.’ error during step 1?

This is almost certainly a disk space issue. Please be sure you have _plenty_ of disk space on whatever drive you’re doing your assembly on. Running out of disk space can cause weird problems that seem to defy logic, and that are a headache to debug (like this one). Check your disk space with: df -h

Why does the number of pis recovered in the output stats change when I change the value of max_snp_locus?

While it might seem that the number of pis shouldn’t change under varying max_snp_locus thresholds, this is in fact not true. The setting limits the maximum number of __SNPs__ per locus, not the maximum number of __PIS__. For example, if you have max_snp_locus set to 5 and a locus has 5 singleton SNPs plus one doubleton SNP (which is parsimony informative), that locus will be filtered out. If you instead set max_snp_locus to 10, the locus will be included and the ‘pis’ counter will be incremented by 1. In this way the number of PIS recovered changes with this parameter setting.

Can ipyrad assemble MIG-seq data?

MIG-seq (multiplexed ISSR genotyping by sequencing) is a method proposed by Suyama and Matsuki (2015) that targets the variable regions between simple sequence repeats (SSRs). The method produces data somewhat analogous to ddRAD, in that you have a variable region flanked on either side by sequences known to be repeated randomly at some appreciable frequency throughout the genome. Check out the figure from the manuscript. Anyway… yes, ipyrad can assemble this kind of data, though there are some tricks. Primarily we recommend higher values of filter_min_trim_len and clust_threshold. If the data were sequenced on a desktop NGS platform (Ion Torrent PGM, MiSeq), it also helps to reduce both mindepth params to recover more clusters (see the sketch below).
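
A hedged sketch of those suggestions in the API; the JSON file name and the exact values are illustrative starting points, not a validated recipe:

import ipyrad as ip

data = ip.load_json("migseq_assembly.json")   # placeholder file name

# keep only reasonably long reads after trimming, and raise the clustering
# threshold; values here are illustrative
data.params.filter_min_trim_len = 50
data.params.clust_threshold = 0.90

# recover more clusters from lower-coverage desktop runs
data.params.mindepth_majrule = 3
data.params.mindepth_statistical = 5

data.run("234567", force=True)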

Why are my ipcluster engines dying silently on cluster compute nodes?

This is a nasty bug that has bitten me more than once. If your cluster engines are running jobs and then dying silently, it may be because the cluster is headless and the engines are trying to interact with a GUI backend, which causes nasty things to happen. Here are a couple of links that provide workable solutions:

https://groups.google.com/a/continuum.io/forum/#!topic/anaconda/o0pnE9PEqA0

https://github.com/ipython/ipyparallel/issues/213

Why are my ipyrad.analysis.structure runs taking so long/not doing anything?

See the previous FAQ answer. It’s typical for HPC cluster systems to be configured without a GUI backend. Unfortunately ipyparallel and this particular GUI-less environment have a hard time interacting (for complicated reasons). We have derived a workaround that allows the parallelization to function. You should execute the following commands in a terminal on your cluster head node.

VERY IMPORTANT: This environment variable needs to be set in both .bashrc and .profile so that it is picked up whether you run ipyparallel on the head node of the cluster or on a compute node.

$ echo "# Prevent ipyparallel engines from dying in a headless environment" >> ~/.bashrc
$ echo "export QT_QPA_PLATFORM=offscreen" >> ~/.bashrc
$ echo "export QT_QPA_PLATFORM=offscreen" >> ~/.profile
$ source ~/.bashrc
$ source ~/.profile
$ env | grep QT

Why is my structure analysis crashing when it looks like it should be working?

When running structure, specifically during the get_clumpp_table call, you might be told that “No files ready for XXX-K-2 in </your/structure/folder>” when in fact there are files ready. It turns out that CLUMPP has a 100-character limit on file names, and it crashes on anything longer. The ipyrad.analysis.structure functions use absolute paths to specify file names, so it’s not hard to see how this 100-character limit could be violated. Try moving your structure analysis to a place higher in the filesystem hierarchy. Baffling!
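
You can check whether a given result file's absolute path is likely to hit that limit with a couple of lines of Python (the directory and file names below are placeholders):

import os

# CLUMPP fails on file names longer than 100 characters; the path below
# is a placeholder for one of your own structure result files
path = os.path.abspath("analysis-structure/long-project-name-K-2.outfile")
print(len(path), path)
if len(path) > 100:
    print("Too long for CLUMPP: move the analysis higher in the filesystem.")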

Problems with SRATools analysis package

Occasionally with the sratools package you might have trouble with downloading. It could look something like this: Exception in run() - index 29 is out of bounds for axis 0 with size 1. This is a problem with your esearch install, which does not have https support built in. You can verify this with the command esearch -db sra -query SRP021469, which should give you an https protocol support error. You can easily fix it by installing ssleay: conda install -c bioconda perl-net-ssleay. Thanks to @ksil91.

ValueError in step 7

During step 7, if you see something like error in filter_stacks on chunk 0: ValueError(zero-size array to reduction operation minimum which has no identity), it means that one of your filtering parameters is filtering out all of the loci. This is bad, obviously, and it’s probably because one of your filtering parameters is too strict. Take a look at a couple of the samples in the *_consens directory to make sure they look reasonable, then adjust your filtering parameters based on how the consensus reads look.

Release Notes

0.9.77

May 11, 2021

0.9.76

May 11, 2021

0.9.75

May 11, 2021

0.9.74

May 08, 2021

0.9.73

May 07, 2021

0.9.72

May 07, 2021

  • Docs: add some assembly guidelines
  • Docs: update sharing and popgen sumstats
  • Step1.FileLinker fix oops with trapping bz2 formatted input files.

0.9.71

April 15, 2021

  • demultiplex: Properly handle 3rad barcodes of different length

0.9.70

April 01, 2021

  • Allow snps_extractor to handle snps.hdf5 files with names not encoded as bytes
  • Fixing mismatch of #SBATCH and command parameters (#440)

0.9.69

March 19, 2021

  • ipa.structure: handle edge case with running more jobs with different K values

0.9.68

February 23, 2021

  • Handle i7 demux to strip trailing newline from barcode
  • demultiplex.py: Allow very short input fq files.
  • Fix project_dir inconsistency in merged assemblies.
  • Raise an error if setting a bad parameter in API mode #354

0.9.67

February 21, 2021

  • Don’t blank extra p3/p5 adapters in step2.check_adapters if filter_adapters != 3.

0.9.66

February 18, 2021

  • analysis.popgen: _collect_results() function to compile and organize results from the engines. It’s a little ugly.
  • analysis.popgen Change default minmap value to 4
  • analysis.popgen Chunking of loci and parallelization of analysis working.
  • analysis.popgen._fst_full() implemented. Had to slightly tweak the code that was there, and it works a bit different than the other sumstats, but whatever for now it works.
  • analysis.popgen implemented Dxy
  • analysis.popgen._process_locus_pops() to prepare data for pairwise summary statistics
  • analysis.popgen: Stats returned as dict instead of tuple, and remote processes dump full stats to pickle files.
  • Add a _process_locus function to remove a bunch of redundancy, and split consens bases to account for diploidy in all the calcs
  • analysis.popgen: If no imap then all samples are considered from the same population.
  • analysis.popgen: refactor _Watterson to accept sequence length and return raw and per base theta values.
  • analysis.popgen: Refactor into ‘public’ sumstat methods which can be called on array-like locus data, and ‘semi-private’ methods (e.g. _pi, or _Watterson) that are much more efficient but accept very specific information. The Processor.run() function will do some housekeeping per locus and call the ‘semi-private’ methods, for efficiency. Public methods are for testing.
  • analysis.popgen Tajima’s D function
  • analysis.popgen Watterson’s theta function
  • clustmap: Handle bam files with very large chromosomes (#435)
  • clustmap_across: Catch samtools indexing failure for large bam files and try again with .csi indexing with the flag.
  • Adopting the Processor() structure to enable parallelization. Also implemented nucleotide diversity.
  • analysis.popgen add parallelization support for calculating all sumstats per locus.
  • analysis.popgen _fst() is working for snp data
  • Add an option to LocusExtracter.get_locus to return a pd.DataFrame indexed by sample names
  • analysis.popgen getting it into current analysis tools format and starting to flesh it out.
  • CLI honors the -q flag now
  • analysis.sharing allow subsampling inside draw() to prevent recalculating data
  • analysis.shared allow sorting by mean pairwise sharing or missing
  • Allow analysis.sharing to reorder samples, and add a progress bar
  • Add new pairwise snp/missingness sharing analysis tool.
  • add locus sharing/missingness sharing analysis tool skel
  • pin numpy >=1.15, a known working version to address #429
  • Fix weird race condition with branching and pop_assign_files. See #430.

0.9.65

January 22, 2021

  • Fix #429 weird masking bug in older versions of numpy.
  • Add refs to the analysis.popgen tool.

0.9.64

January 16, 2021

  • replaced core.Assembly.database which actually wasn’t doing anything with snps_database and seqs_database to retain info about the hdf5 files in the assembly object
  • fix empirical api structure params format
  • Allow structure to accept vcf files and auto-convert to hdf5
  • fix oops in i7 demux cookbook

0.9.63

December 17, 2020

  • Fix off-by-one error in nexus output
  • update struct testdocs
  • Add Tajima’s D denominator equation to the popgen analysis tool, because I coded it before and it’s a nightmare.
  • use quiet in lex
  • plot posteriors range limits removed
  • Actually fix pca cookbook
  • fix malformatted pca cookbook
  • raxml w/ gamma rates

0.9.62

November 03, 2020

  • updating mb cookbook
  • bpp transform parser bugfix
  • add bpp dev nb
  • allow non-blocking calls again in bpp
  • bpp result split on delimiter
  • allow empty node dist in bpp
  • add option to extract as string/bytes
  • snps_extracter: report maf filter as unique filtered
  • snps extracter reports n unlinked snps
  • tmp dir for new cookbook docs
  • allow maf filter in structure
  • node_dists check for tips in bpp plot
  • new plotting tools for bpp results
  • added download util
  • add minmaf filter to snps_extracter, pca
  • additionall bpp summary tools

0.9.61

October 15, 2020

  • Fix nasty nasty bug in step 7 for issue #422

0.9.60

October 13, 2020

  • Fix mapping minus reads bug step 3

0.9.59

September 20, 2020

  • In structure._call_structure() self is not in scope, so now we pass in the full path to the structure binary.

0.9.58

September 10, 2020

  • fix oops in window_extracter
  • Allow scaffold idxs to be int
  • bugfix to skip scaffs with no hits even when end=None
  • end arg offset bugfix, may have affected size of windows in treeslider
  • Add handler for malformed loci in baba.py
  • fix to allow digest of single end RAD
  • Changed astral annotation to default (#417)
  • Update dependency documentation
  • baba2 draw fix
  • return canvas in draw
  • Fix an oops iff hackersonly.declone_PCR_duplicates && reference assembly
  • Merge pull request #409 from camayal/master
  • path fix for conda envs
  • path fix for conda envs
  • update for py38
  • Update faq.rst
  • pca docs
  • Change exit code for successful merging from 1 to 0.
  • Remove the png outfile from pca because it wants ghostscript package installed and apparently toytree doesn’t have that as a dependency. Annoying.
  • Update README.rst
  • Allow pca to write png as well.
  • Added fade option for blocks and tooltips
  • Merge pull request #1 from camayal/camayal-baba2-work
  • Changes in Drawing class

0.9.57

August 01, 2020

  • Add functionality to pca to allow adding easy text titles to the plots with the param
  • Document writing pca figure to a file
  • Allow pca.draw() to write out pdf and svg
  • force option to remove cftable in snaq
  • fix ntaxa in phy header for lex
  • mb ipcoal doc up
  • get locus phy uppers
  • changed default nboots in baba
  • Set pandas max_colwidth=250 to allow for very long sample names.
  • Fix oops in step 7 where trim_loci 3’ R1 wasn’t being used.
  • Allow imap pops to be specified as np.ndarray in snps_extracter
  • fix path snaq log
  • Fix branching oops
  • snaq working
  • snaq
  • network updated
  • testing network analysis
  • ast conda bin path fix
  • ast conda bin path fix
  • ast conda bin

0.9.56

June 29, 2020

  • revise installation docs to add conda-forge recommendation to reduce conflict errors
  • update README install instructions

0.9.55

June 26, 2020

  • bpp with ipcoal demoup
  • update tmx tool for toytree v2
  • conversion funcs in snps_extracter
  • binder add ipcoal and CF as first env
  • ipcoal terremix docs
  • tool and docs for genotype freq conversion output
  • tool and docs for genotype freq conversion output
  • baba ipcoal notebook
  • update 2brad docs slightly
  • Add force flag to force overwrite vcf_to_hdf5 (prevent redundant conversion)
  • better prior plots and transformers
  • mb more prior options and fixed tree support
  • pca allow setting colors
  • pca cna set opacity
  • pca cna set opacity
  • pca cna set opacity
  • pca cna set opacity
  • fix to allow custom colors in pca
  • docs
  • fix py2 compat
  • wex ints
  • fix api oops in window extracter docs for scaffold_idxs
  • wex start end as ints
  • re-supporting single or multiple locs/chrom in wex
  • option to not subsample SNPs
  • add mrbayes keepdir arg to organize files
  • simplified wex
  • extracter filters invariant sites when subsampled by IMAP
  • added pca panel plot
  • subsample loci jit for snps_extractor of linked SNPS
  • major baba2 update, to replace baba eventually
  • axes label fix and figt cleanup
  • handle int chrom names

0.9.54

May 31, 2020

  • off-by-one to ref pos in s3 applied again here

0.9.53

May 19, 2020

  • Fix off by 1 error in step 3 for PE data.
  • Fix toytree documentation in baba cookbook
  • Fix py2 compat by removing trailing commas from function argument lists in a couple of anaysis tools.
  • Fix oops in handling errors during convert_outputs

0.9.52

May 09, 2020

  • Fix nasty off-by-one error in reference positions
  • multiple default clock models
  • multiple default clock models
  • multiple default clock models
  • multiple default clock models
  • multiple default clock models
  • multiple default clock models
  • multiple default clock models
  • wex concat name drop fix
  • ts tmpdir renamed bootsdir
  • umap learn conda instructions
  • ts: added dryrun method
  • wex: remove print debug statement
  • Fix baba cookbook docs
  • add umap option
  • added pseudocode for a further imputer in prog
  • prettier bpp plot
  • pca analysis tool passes through quiet flag to subfunctions
  • warning about missing ref only up with no ref1 or ref2
  • merge fix
  • improving ipabpp summary funcs
  • ensure conda ipcluster bin on stop
  • bpp prior checks, new ctl build for 4.0, parsing results funcs
  • Add a helpful message if merging assemblies with technical replicates beyond step 3.
  • missing import
  • Handle empty imap population in snps_extractor
  • binary fix
  • syntaxerr on quiet
  • hide toytree dep in ast
  • ast better error message
  • ip assemble shows cluster on run by default
  • show_cluster func now listens to param arg
  • big update for bpp 4.0, uses lex
  • wex and ts both use idxs in param name now
  • simple astral run tool
  • lex: imap/minmap filtering fix
  • wex: imap/minmap filtering fix
  • fixed warning message
  • default minmap to 0 if imap and minmap empty
  • hide toyplot dependency
  • simple option to keep treefiles in treeslider
  • under the hood mods to pca draw func to make it more atomic
  • Update faq.rst
  • Set default filter_adapters parameter to 2
  • raise warning if ref+ or ref- and method not ref
  • notes on window extracter

0.9.51

April 17, 2020

  • 1 index POS in vcf output
  • minmap default is 0
  • bugfix: apply imapdrop only when imap
  • faster extraction and mincov after minmap in lex
  • mincov applies after minmap in wex
  • scaff arg entered later in cov tool
  • rmincov added to ts
  • option to keep all tmp files in treeslider
  • major fix to names sorting in wex
  • names offset by scaff length in cov plot
  • set default inner mate to 500 and use it unless user changes to None, in which case we estimate from reads
  • tmp working baba update
  • added locus extracter
  • option to keep all files in treeslider
  • added cov plot tool

0.9.50

April 05, 2020

  • Actually fix FASTQ+64 problem. Max fastq_qmax is 126, so this is set to 93 now (93+33=126)

0.9.49

April 02, 2020

  • Allow high fastq_qmax in pair merging to allow FASTQ+64 data

0.9.48

April 01, 2020

  • Record refseq mapped/unmapped for both SE & PE
  • wextract minmap+consred minmap default added
  • treeslider default args typed
  • tested working wextracter
  • baba merge
  • new dict for translation
  • updating bpp for 4.0

0.9.47

March 24, 2020

  • Fix snpstring length oops in .alleles outputs so they line up right.

0.9.46

March 24, 2020

  • Fix pd.as_matrix() call which is deprecated.
  • Force pca.draw() to honor the length of the color list if it is sufficiently long to color all samples in the imap, or at least use the length of the color list to set the value of the variable.
  • Fix oops in baba.py for importing msprime. Pushed it to Sim.__init__, since if you want to do baba, and don’t care about sims, then you shouldn’t have to install msprime.
  • h5py warning fix
  • use _pnames to use filtered names in run()

0.9.45

March 08, 2020

  • Allow more flexibility in sorted fastqs directory (DO NOT DELETE if it points to projectdir + _fastqs)
  • window extracter fix for multiple loci w/ reduce

0.9.44

March 04, 2020

  • Fix the treemix output so it actually generates. Took WAYYYY longer than I thought it would.
  • Update faq.rst
  • Update faq.rst

0.9.43

February 26, 2020

  • Fix off by 2 error in minsamp when using reference sequence
  • window extacter working for denovo loci
  • Cleaning up a TON of sphinx warnings from the docs and fixing a bunch of docs issues.
  • fix oops in baba.py (import sys)

0.9.42

February 19, 2020

  • Fix oops in step 6 which was leaving bad sample names hanging after alignment.

0.9.41

February 18, 2020

  • Set s6.data.ncpus value when routing around hierarchical clustering for ref based assemblies.
  • disable hierarchical clustering until further testing
  • split samples evenly among cgroups for hierarch clust
  • digest genomes uses qual score B instead of b

0.9.40

February 16, 2020

  • subsample loci func added
  • counts rm duplicates in denovo and works with step6 skipping alignment of loci w/ dups
  • denovo paired aligned separately again
  • fastq qmax error in merge denovo fixed

0.9.39

February 15, 2020

  • Why can’t i figure out how to comment out this plotting code right? wtf!

0.9.38

February 15, 2020

  • commented out the import of the baba_plot plotting function and the baba.plot() method as these are broken rn, and also the plotting/baba_plotting routine tries to access toyplot in a way that breaks the conda build since toyplot isn’t a strict requirement. We could fix this in the future, but i’m tring to get the bioconda package to build successfully rn.

0.9.37

February 15, 2020

  • fix import checking for baba_panel_plot.py

0.9.36

February 15, 2020

  • Handle external imports in the baba module in the same way as the other analysis tools to fix the broken bioconda build.
  • Add a pops file to the ipsimdata.tar.gz because it’s always useful.
  • Updating ipyrad/__init__.py to version - 0.9.35

0.9.35

February 12, 2020

  • Fix a bug in step 5 handling of RemoteError during indexing alleles.
  • Report debug traceback for all crashes, not just API. This is essentially making the debug flag useless in v.0.9

0.9.34

February 09, 2020

  • Roll back baba code to 0.7 version which doesn’t use the current analysis format, but which still works. Saved ongoing baba code as baba.v0.9.py

0.9.33

February 06, 2020

  • Fix major oops in consens_se which failed step 5 every time. Bad!
  • In step 6 use the sample.files.consens info, rather than data.dirs to allow for merging assemblies after step 5 where data.dirs is invalid/empty.

0.9.32

February 04, 2020

  • #392 allow scaffold names to be int
  • Add sensible error handling if only a few samples fail step 5.
  • add docs to clustmap_across
  • fix for name re-ordering in window-extracter with multiple regions selected
  • added comments
  • added comments
  • added sys
  • Actually handle failed samples in step 2.
  • fix for new h5py warning
  • fix for new sklearn warning

0.9.31

January 19, 2020

  • Fix error in bucky (progressbar hell).
  • Add error handling in a couple cases if run() hasn’t been called, e.g. before draw, and also add the pcs() function as a convenience.
  • Removed support for legacy argument format from bpp.py and updated the docs.
  • Allow PCA() to import data as vcf.
  • Add support for importing VCF into PCA

0.9.30

January 16, 2020

  • Fix whoops with bucky progressbars

0.9.29

January 15, 2020

  • Fix bucky progressbar calls.
  • Fixed progressbar calls in bucky.py
  • Conda install instructions were wrong.
  • add future print import to fasttree.py
  • Add releasenotes to the TOC

0.9.28

January 12, 2020

  • Fix versioner.py to actually record releasenotes
  • Fix releasenotes in versioner script
  • Fix releasenotes

0.9.27

January 12, 2020

  • Fix releasenotes

0.9.26

January 01, 2020

  • During steps 5 & 6 honor the filter_min_trim_len parameter, which is useful in some cases (2brad).
  • In step 3, force vsearch to honor the filter_min_trim_len param, otherwise it defaults to –minseqlength 32, which can be undesirable in some cases.

0.9.25

December 31, 2019

  • concatedits files now write to the tmpdir, rather than edits (#378), also handle refmap samples with no reads that map to the reference, also change where edits files are pulling from during PE merging to allow for assembly merging after step 2. phew.
  • digested genome bugfix - check each fbit 0 and -1
  • digest genomes nscaffolds arg support
  • docs cookbook updates
  • new consens sampling function and support for window extracter to concatenate
  • comment about zlib

0.9.24

December 24, 2019

  • Fix IPyradError import

0.9.23

December 24, 2019

  • Regress baba.py

0.9.22

December 23, 2019

  • Add support for .ugeno file
  • Add support for .ustr format
  • Remove duplication of code in write_str()
  • Fix docs for output formats
  • Add back output formats documentation

0.9.21

December 23, 2019

  • Fix stupid bug introduced by fe8c2dfc282e177a7c18f6e2e23ef84d284a9e3f

0.9.20

December 18, 2019

  • Expose analysis.baba for testing
  • fasterq-dump seems to be only avail on linux
  • Fix bug in handling sample names in the pops file. re: #375.
  • Allow faidict scaffold names to be int (cast to dtype=object)

0.9.19

December 03, 2019

  • Fix step 6 with pop_assign_file
  • fix for empty samples after align
  • list missing as ./. in VCF (like we used to)

0.9.18

November 23, 2019

  • Fix oops handling missing data in vcf to hdf5
  • mb binary path bugfix
  • treeslider mb bugfix
  • treeslider mb working
  • treemix support for conda env installations
  • additional drawing options for pca
  • raxml cookbook update
  • tetrad notebook updated
  • Fix oops in params.py checking for lowercase overhangs seqs
  • Fix a nasty stupid bug setting the overhang sequence
  • Add back the docs about merging
  • Error checking in step 5.
  • Forbid lowercase in overhang sequence

0.9.17

November 04, 2019

  • Ooops. Allow popsfile w/o crashing, and allow populations to be integer values
  • cookbooks added link to nb
  • pca stores results as attr instead or returning

0.9.16

October 31, 2019

  • commented fix of optim chunksize calc
  • treeslider now working with mb
  • toggle to write in nexus
  • mb saves convergence stats as df
  • single-end mapping infiles bugfix
  • pca cookbook update
  • added fasttree tool
  • update cookbooks index
  • cookbooks updated headers
  • warning about denovo-ref to use new param
  • clustmap keeps i5s and can do ref minus
  • window extracter updated
  • mb load existing results and bugfix result paths
  • update treemix and mb docs
  • Fix calculation of optim during step 6 aligning. 3-4x speedup on this medium sized simulated data i’m working on.
  • Fix oops in how optim was being counted. Was counting using _unsorted_ seed handle, I switched it to use sorted and now it works more like expected
  • Clean up clust.txt files after step 3 finishes
  • Update docs and parameter descriptions to reflect new reality of several params
  • Add handlers for denovo +/- reference.
  • vcf tool docs
  • tools docs update
  • enable vcf_to_hdf5
  • pca reps legend looks nice
  • added replicate clouds to pca
  • vcf to hdf5 converter tested empirically
  • Add hils.py from the hotfix branch
  • Pull from the correct repo inside meta.yaml
  • VCF 9’s fixed to be .
  • Add back tetrad docs
  • default hackers set to yes merge tech reps, and cleanup
  • behavior for duplicates in barcodes file
  • bugfix: error reporting for barcodes within n
  • find binary from env or user entered
  • find ipcluster from conda env bin
  • bugfix: allow demux i7s even if datatype=pair3rad
  • add the notebook tunnel docs back
  • pedicularis cli tutorial updated
  • df index needed sorting
  • Allow sample names to be integers. wtf, how did this never come up before?
  • Add the McCartney-Melstad ref to the faq
  • fix typo in bpp cookbook
  • Fix bpp Params import
  • docs nav bar cleanup
  • testing binder w/o treemix
  • Add advanced tutorial back (?), maybe as a placeholder.
  • Remove references to smalt and replace with bwa. That’s some old-ass junk!
  • Fixed the versioner.py script and added the faq.rst to the newdocs

0.9.14

October 05, 2019

  • binder update
  • docs update
  • add bpp docs
  • indentation in docs
  • add i7 demux cookbook
  • mroe analysis cookbooks
  • analysis cookbooks
  • merged clustmap
  • Fix a nasty bug with stats for assemblies where chunks end up empty after filtering
  • Fix step 3 to allow some tmpchunks to be empty without raising an error during chunk aligning
  • Fix a bug in bpp.py
  • Fix a nasty error in jointestimate.stackarray() where some long reads were slipping in over the maxlen length and causing a broadcast error

0.9.13

  • py2 bug: print missing as float
  • py2 bug fix: database ordering
  • allow iterable params object for py2 and 3
  • Fix an edge case to protect against empty chunks post-filtering during step 7
  • install docs update
  • Fix CLI so merge works
  • Fix the max_shared_Hs param description to agree with only having one value, rather than 1 value perper R1/R2
  • Ooops. checked in a pdb.set_trace in write outfiles. sorry!
  • add deps to newdocs
  • bug fix for a rare trim that leaves >=1 all-N rows. Filter it.
  • documenting a hard coded backward compatibility in write_output.Processor()
  • hdf5 formatting for window slider in both denovo and ref
  • sratools up to date with CLI working too
  • Don’t pester about mpi4py if you’re not actually using MPI (CLI mode)
  • Allow for user to not input overhang sequences and jointestimate will just proceed with the edges included.
  • chunked downloads bug fix

<sunspots cause discontinuity in version history>

0.7.30

March 09, 2019

  • Fix pca for scikit 1.2.0 API and a few minor fixes.
  • Update faq.rst
  • Update faq.rst

0.7.29

January 21, 2019

  • Fix nasty ValueError bug in step 7 (re: merged PE loci)
  • Update faq.rst
  • Update faq.rst
  • Update faq.rst
  • Adding more docs
  • Starting list of papers related to assembly parameters
  • Remove ‘skip’ flags from meta.yaml, because False is default now
  • Add funcsigs dependency
  • Fix baba.py so max locus length is autodetected from the data, instead of being fixed at 300
  • Adding a nexus2loci.py conversion script which takes in a directory of nexus alignments and writes out a .loci file. This is as stupid as possible and it makes a lot of assumptions about the data, so don’t be surprised if it doesn’t work right.
  • added missing dependency on cutadapt (#314)
  • Add support for finding bins in a virtualenv environment installed with pip
  • add missing requirement: dask[array] (#313)
  • Update faq.rst
  • Update faq.rst
  • fix branching docs
  • Fix a nasty bug in sra tools if you try to dl more than 50 or 60 samples.
  • fix dox
  • Fix references to load_assembly to point to load_json
  • Removing docs of preview mode
  • Purge references to preview mode. Clean up some deprecated code blocks in demux.
  • Remove import of util.* from load, and include only the few things it needs, remove circular dependency.
  • Add docs about structure parallel runs failing silently
  • Removing the restriction on ipyparallel version to obtain the ‘IPython cluster’ tab in notebooks.
  • Adding docs about engines that die silently on headless nodes
  • Add title and save ability to pca.plot()
  • Make pca.plot() less chatty
  • Forbid nPCs < n samples
  • Update ipyrad meta.yaml to specify ipyparallel, and scikit-allel version.
  • Fix pis docs in faq
  • Update full_tutorial_CLI.rst
  • Update full_tutorial_CLI.rst
  • Update full_tutorial_CLI.rst
  • Update full_tutorial_CLI.rst
  • Adding scikit-allel dependency for pca analysis tool
  • Update cookbook-PCA-pedicularis.ipynb
  • Fix a bug that was causing _link_fastqs to fail silently.
  • fixing inconsistencies in the pedicularis CLI tutorial
  • Big update to the PCA cookbook.

0.7.28

June 18, 2018

  • Add functions for missingness, trim missing, and fill missing.
  • Adding PCA cookbook
  • pcs are now stored as pandas, also, you can specify ncomps

0.7.27

June 15, 2018

  • Add distance plot, and pca.pcs to hold coordinates per sample
  • remove some crust from pca.pywq

0.7.26

June 14, 2018

  • Adding analysis.pca
  • Allow passing in just a dict for specifying populations to _link_populations(), and assume all minsamps = 0
  • Some of step 2 docs were outdated
  • Fix stupid link
  • Adding some docs about MIG-seq.
  • Damn this cluster config mayhem is a mess.
  • Fix faq re pyzmq
  • adding docs about max_snp_locus settings
  • Fix merge conflict
  • Add docs to fix the GLIBC error
  • Docs for r1/r2 not the same length

0.7.25

May 17, 2018

  • nb showing fix for 6-7 branching
  • nb showing fix for 6-7 branching
  • fixed branching between 6-7 when using populations information
  • suppress h5py warning
  • Allow sample names to be numbers as well.

0.7.24

May 03, 2018

  • Better handling of utf-8 in sample names by default.
  • Add docs in the faq about the empty varcounts array
  • Catch an exception in sratools raised by non-existant sra directory.
  • Add HDF5 file locking fix to the faq.
  • Add docs to peddrad notebook.
  • Adding PE-ddRAD analysis notebook.
  • Add the right imports error message to the structure analysis tool.

0.7.23

February 21, 2018

  • some releasenotes fixes
  • Fix filter_min_trim_len not honoring the setting in the params file.

0.7.22

February 13, 2018

  • bug fix to bpp.py
  • updated tetrad cookbook
  • ipa: structure has max_var_multiple option, and documentation now includes it.
  • update baba cookbook
  • API user guide update
  • bug fix: allow for ‘n’ character in reftrick
  • ipa: can reload structure results, better API design for summarizing results, better documentation
  • allow subsetting in baba plot, and bug fix for generate_tests dynamic func
  • undo dumb commit
  • added –download to the docs example

0.7.21

January 23, 2018

  • Fix step 2 with imported fastq ungzipped.
  • docs update
  • update ipa structure notebook
  • update ipyparallel tutorial
  • update ipa structure notebook
  • docs updates
  • improved cleanup on sra tools
  • updated bucky cookbook
  • updated –help for sra download
  • updated docs for sra download

0.7.20

January 09, 2018

  • fixed gphocs output format
  • A note to add a feature for the future.
  • abba baba cookbook updated for new code
  • updated baba plot to work better with updated toytree
  • baba: added functions for parsing results of 5-taxon tests and improved plotting func.
  • added notes
  • added CLI command to do quick downloads from SRA. Useful for tutorials especially
  • update bpp cookbook
  • added functions to calculate Evanno K and to exlude reps based on convergence stats
  • added funcs to bpp tool to load existing results and to parse results across replicates
  • ipp jobs are submitted as other jobs finish so that RAM doesn’t fill up with queued arrays

0.7.19

November 16, 2017

  • bugfix; error was raised in no barcodes during step2 filtering for gbs data. Now just a warning is printed
  • Fixed structure conda meta.yaml
  • Fix ipcluster warning message.
  • Adding to the faq explaining stats better
  • new working meta.yaml
  • trying alternatives with setup files for jupyter conda bug fix
  • updating setup.py stuff to try to fix jupyter missing in conda install

0.7.18

November 13, 2017

  • allow user to set bpp binary path if different from default ‘bpp’
  • skip concat edits of merged reads if merge file exists unless force flag is set
  • added a progress bar tracker for reference indexing
  • speed improvement to refmapping, only tests merge of read pairs if their mapped positions overlap
  • update to docs
  • update API userguide
  • added twiist tool
  • update bpp notebook
  • tetrad bug fix for OSX users for setting thread limit
  • added check for structure path in structure.py
  • allow setting binary path and check for binary added to bpp.py
  • Update requirements.txt
  • Added to the faq how to fix the GLIBC error.
  • Fix logging of superints shape.
  • Test for samples in the populations file not in the assembly.

0.7.17

October 28, 2017

  • Properly handle empty chunks during alignment. Very annoying.

0.7.16

October 28, 2017

  • Fix SE reference bug causing lots of rm_duplicates.
  • Lowered min_se_refmap_overlap and removed useless code to recalibrate it based on filter_min_trim_len.
  • Actually fix conda package.
  • aslkfljsdjsdffd i don’t know how this shit works.
  • Fixing build still.
  • Fix typo in meta.yaml.

0.7.15

October 01, 2017

  • Fix conda build issue.

0.7.14

September 28, 2017

  • Fix orientation of R2 for pe refmap reads.
  • better error reporting, and ensure * at top of stacks
  • quickfix from last commit, keep first st seq after pop to seed in align
  • edge trim in s7 cuts at 4 or minsamp
  • added adapter-barcode order checking for cases where merged samples, and pegbs data is analyzed either as pe or forced into se.
  • update to gbs edge trimming, stricter filtering on partial overlapping seqs
  • Add a comment line to the pysam conda build to make it easier to build on systems with older glibc.
  • Updating ipyrad/__init__.py to version - 0.7.13
  • API style modifications to tetrad

0.7.13

September 05, 2017

  • API style modifications to tetrad

0.7.13

September 04, 2017

  • Add support for optional bwa flags in hackersonly.
  • Force resetting the step 6 checkpointing if step 5 is re-run.
  • fix for max_shared_Hs when a proportion instead of a fixed number. Now the proportion is applied to every locus based on teh number of samples in that locus, not the total N samples
  • access barcode from assembly not sample unless multiple barcodes per sample. Simpler.
  • added back in core throttling in demux step b/c it is IO limited
  • fix to progress bar fsck, and fix to cluster location used in step4 that was breaking if assemblies were merged between 3 and 4
  • step 6 clustering uses threading options from users for really large systems to avoid RAM limits
  • fix for progress bar printing in tetrad, and to args entry when no tree or map file
  • fix to default ncbi sratools path

0.7.12

August 28, 2017

  • update ezrad notebook
  • ezrad-test notebook up
  • Update cookbook-empirical-API-1-pedicularis.ipynb
  • big improvements to sratools ipa, now better fetch function, easier renaming, and wraps utility to reassign ncbi dump locations
  • fix for bucky bug in error reporting
  • wrote tetrad CLI to work with new tetrad object
  • rewrite of tetrad, cleaner code big speed improvements
  • allow more flexible name entry for paired data, i.e., allow _R1.fastq, or _1.fastq instead of only _R1_.fastq, etc.
  • Fixed denovo+reference assembly method.
  • update bpp cookbook
  • update bpp cookbook
  • Updating ipyrad/__init__.py to version - 0.7.11
  • removed repeat printing of error statements
  • added more warning and reports to bpp analysis tool

0.7.11

August 14, 2017

  • removed repeat printing of error statements
  • added more warning and reports to bpp analysis tool

0.7.11

August 14, 2017

  • better error checking in bucky run commandipa tools
  • added workdir default name to sra tools ipa tool
  • improved error checking in step 6
  • bugfix for VCF output where max of 2 alternative alleles were written although there could sometimes be 3

0.7.10

August 08, 2017

  • fix misspelled force option in ipa bucky tool
  • bpp ipa tool changed ‘locifile’ arg to ‘data’ but still support old arg, and removed ‘seed’ arg from run so that the only ‘seed’ arg is in paramsdict
  • bugfix to not remove nex files in bucky ipa tool

0.7.9

August 07, 2017

  • cleaner shutdown of tetrad on interrupt. Bugfix to stats counter for quartets sampled value. Cleaner API access by grouoping attributes into params attr.
  • cleanup rawedit to shutdown cleaner when interrupted
  • modified run wrapper in assembly object to allow for cleaner shutdown of ipyclient engines
  • bug fix so that randomize_order writes separate seqfiles for each rep in bpp analysis tool
  • Adding error handling, prevent tmp files being cleaned up during DEBUG, and fix tmp-align files for PE refmap.
  • Derep and cluster 2brad on both strands.
  • Actually fix refmap PE merging.
  • Fix merging for PE refmap.
  • Add a switch to _not_ delete temp files if DEBUG is on. Helpful.
  • 2 new merge functions for PE refmap. One is slowwwww, the other uses pipes, but doesn’t work 100% yet.
  • New hackersonly parameter to switch merging PE after refmap.
  • bugfix to ipa baba plotting function for updated toyplot
  • Reduce minovlen length for merging reference mapped PE reads.
  • docs update
  • docs update
  • improved design of –ipcluster flag in tetrad CLI
  • improved design of –ipcluster flag in tetrad CLI
  • improved design of –ipcluster flag in ipyrad CLI

0.7.8

July 28, 2017

  • bpp randomize-order argument bugfix
  • added .draw to treemix object
  • update tuts

0.7.7

July 27, 2017

  • Proper support for demux 2brad.

0.7.6

July 27, 2017

  • Fix very nasty refmap SE bug.
  • update tutorials – added APIs
  • update tutorials – added APIs
  • testing MBL slideshow
  • API cookbooks updated
  • cleanup of badnames in sratools

0.7.5

July 26, 2017

  • Added error handling in persistent_popen_align3
  • Catch bad seeds in step 6 sub_build_clustbits().

0.7.4

July 26, 2017

  • Actually fix the step 6 boolean mask error.
  • Fix for boolean mask array length bug in step 6.
  • add -noss option for treemix ipa
  • mods to tetrad and sratools ipa
  • ensure ints not floats for high depth base counts
  • sratools updates
  • improvements to sratools
  • added extra line ending to step7 final print statement
  • add dask to environment.yaml
  • added sratools to ipyrad.analysis

0.7.3

July 23, 2017

  • Better handling for restarting jobs in substeps of step 6.
  • Fixed the fscking pysam conda-build scripts for osx.
  • Add patch for pysam build on osx
  • Fix for conda-build v3 breaking meta.yaml
  • Using htslib internal to pysam and removing bcftools/htslib/samtools direct dependencies.
  • Add force flag to force building clusters if utemp exists.
  • conda recipe updates
  • updateing conda recipe
  • ensure stats are saved as floats
  • fix to bug introduced just now to track progress during s6 clustering
  • Fix an issue with merged assemblies and 3rad.
  • fix for step 6 checkpoints for reference-based analyses
  • conda recipe tweaking
  • conda recipe updates
  • fix to conda recipes
  • update bucky cookbook
  • added shareplot code
  • bucky ipa update remove old files
  • conda recipe updated
  • Updating ipyrad/__init__.py to version - 0.7.2
  • update conda recipe
  • update pysam to correct version

0.7.2

July 10, 2017

  • update conda recipe
  • update pysam to correct version (0.11.2.2)
  • added bucky ipa code
  • bucky cookbook up
  • automatically merges technical replicates in demux
  • check multiple barcodes in samples that were merge of technical replicates
  • fix for alleles output error
  • Added checkpointing/restarting from interrupt to step 6.
  • Added cli detection for better spacer printing.
  • bpp bug fixe to ensure full path names
  • API user guide docs update.
  • cookcook updates tetrad and treemix
  • new _cli, _checkpoint, and _spacer attributes, and new ‘across’ dir for step 6
  • load sets cli=False by default, and it saves checkpoint info
  • allow profile with ipcluster
  • treemix report if no data is written (i.e., all filtered)
  • fix to allow setting nquartets again.
  • Better integration of API/CLI.
  • Bug fix to Tree drawing when no boots in tetrad.
  • tetrad fix for compatibility with new toytree rooting bug fix for saving features.
  • cli is now an attribute of the Assembly object that is set to True by __main__ at runtime, otherwise 0.
  • cluster_info() now prints instead of return
  • rehaul of bucky ipa tools
  • print cluster_info now skips busy engines
  • unroot tetrad tree on complete

0.7.1

June 16, 2017

  • Actually handle SE reference sequence clustering.
  • Prevent empty clust files from raising an error. Probably only impacts sim data.
  • If debug the retain the bed regions per sample as a file in the refmap directory.
  • updated tunnel docs
  • HPC tunnel update
  • support for parsing supervised structure analyses in ipa
  • HPC tunnel docs update
  • update analysis docs
  • ipa.treemix params
  • more params added to ipa.treemix
  • cookbook update treemix
  • fix to conda rec
  • treemix ipa updates

0.7.0

June 15, 2017

  • put a temporary block on denovo+ref
  • added treemix ipa funcs
  • update conda recipe
  • added notebook for structure with popdata
  • updated tetrad notebook
  • update bpp notebook
  • fix missing newline in alleles
  • ipa structure file clobber fix
  • cleaner and more consistent API attr on ipa objects
  • Added docs for the -t flag.
  • fix in ipa.structure so replicate jobs to do not overwrite
  • Fix bad link in docs.
  • better method to find raxml binary in analysis tools
  • consens bugfix for new ipmlementation
  • ensure h5 files are closed after dask func
  • fix to parse chrom pos info from new consens name format
  • removed deprecated align funcs
  • removed hardcoded path used in testing
  • removed deprecated align funcs. Made it so build_clusters() does nothing for ‘reference’ method since there is a separate method in ref for chunking clusters
  • some new simpler merge funcs
  • make new ref funcs work with dag map
  • new build funcs usign pysam

0.6.27

June 03, 2017

  • Step 6 import fullcomp from util.

0.6.26

June 01, 2017

  • Step 4 - Handle the case where no clusters have sufficient depth for statistical basecalling.

0.6.25

May 30, 2017

  • Fix a bug in refmap that was retaining the reference sequence in the final clust file on rare occasions.

0.6.24

May 25, 2017

  • Bug fix for “numpq” nameerror

0.6.23

May 24, 2017

  • bug fix for numq error in s5

0.6.22

May 22, 2017

  • Fixed bug in vcf output for reference mapped.

0.6.21

May 19, 2017

  • Fix new chrom/pos mechanism to work for all assembly methods.
  • Change chroms dtype to int64. Reference sequence CHROM is now 1-indexed. Anonymous loci are -1 indexed.
  • Switch chroms dataset dtype to int64.
  • Fix for alleles output.
  • Fix nasty PE refmap merging issue.
  • Fix massive bug in how unmapped reads are handled in refmap.
  • added md5 names to derep and simplified code readability within pairmerging
  • fix for binary finder
  • added dask to conda recipe
  • added dask dependency

0.6.20

May 10, 2017

  • added dask dependency
  • vcf building with full ref info
  • bug fix to alleles output and support vcf chrompos storage in uint64
  • simpler and slightly faster consens calls and lower memory and stores chrompos as uint64s
  • chrompos now stored as uint64
  • reducing memory load in race conditions for parallel cutadapt jobs
  • Squash Cosmetic commit logs in releasenotes. Add more informative header in step 7 stats file.
  • Trying to catch bad alignment for PE in step 6.

0.6.19

May 04, 2017

  • Handle empty locus when building alleles file. Solves the ValueError “substring not found” during step 7.
  • workshop notebook uploaded

0.6.18

May 03, 2017

  • update to analysis tools
  • accepted the local bpp notebook
  • complete bpp notebook up
  • notebook updates
  • raxml docs
  • raxml cookbook up
  • docs update
  • raxml docs updated
  • links to miniconda updated
  • fix for tetrad restarting bootstraps
  • removed bitarray dependency
  • adding restart checkpoints in step6

0.6.17

April 26, 2017

  • support for alleles file in bpp tools
  • align names in alleles output
  • bugfix to name padding in .alleles output
  • slight delay between jobs
  • bpp store asyncs
  • bpp store asyncs
  • update bpp cookbook
  • testing html
  • testing html
  • new filter_adapters=3 option adds filtering of poly-repeats
  • conda recipe update for cutadapt w/o need of add-channel

0.6.16

April 25, 2017

  • alleles output now supported
  • Additional documentation for max_alleles_consens parameter.
  • support alleles output, minor bug fixes for step6, much faster alignment step6
  • lower default ‘cov’ value for vsearch within clustering in RAD/ddrad/pairddrad
  • tetrad bug, use same ipyclient for consensus tree building
  • store asyncs in the structure object
  • allow passing in ipyclient explicitly in .run() in tetrad
  • fix for time stamp issue in tetrad
  • Better testing for existence of all R2 files for merged assemblies.
  • notebook updates
  • tunnel docs update
  • updated HPC docs
  • tetrad cookbook updated
  • HPC docs update
  • bpp cookbook good to go
  • update tetrad notebook
  • missing import

0.6.15

April 18, 2017

  • Actually fix gphocs output.
  • allow passing in ipyclient in API
  • baba notebook update
  • cleaner api for bpp object
  • new analysis setup
  • updated analysis tools without ete
  • adding doc string

0.6.14

April 13, 2017

  • Fixed CHROM/POS output for reference mapped loci.

0.6.13

April 13, 2017

  • Fix gphocs output format.
  • If the user removes the population assignment file blank out the data.populations dictionary.

0.6.12

April 10, 2017

  • Prevent versioner from including merge commits in the release notes cuz they are annoying.
  • Add the date of each version to the releasenotes docs, for convenience.
  • Experimenting with adding date to releasenotes.rst
  • added more attributres to tree
  • change alpha to >=
  • tip label and node label attributes added to tree
  • tetrad ensure minrank is int
  • fix structure obj removing old files
  • lots of cleanup to baba code
  • edit to analysis docs
  • Handle pop assignment file w/o the min sample per pop line.
  • merge conflict resolved
  • bug fix for tuples in output formats json
  • sim notebook started
  • cookbook abba-baba updated
  • tetrad cookbook api added
  • added option to change line spacing on progress bar
  • major overhaul to ipyrad.analysis and plotting
  • option to buffer line spacing on cluster report
  • Removed confusing punctuation in warning message
  • Make vcf and loci output files agree about CHROM number per locus.
  • Cosmetic change to debug output.
  • Make the new debug info append instead of overwrite.
  • Fix annoying bug with output_format param I introduced recently.
  • Add platform info to default log output on startup.
  • Actually write the error to the log file on cutadapt failure.
  • Write the version and the args used to the log file for each run. This might be annoying, but it could be useful.
  • bpp randomize option added to write
  • adding bpp cookbook update
  • updating analysis tools for new bpp baba and tree
  • merge resolved
  • analysis init update for new funcs
  • apitest update
  • abba cookbook update
  • update bpp cookbook
  • small edit to HPC docs
  • tetrad formatting changing
  • updated analysis tools cookbooks
  • docs analysis page fix
  • added header to bpp convert script

0.6.11

March 27, 2017

  • Fix a bug in PE refmapping.
  • Fix error reporting if when testing for existence of the clust_database file at beginning of step 7.
  • Fix bug reading output formats from params file.
  • Add docs for dealing with long running jobs due to quality issues.
  • bug fix for output format empty
  • structure cookbook update
  • pushing analysis tools
  • svg struct plot added
  • structure cookbook updates
  • struct image added for docs
  • update structure cookbook for new code
  • Actually fix the output_format default if blank.
  • Set blank output formats in params to default to all formats.
  • Add a filter flag for samtools to push secondary alignments to the unmapped file.
  • rm old files
  • shareplot code in progress
  • work in progress baba code notebook
  • a decent api intro but bland
  • beginnings of a migrate script
  • raxml docs updated, needs work still
  • analysis docs page update
  • structure parallel wrapper scripts up in analysis
  • simplifying analysis imports
  • cleanup top imports
  • Adding support for G-PhoCS output format.
  • Fix wacky reporting of mapped/unmapped reads for PE.
  • Document why we don’t write out the alleles format currently.
  • module init headers
  • added loci2cf script
  • update structure notebook with conda recipes
  • fileconversions updated
  • loci2cf func added
  • cookbook bucky docs up
  • loci2multinex and bucky notebook updated
  • BUCKy cookbook updated
  • bucky conda recipe up
  • fix to API access hint
  • cleaner code by moving msgs to the end
  • slight modification to paired adapter trimming code
  • cleaner Class Object in baba
  • minor change to cluster_info printing in API

0.6.10

  • Filter reference mapped reads by mapq < 30, and handle the occasional malformed region string in bam_region_to_fasta.
  • Handle PE muscle failing alignment.
  • Cosmetic faq.rst changes.
  • Cosmetic docs changes.
  • Add docs for step 3 crashing because of lack of memory.
  • Catch a bug in alignment that would crop up intermittently.
  • removed the --profile={} tip from the docs
  • Fix notebook requirement at runtime error.
  • Fix formatting of output nexus file.

0.6.9

  • Changed the sign on the new hackersonly parameter min_SE_refmap_overlap.
  • added a persistent_popen function for aligning, needs testing before implementing
  • debugger in demux was printing way too much
  • bugfix for empty lines in branching subsample file
  • Add a janky version checker to nag the user.

0.6.8

  • Actually remove the reference sequence post alignment in step 3. This was BREAKING STUFF.
  • updated notebook requirement in conda recipe
  • Handle conda building pomo on different platforms.
  • Oops, we broke the versioner.py script. Now it’s fixed.
  • conda recipe updates
  • testing git lfs for storing example data

0.6.7

  • Fixed stats reported for filtered_by_depth during step 5.
  • Add new hackersonly parameter min_SE_refmap_overlap and code to refmap.py to forbid merging SE reads that don’t significantly overlap.
  • Use preprocessing selectors for linux/osx for clumpp.
  • Add url/md5 for mac binary to clumpp meta.yaml
  • conda recipes update
  • getting ipyrad to conda install on other envs
  • updating versions for conda, rtd, setup.py
  • moving conda recipes
  • conda recipe dir structure
  • bpp install bug fix
  • bpp recipe fix
  • conda recipes added
  • Roll back change to revcomp reverse strand SE hits. Oops.
  • fix merge conflict with debug messages.
  • Fix a bug in refmap, and handle bad clusters in cluster_within.
  • Actually revcomp SE minus-strand reads.
  • updated HPC docs

0.6.6

  • bug fix in building_arrays where completely filtered array bits would raise index error -1
  • tunnel docs updates
  • method docs updated to say bwa
  • some conda tips added
  • fix for name parsing of non-gzip files that was leaving an underscore
  • Allow get_params using the param string as well as the param index (see the sketch after this list)
  • Update hpc docs to add the sleep command when firing up ipcluster manually.
  • Fixed some formatting issues in the FAQ.rst.
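
As an illustration of the get_params change noted above, a hedged sketch (the parameter name and index shown are examples; check your own params file for the actual numbering):

    import ipyrad as ip

    data = ip.Assembly("example")
    data.get_params("clust_threshold")   # look up a parameter by its name
    data.get_params(14)                  # or by its index in the params file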

0.6.5

  • Fixed 2 errors in steps 3 and 4.

0.6.4

  • left a debugging print statement in the code
  • removed old bin
  • update to docs parameters
  • bug fix for merging assemblies with a mix of same named and diff named samples

0.6.3

  • Fixed a bug I introduced to assembly. Autotroll.

0.6.2

  • Fix subtle bug with migration to trim_reads parameter.

0.6.1

  • Fixed malformed nexus output file.
  • cookbook updates to docs
  • updated cookbook structure pedicularis

0.6.0

  • trim reads default 0,0,0,0. Similar action to trim loci, but applied in step 2 to raws
  • trim_reads default is 0,0
  • raise default cov/minsl for gbs data to 0.5 from 0.33
  • prettifying docs
  • pedicularis docs update v6 way way faster
  • updated tutorial
  • fixing links in combining data docs
  • updating tutorial for latest version/speed
  • added docs for combining multiple plates
  • Removed from output formats defaults (it doesn’t do anything)
  • baba cookbooks [unfinished] up
  • finally added osx QMC and fixed bug for same name and force flag rerun
  • put back in a remove tmpdirs call
  • removed a superfluous print statement
  • bug fix to mapfile, now compatible with tetrad
  • paramsinfo for new trimreads param
  • branching fix for handling new param names and upgrading to them
  • better handling of pairgbs no-barcode trimming. Now handles the --length arg
  • better handling of KBD in demux. Faster compression.
  • forgot sname var in cutadaptit_single
  • Fix step 2 for PE reads crashing during cutadapt.
  • Test for bz2 files in sorted_fastq_path and nag the user because we don’t support this format.
  • Step 1 create tmp file for estimating optim chunk size in project_dir not ./
  • Add force flag to mapreads(), mostly to save time on rerunning if it crashes during finalize_mapping. Also fixed a nasty bug in refmapping.
  • Added text to faq about why PE original RAD is hard to assemble, because people always ask.
  • Better handling of loci w/ duplicate seqs per sample.
  • Fix a bug that munged some names in branching.
  • merge conflict
  • modified for new trim param names
  • support for new trim_loci param
  • support for updated cutadapt
  • bugfix for hackerdict modify of cov
  • chrom only for paired data
  • changed two parameter names (trims)
  • tested out MPI checks
  • cutadapt upgrade allows for the --length option
  • Moved log file reset from init to main to prevent -r from blanking the log >:{
  • Moved log file reset from __init__ to __main__
  • Don’t bother aligning clusters with duplicates in step 6.
  • baba update
  • remove print statement left in code
  • same fix to names parser, better.
  • added comment ideas for chrompos in refmap
  • bug fix, Sample names were being oversplit if they had ‘.’ in them
  • test labels, improved spacing, collapse_outgroups options added to baba plots
  • Fix debug message in refmap and don’t raise on failure to parse reference sequence.
  • attempts to make better cleanup for interrupt in API
  • some cleanup to calling steps 1,2 funcs
  • speed testing demux code with single vs multicore
  • moved setting of [‘merged’] to replace filepath names to Assembly instead of main so that it also works for the API
  • added a np dict-like arr to be used in baba, maybe in ref.
  • baba plotting functions added
  • Better handling of tmpdir in step 6.
  • added baba cookbook
  • only map chrom pos if in reference mode
  • new batch and plotting functions
  • trim .txt from new branch name if accidentally added to avoid Assembly name error
  • added a name-checker to the branch-drop CLI command
  • Fixed legend on Pedicularis manuscript analysis trees.
  • Cosmetic change
  • Adding manuscript analysis tree plotting for empirical PE ddRAD refmap assemblies.
  • More or less complete manuscript analysis results.
  • Actually fix vcf writing CHROM/POS information from refseq mapped reads.
  • Handle monomorphic loci during vcf construction.
  • removed deprecated subsample option from jointestimate
  • --ipcluster method looks for default profile and cluster-id instance
  • code cleanup and faster haploid E inference
  • simplified cluster info printing
  • enforce ipyclient.shutdown at end of API run() if engine jobs are not stopped
  • code cleanup. Trying to allow better KBD in step2
  • lots of cleanup to DAG code. Now ok for individual samples to fail in step3, others will continue. Sorts clusters by derep before align chunking
  • Allow assemblies w/o chrom/pos data in the hdf5 to continue using the old style vcf position numbering scheme.
  • Don’t print the error message about samples failing step 4 if no samples actually fail.
  • Set a size= for reference sequence to sort it to the top of the chunk prior to muscle aligning.
  • Allow samples with very few reads to gracefully fail step 4.
  • Better error handling during reference mapping for PE.
  • Fix error reporting in merge_pairs().
  • Add CHROM/POS info to the output vcf file. The sorting order is a little wonky.
  • Handle empty project_dir when running -r.
  • a clean bighorse notebook run on 100 cores
  • Fix minor merge conflict in ref_muscle_chunker.
  • Use one persistent subprocess for finalizing mapped reads. Big speed-up. Also fix a stupid bug in estimating insert size.
  • Better handling of errors in merge_pairs, and more careful cleanup on error.
  • If /dev/shm exists, use it for finalizing mapped reads.
  • Handle a case where one or the other of the PE reads is empty.
  • cleaner print cpus func
  • Adding a new dataset to the catg and clust hdf5 files to store CHROM and POS info for reference mapped reads.
  • added cleanhorse notebook
  • working on notebook
  • cleanup up redundancy
  • MUCH FASTER STEP 4 using numba array building and vectorized scipy
  • MUCH FASTER MUSCLE ALIGNING. And a bug fix to a log reporter
  • bug fix to error/log handler
  • Finish manuscript refmap results analysis. Added a notebook for plotting trees from manuscript Pedicularis assembly.
  • Better checking for special characters in assembly names, and more informative error message.
  • added a test on big data
  • broken notebook
  • development notebook for baba
  • working on shareplots
  • testing caching numba funcs for faster run starts
  • added optional import of subprocess32
  • docs update
  • progress on baba
  • added option to add additional adapters to be filtered from paired data
  • Adding pairwise fst to manuscript analysis results. Begin work on raxml for manuscript analysis results.
  • Change a log message from info to warn that handles exceptions in rawedit.
  • abba baba updated
  • Fixed link in tetrad doc and cosmetic change to API docs.
  • Add comments to results notebooks.
  • Adding manuscript reference mapping results.
  • Manuscript analysis reference sequence mapping horserace updates. Stacks mostly done. dDocent started.
  • Adding ddRAD horserace nb.
  • Better cleanup during refmap merge_pairs (#211).
  • update for raxml-HYBRID
  • update raxml docs
  • cleanup old code
  • update raxml docs
  • updating raxml docs
  • update to bucky cookbook

0.5.15

  • bug fix to ensure chunk size of the tmparray in make-arrays is not greater than the total array size
  • fix for vcf build chunk error ‘all input arrays must have the same number of dimensions’. This was raised if no loci within a chunk passed filtering
  • allow vcf build to die gracefully
  • api cleanup

0.5.14

  • updated docs for popfile
  • fix for long endings on new outfile writing method
  • Made max size of the log file bigger by a zero.
  • Be nice and clean up a bunch of temporary files we’d been leaving around.
  • Better handling for malformed R1/R2 filenames.
  • api notebook update
  • more verbose warning on ipcluster error
  • allow setting ipcluster during Assembly instantiation
  • improved populations parser, and cosmetic
  • greatly reduced memory load with new func boss_make_arrays that builds the arrays into an h5 object on disk, and uses this to build the various output files. Also reduced disk load significantly by fixing the maxsnp variable bug which was making an empty array that was way too big. Also added support for nexus file format. Still needs partition info to be added.
  • CLI ipcluster cluster-id=’ipyrad-cli-xxx’ to more easily differentiate from API
  • added note on threading
  • API cleanup func names
  • write outfiles h5 mem limit work around for build-arrays
  • step 1 with sorted-fastq-path no longer creates empty fastq dirs

0.5.13

  • API user guide updated
  • Added ipyclient.close() to API run() to prevent ‘too many files open’ error.
  • Bug fix for concatenation error in vcf chunk writer
  • added smarter chunking of clusters to make for faster muscle alignments
  • closed many subprocess handles with close_fds=True
  • added closure for open file handle
  • cleanup of API attributes and hidden funcs with underscores

0.5.12

  • Refmap: actually fix clustering when there are no unmapped reads.
  • Updated docs for parameters.

0.5.11

  • Refmap: Handle case where all reads map to reference sequence (skip unmapped clustering).
  • More refined handling of reference sequences with wacky characters in the chrom name like | and (. Who would do that?
  • Raxml analysis code added to Analysis Tools: http://ipyrad.readthedocs.io/analysis.html
  • HPC tunneling documentation updated with more troubleshooting
  • Better handling of final alignments when they contain merged and unmerged sequences (#207)
  • added finetune option to loci2bpp Analysis tools notebook.
  • More improvements to manuscript analysis.
  • Finished simulated analysis results and plotting.
  • Improve communication if full raw path is wonky.
  • Horserace is complete for simulated and empirical. Continued improvement to gathering results and plotting.

0.5.10

  • Fix for 3Rad w/ only 2 cutters during filtering.
  • Better handling for malformed 3rad barcodes file.

0.5.9

0.5.8

  • improved progress bar
  • merge fix
  • notebook testing geno build
  • Fix to memory handling on vcf build, can now handle thousands of taxa. Also, now saves filepaths to json and API object.
  • progress on dstats package
  • More progress on manuscript horserace. Analysis is done, now mostly working on gathering results.

0.5.7

  • Fix error handing during writing of vcf file.

0.5.6

  • notebook testing
  • purge after each step to avoid memory spillover/buildup
  • better handling of memory limits in vcf build. Now producing geno output files. Better error reporting when building output files
  • added a global dict to util
  • new smaller limit of chunk sizes in h5 to avoid memory limits
  • analysis docs update
  • Document weird non-writable home directory on cluster issues.
  • docs update for filtering differences
  • merge fix
  • tetrad notebook edits
  • dstat calc script editing
  • Added code to copy barcodes during assembly merge. Barcodes are needed for all PE samples in step 2.

0.5.5

  • Better handling for PE with loci that have some merged and some unmerged reads.
  • Allow other output formats to try to build if vcf fails.
  • Fixed bug that was forcing creation of the vcf even if it wasn’t requested.

0.5.4

  • More improved handling for low/no depth samples.
  • Better handling for cleanup of samples with very few reads.

0.5.3

  • Catch sample names that don’t match barcode names when importing demux’d pair data.
  • Serious errors now print to ipyrad_log.txt by default.

0.5.2

  • Handle sample cleanup if the sample has no hidepth clusters.
  • Fix for declone_3rad on merged reads.
  • Better support for 3rad linking presorted fastqs.
  • bucky cookbook updated
  • dstat code updates
  • bucky cookbook uploaded

0.5.1

  • added tetrad docs
  • make tetrad work through API
  • added tetrad notebook

0.5.0

  • Swap out smalt for bwa inside refmapping. Also removes reindexing of reference sequence on -f in step 3.
  • fix for array error that was hitting in Ed’s data, related to 2X count for merged reads. This is now removed.
  • bug fix for 4/4 entries in vcf when -N at variable site.
  • prettier printing of stats file

0.4.9

  • fix for array error that was hitting in Ed’s data, related to 2X count for merged reads. This is now removed.
  • bug fix for 4/4 entries in vcf when -N at variable site.
  • prettier printing in s5 stats file
  • hotfix for large array size bug introduced in 0.4.8

0.4.8

  • bug fix to measure array dims from mindepth settings, uses statistical for s4, and majrule for s5
  • adding bwa binary for mac and linux
  • improved N removal from edges of paired reads with variable lengths
  • new parsing of output formats, and fewer defaults
  • only SNPs in the vcf is the new default. Added pair support but still need to decide on spacer default. New cleaner output-formats stored as a tuple
  • small fix for better error catching
  • new hidepth_min attr to save the mindepth setting at the time when it is used
  • mindepth settings are now checked separately from other parameters before ‘run’ to see if they are incompatible. Avoids race between the two being compared individually in set-params.
  • new functions in steps 3-5 to accomodate changes to mindepth settings so that clusters-hidepth can be dynamically recalculated
  • fix to SSH tunnel docs
  • hotfix for step5 sample save bug. pushed too soon

0.4.7

  • make compatible with changes to s6
  • allow sample to fail s2 without crashing
  • cleaner progress bar and enforced maxlen trimming of longer reads
  • lowered maxlen addon, enforced maxlen trimming in singlecat
  • updates to docs
  • testing new maxlen calculation to better accommodate messy variable-length paired data sets.
  • update to docs about pre-filtering
  • temporary fix for mem limit in step 6 until maxlen is more refined
  • Fix bug in refmap.

0.4.6

  • Nicely clean up temp files if refmap merge fails.

0.4.5

  • Add docs for running ipcluster by hand w/ MPI enabled (see the example after this list).
  • Fix PE refmap bug #148
  • Documenting PYTHONPATH bug that crops up occasionally.
  • Adjusted fix to bgzip test.
  • Fixed a bug w/ testing for bgzip reference sequence. Also add code to fix how PE ref is handled to address #148.
  • fix for last fix
  • fix for last push gzip
  • collate with io.BufferedWriter is faster
  • faster collating of files
  • Continuing work on sim and empirical analysis.
  • rev on barcode in step2 filter pairgbs
  • faster readcounter for step1 and fullcomp on gbs filter=2 barcode in step2
  • tunnel docs update
  • working on a SSH tunnel doc page
  • Handle OSError in the case that openpty() fails.
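
A hedged example of firing up ipcluster by hand with MPI enabled, along the lines of the docs referenced above (the core count and sleep time are arbitrary; see the HPC docs for the full treatment):

    ipcluster start --n=40 --engines=MPI --daemonize
    sleep 60    # give the MPI engines time to register before launching ipyrad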

0.4.4

  • Handle blank lines at the top of the params file.

0.4.3

  • making smoother progress bar in write vcfs
  • bugfix for jointestimate
  • testing bugfixes to jointestimate
  • default to no subsampling in jointestimate call
  • testing bugfixes to jointestimate
  • added hackersonly option for additional adapters to be filtered
  • bug fix to joint H,E estimate for large data sets introduced in v.0.3.14 that was yielding inflated rates.
  • fix for core count when using API
  • Added plots of snp depth across loci, as well as loci counts per sample to results notebook.
  • phylogenetic_invariants notebook up
  • some notes on output formats plans
  • removed leftjust arg b/c unnecessary and doesn’t work well with left trimmed data

0.4.2

  • Merging for Samples at any state, with warning for higher level states. Prettier printing for API. Fix to default cores setting on API.
  • fix for merged Assemblies/Samples for s2
  • fix for merged Assemblies&Samples in s3
  • removed limit on number of engines used during indexing
  • Added ddocent to manuscript analysis.
  • tutorial update
  • in progress doc notebook
  • parallel waits for all engines when engines are designated, up until timeout limit
  • parallelized loading demux files, added threads to _ipcluster dict, removed print statement from save
  • vcf header was missing
  • added step number to progress bar when in interactive mode
  • added warning message when filter=2 and no barcodes are present
  • improved kill switch in step 1
  • use select to improve cluster progress bar
  • added a CLI option to fine-tune threading (see the example after this list)
  • added dstat storage by default
  • new default trim_overhang setting and function (0,0,0,0)
  • fix for overzealous warning message on demultiplexing when allowing differences

0.4.1

  • Fixed reference before assignment error in step 2.

0.4.0

  • Cosmetic change
  • new sim data and notebook up
  • Added aftrRAD to the manuscript analysis horserace
  • made merging reads compatible with gzipped files from step2
  • modify help message
  • made TESTS global var, made maparr bug fix to work with no map info
  • More carefully save state after completion of each step.
  • limit vsearch merging to 2 threads to improve parallel, but should eventually make match to cluster threading. Added removal of temp ungzipped files.
  • more detailed Sample stats_df.s2 categories for paired data
  • made merge command compatible with gzip outputs from step2
  • simplified cutadapt code calls
  • updates to simdata notebook
  • merge conflict fix
  • new stats categories for step2 results
  • added adapter seqs to hackersdict
  • much faster vcf building
  • new step2 quality checks using cutadapt
  • small changes to use stats from new s2 rewrite. Breaks backwards compatibility with older assemblies at step3
  • massive rewrite of cluster across, faster indexing, way less memory overhead
  • just added a pylint comment
  • Adding cutadapt requirement for conda build
  • Suppress numpy mean of empty slice warnings.
  • Merged PR from StuntsPT. Fix to allow param restriction_overhang with only one enzyme to drop the trailing comma (,).
  • Merge branch ‘StuntsPT-master’
  • Adding a FAQ to the docs, including some basic ipyparallel connection debugging steps.
  • Adding documentation for the CLI flag for attaching to already running cluster.
  • Update docs to include more specifics about ambiguous bases in restriction overhang seqs.
  • Get max of max_fragment_length for all assemblies during merge()
  • Make gbs a special case for handling the restriction overhang.
  • Changed the way single value tuples are handled.
  • cleaning up releasenotes
  • added networkx to meta.yaml build requirements

0.3.42

  • always prints cluster information when not using ipcluster[profile] = default
  • broke and then fixed samtools sorting on mac (BAM->bam)
  • better error message at command line
  • cleaned code base, deleting deprecated funcs.
  • revcomp function bug fix to preserve lower case pair splitter nnnn for pairgbs data
  • Adding requirement for numba >= 0.28 to support
  • Updating mac and linux vsearch to 2.0
  • docs updates (pull request #186) from StuntsPT/master
  • Added a troubleshooting note.
  • wrapped long running proc jobs so they can be killed easily when engines are interrupted
  • fix for API closing ipyclient view
  • fix for piping in subprocess
  • bug fix for missing subprocess module for zcat, and new simplified sps calls.
  • merge fix
  • allow for fuzzy match characters in barcode path
  • new simulated data set
  • uploaded cookbook for simulating data
  • no longer register ipcluster to die at exit, but rather call shutdown explicitly for CLI in the finally call of run()
  • massive code cleanup in refmapping, though mostly cosmetic. Simplified file paths and calls to subprocess.
  • massive restructuring to organize engine jobs in a directed acyclic graph to designate dependencies to ipyparallel. Lots of code cleanup for subprocess calls.
  • fix for progress bar cutting short in step 6. And simplified some code calling tmpdir.
  • Adding notebooks for ipyrad/pyrad/stacks simulated/empirical horserace.
  • Better handling for mindepth_statistical/majrule. Enforce statistical >= majrule.
  • Allow users with SE data to only enter a single value for edit_cutsites.
  • Properly finalize building database progress bar during step 6, even if some samples fail.
  • allow max_indels option for step 3 in API. Experimental.
  • bug fix to indel filter counter. Now applies in step7 after ignoring terminal indels, only applies to internal indels
  • much faster indexing using sorted arrays of matches from usort. Faster and more efficient build clusters func.
  • rewrote build_clusters func to be much faster and avoid memory limits. Other code cleanup. Allow max_indel_within option, though only in API currently.
  • numba update requirement to v.0.28

0.3.41

  • Reverting a change that broke cluster_within

0.3.40

  • Set vsearch to ignore max phred q score on merging pairs
  • Added bitarray dependency to conda build

0.3.39

  • Fix vsearch fastq max threshold arbitrarily high. Also remove debug crust.

0.3.38

  • Handle samples with few reads, esp the case where there are no matches during clustering.
  • Handle samples with few or no high depth reads. Just ignore them and inform the user.

0.3.37

  • Fix to allow pipe character in chrom names of reference sequences
  • Tweak to calculation of inner mate distance (round up and cast to int)
  • Refmap: fix calc inner mate distance PE, handle samples w/ inner mate distance > max, and handle special characters in ref seq chromosome names
  • Add a test to forbid spaces in project directory paths
  • Cosmetic docs fix
  • Cosmetic fix to advanced CLI docs
  • Added more explicit documentation about using the file to select samples during branching
  • Clarifying docs for qscore offset in the default params file
  • Cosmetic change to docs
  • Rolling back changes to build_clusters

0.3.36

  • hotfix for edgar fix break
  • hotfix for memory error in build_clusters, need to improve efficiency for super large numbers of hits
  • more speed testing on tetrad
  • merge conflict
  • cleaner print stats for tetrad
  • finer tuning of parallelization tetrad

0.3.35

  • Handled bug with samtools and gzip formatted reference sequence
  • Fixed a bug where CLI was not honoring -c flag
  • debugging and speed tests
  • added manuscript dir
  • Update on Overleaf.
  • Manuscript project created
  • speed improvements to tetrad
  • smarter/faster indexing in tetrad matrix filling and speed up from skipping over invariant sites
  • finer tuning of bootstrap restart from checkpoint tetrad
  • print bigger trees for tetrad
  • fix to printing checkpoint info for tetrad
  • bug fix for limiting n cores in tetrad
  • made an extended majority rule consensus method for tetrad to avoid importing big packages just for this.
  • testing timeout parallel
  • test notebook update
  • adding consensus mj50 function

0.3.34

  • new --ipcluster arg allows using a running ipcluster instance that has profile=ipyrad (see the example after this list)
  • temporary explicit printing during ipcluster launch for debugging
  • also make longer timeout in _ipcluster dict of Assembly object
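
A hedged sketch of the workflow this enables (the profile name follows the note above; the core count is arbitrary):

    ipcluster start --n=20 --profile=ipyrad --daemonize
    ipyrad -p params-data.txt -s 3 --ipcluster    # attach to the running instance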

0.3.33

  • temporary explicit printing during ipcluster launch for debugging
  • also make longer timeout in _ipcluster dict of Assembly object
  • increased timeout for ipcluster instance from 30 seconds to 90 seconds
  • Added sample populations file format example
  • quick api example up
  • merge conflict
  • removed chunksize=5000 option
  • Update README.rst

0.3.32

  • Fix optim chunk size bug in step 6 (very large datasets overflow hdf5 max chunksize 4GB limit)
  • Doc update: Cleaned up the lists of parameters used during each step to reflect current reality.
  • Fixed merge conflict in assembly.py
  • Fix behavior in step 7 if requested samples and samples actually ready differ
  • Removing references to deprecated params (excludes/outgroups)
  • Simple error handling in the event no loci pass filtering
  • changed tetrad default mode to MPI
  • release notes update

0.3.31

  • changed name of svd4tet to tetrad
  • improved message gives info on node connections for MPI
  • added a test script for continuous integration
  • big cleanup to ipcluster (parallel) setup, better for API/CLI both
  • modified tetrad ipcluster init to work the same as ipyrad’s
  • generalized ipcluster setup

0.3.30

  • Changed behavior of step 7 to allow writing output for all samples that are ready. Allows the user to choose whether to continue or quit.
  • Fixed very stupid error that was not accurately tracking max_fragment_length.
  • Better error handling on malformed params file. Allows blank lines in params (prevents that gotcha).
  • Cosmetic changes to step 7 interaction if samples are missing from db
  • prettier splash
  • edited splash length, added newclient arg to run
  • testing MPI on HPC multiple nodes
  • updating docs parameters

0.3.29

  • Temp debug code in jointestimate for tracking a bug
  • Step 5 - Fixed info message for printing sample names not in proper state. Cosmetic but confusing.

0.3.28

  • Added statically linked binaries for all linux progs. Updated version for bedtools and samtools. Updated vsearch but did not change symlink (ipyrad will still use 1.10)
  • Bugfix that threw a divide by zero error if no samples were actually ready for step 5

0.3.27

  • Fixed a race condition where sometimes last_sample gets cleaned up before the current sample finishes, caused a KeyError. Very intermittent and annoying, but should work now

0.3.26

  • fix merge conflict
  • removed future changes to demultiplex, fixed 1M array size error
  • added notes todo
  • removed unnecessary imports
  • removed backticks from printouts
  • removed backticks from printouts
  • removed unnecessary ‘’ from list of args
  • code cleanup for svd4tet
  • update to some error messages in svd4tet
  • slight modification to -n printout
  • updated analysis docs
  • minor docs edits
  • updated releasenotes

0.3.25

  • better error message if sample names in barcodes file have spaces in them (see the example file after this list)
  • VCF now writes chr (‘chromosomes’ or ‘RAD loci’) as ints, since vcftools and other software hate strings apparently
  • fix for concatenating multiple fastq files in step2
  • fix for cluster stats output bug
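
As a reminder of the expected format, a made-up barcodes file; sample names and barcodes are whitespace-separated, and the change above improves the error shown when a sample name itself contains a space:

    sample_1    TTAGGCAC
    sample_2    CCTGATAG
    sample_3    AAGGTTCA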

0.3.24

  • added nbconvert as a run dependency for the conda build

0.3.23

  • svd4tet load func improved
  • fixed bug with floating point numbers on weights. More speed improvements with fancy matrix tricks.
  • added force support to svd4tet
  • update releasenotes
  • added stats storage to svd4tet
  • loci bootstrap sampling implemented in svd4tet
  • init_seqarray rearrangement for speed improvement to svd4tet
  • removed svd and dstat storage attributes from Assembly Class
  • added a plink map output file format for snps locations
  • further minimized depth storage in JSON file. Only saved here for a quick summary plot. Full info is in the catg file if needed. Reduces bloat of JSON.
  • huge rewrite of svd4tet with Quartet Class Object. Much more concise code
  • big rearrangement to svd4tet CLI commands
  • code cleanup

0.3.22

  • only store cluster depth histogram info for bins with data. Removes hugely unnecessary bloat to the JSON file.
  • fixed open closure
  • massive speed improvement to svd4tet funcs with numba jit compiled C code
  • added cores arg to svd4tet

0.3.21

  • new defaults - lower maxSNPs and higher max_shared_Hs
  • massive reworking with numba code for filtering. About 100X speed up.
  • reworking numba code in svd4tet for speed
  • added debugger to svd4tet
  • numba compiling some funcs, and view superseqs as ints instead of strings gives big speedups
  • fix to statcounter in demultiplex stats
  • improvement to demultiplexing speed
  • releasenotes update
  • minor fix to advanced tutorial
  • updated advanced tutorial
  • forgot to rm tmpdir when done
  • testing s6

0.3.20

  • bug fix for max_fragment_len errors for paired data and gbs
  • fix for gbs data variable cluster sizes.
  • prettier printing, does not explicitly say ‘saving’, but it’s still doing it.
  • numba update added to conda requirements
  • Wrote some numba compiled funcs for speed in step6
  • New numba compiled svd func can speed up svd4tet
  • update to analysis tools docs

0.3.19

  • fix for bug in edge trimming when assembly is branched after s6 clustering, but before s7 filtering

0.3.18

  • Better error handling for alignment step, and now use only the consensus files for the samples being processed (instead of glob’ing every consens.gz in the working directory).
  • Fix a bug that catches when you don’t pass in the -p flag for branching
  • cleaning up the releasenotes

0.3.17

  • removed the -i flag from the command line.
  • fix for branching when no filename is provided.
  • Fix so that step 6 cleans up as jobs finish. This fixes an error raised if a dummy job finishes too quick.
  • removed a redundant call to open the allhaps file
  • Added a check to ensure R2 files actually exist. Error out if not. Updated internal doc for link_fastq().
  • tmp fix for svd4tet test function so we can put up this hotfix

0.3.16

  • working on speed improvements for svd4tet. Assembly using purging cleanup when running API.
  • fix for KeyError caused by cleanup finishing before singlecats in step6
  • update to empirical tutorial

0.3.15

  • write nexus format compatible with ape in svd4tet outputs.
  • closing pipe was causing a stall in step6.

0.3.14

  • merge conflict fix
  • set subsample to 2000 high depth clusters. Much faster, minimal decrease in accuracy. Slightly faster code in s4.
  • better memory handling. Parallelized better. Starts non-parallel cleanups while singlecats are running = things go faster.
  • cluster was commented out in s6 for speed testing

0.3.13

  • Replaced direct call to vsearch with ipyrad.bins.vsearch
  • Fixed reference to old style assembly method reference_sub
  • Added ability to optionally pass in a flat file listing subsample names in a column (see the example after this list).
  • Set a conditional to make sure params file is passed in if doing -b, -r, or -s
  • Softened the warning about overlapping barcodes, and added a bit more explanation
  • Set default max barcode mismatch to 0
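
A hedged example of branching with such a file (one sample name per line; the filename and branch name here are hypothetical):

    ipyrad -p params-data.txt -b subset-branch names_to_keep.txt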

0.3.12

  • Fixed infinite while loop inside __name_from_file

0.3.11

  • Fixed commented call to cluster(), step 6 is working again
  • Added a check to ensure barcodes contain only IUPAC characters
  • Fixed demultiplex sorting progress bar
  • append data.name to the tmp-chunks directory to prevent users from running multiple step 1 jobs and stepping on themselves
  • Update README.rst
  • Added force flag for merging CLI
  • Bug in rawedit for merged assemblies
  • much faster indel entry in step6
  • chunks size optimization
  • optimizing chunk size step6
  • merge for lowmem fixes to step6
  • decided against right anchoring method from rad muscle alignments. Improved step6 muscle align progress bar
  • reducing memory load in step6
  • debug merge fix
  • improvement to debug flag. Much improved memory handling for demultiplexing

0.3.10

  • versioner now actually commits the releasenotes.rst

0.3.9

  • Versioner now updates the docs/releasenotes.rst
  • Eased back on the language in the performance expectations note
  • fixed all links to output formats file
  • blank page for recording different performance expectations

0.3.5

  • Added -m flag to allow merging assemblies in the CLI

0.2.6

  • Fix to SNP masking in the h5 data base so that stats counts match the number of snps in the output files.

0.1.39

  • Still in development

0.1.38

  • Still in development.
  • Step7 stats are now building. Extra output files are not.
  • New better launcher for Clients in ipyparallel 5

0.1.37

  • conda installation mostly working from ipyrad channel

Citation

Eaton DAR & Overcast I. “ipyrad: Interactive assembly and analysis of RADseq datasets.” Bioinformatics (2020).