Primary output files
- Compressed binary archive (GGR)
Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes, including variant calls, recombination detection, and phylogenetic trees.
Tools
Citation
Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biology, 15 (11), 1-15.
Download (v1.1, 6-Feb-2015)
Rapid core genome multi-alignment
Project home page: https://github.com/marbl/parsnp
Parsnp was designed to align the core genome of hundreds to thousands of bacterial genomes within a few minutes to a few hours. Input can be draft assemblies, finished genomes, or both, and output includes variant (SNP) calls, a core genome phylogeny and multi-alignments. Parsnp leverages contextual information provided by multi-alignments surrounding SNP sites for filtration/cleaning, in addition to existing tools for recombination detection/filtration and phylogenetic reconstruction.
Contents:
MUMi-based recruitment is useful for quickly identifying clades of closely related genomes from a genomic DB.
- However, it is conservative and can under-recruit genomes.
- To force inclusion of all genomes in a given directory, use the -c flag.
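The recruitment rule can be illustrated with a short Python sketch. The genome names and distance values are made up, the 0.01 cutoff is the value quoted in the FAQ later in this document (Parsnp may instead derive an automatic cutoff from the MUMi distribution), and the function is a hypothetical helper, not Parsnp's own code.

```python
def recruit_genomes(mumi_distances, max_distance=0.01, curated=False):
    """Return the genome names that would be kept for alignment.

    mumi_distances: mapping of genome name -> MUMi distance to the reference.
    max_distance:   cutoff, analogous to -i (0.01 is an assumed value).
    curated:        analogous to -c; keep every genome regardless of distance.
    """
    if curated:
        return sorted(mumi_distances)
    return sorted(name for name, dist in mumi_distances.items()
                  if dist <= max_distance)


# Hypothetical distances, for illustration only.
distances = {"strainA.fna": 0.002, "strainB.fna": 0.009, "strainC.fna": 0.030}

print(recruit_genomes(distances))                # strainC.fna is left out
print(recruit_genomes(distances, curated=True))  # all three genomes are kept
```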
The precompiled binaries are created using PyInstaller and packaged as a single file, which is extracted to a temporary directory when run.
The most likely cause of failure is a lack of free space in the /tmp directory. By default, PyInstaller searches a standard list of directories and sets tempdir to the first one in which the calling user can create files.
The list of directories where the archive will be extracted:
- The directory named by the TMPDIR environment variable.
- The directory named by the TEMP environment variable.
- The directory named by the TMP environment variable.
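To check which directory would be chosen and how much space it has, a small Python sketch can help. Python's standard tempfile module follows the same search order as the list above, so it is used here as a stand-in; whether the packaged binary resolves the directory in exactly the same way is an assumption.

```python
import os
import shutil
import tempfile


def temp_dir_report():
    """Return the relevant environment variables, the temp directory
    Python would select, and the free space (in bytes) on its volume."""
    env_candidates = {var: os.environ.get(var) for var in ("TMPDIR", "TEMP", "TMP")}
    chosen = tempfile.gettempdir()
    free_bytes = shutil.disk_usage(chosen).free
    return env_candidates, chosen, free_bytes


env_candidates, chosen, free_bytes = temp_dir_report()
for var, value in env_candidates.items():
    print(f"{var}={value!r}")
print(f"temp directory: {chosen} ({free_bytes / 1e9:.1f} GB free)")
```

If the reported free space is low, pointing TMPDIR at a directory on a larger volume before launching the binary is a reasonable first thing to try.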
Parsnp is distributed as a precompiled binary that should be devoid of external dependencies (all included in dist). The three steps below represent the fastest way to start using the software:
OSX:
- wget https://github.com/marbl/parsnp/releases/download/v1.1/parsnp-OSX64-v1.1.tar.gz
- tar -xvf parsnp-OSX64-v1.1.tar.gz
Linux:
- wget https://github.com/marbl/parsnp/releases/download/v1.1/parsnp-Linux64-v1.1.tar.gz
- tar -xvf parsnp-Linux64-v1.1.tar.gz
From command-line:
parsnp -p <threads> -d <directory of genomes> -r <ref genome>
Parsnp quick start for three example scenarios.
With reference & genbank file:
parsnp -g <reference_replicon1,reference_replicon2,..> -d <genome_dir> -p <threads>
NOTE:
- Genbank files are currently expected to have GI numbers for indexing. This means custom Genbank files (not downloaded from NCBI) will not have annotations appear in Gingr, though the alignment should still work. The dependency on GIs is expected to change in future versions.
- GenBank files can only be specified for the reference genome.
- -g and -r are mutually exclusive; you can either provide a fasta file for your reference genome, or GenBank file, but not both.
- All non-reference genomes are captured with the -d parameter. These genomes must be in fasta format and located within the specified directory.
With reference but without genbank file:
parsnp -r <reference_genome> -d <genome_dir> -p <threads>
Autorecruit reference to a draft assembly:
parsnp -q <draft_assembly> -d <genome_db> -p <threads>
Input/output:
-c = <flag>: (c)urated genome directory, use all genomes in dir and ignore MUMi? (default = NO)
-d = <path>: (d)irectory containing genomes/contigs/scaffolds
-r = <path>: (r)eference genome (set to ! to pick random one from genome dir)
-g = <string>: Gen(b)ank file(s) (gbk), comma separated list (default = None)
-o = <string>: output directory? default [./P_CURRDATE_CURRTIME]
-q = <path>: (optional) specify (assembled) query genome to use, in addition to genomes found in genome dir (default = NONE)
MUMi:
-U = <float>: max MUMi distance value for MUMi distribution
-M = <flag>: calculate MUMi and exit? overrides all other choices! (default: NO)
-i = <float>: max MUM(i) distance (default: autocutoff based on distribution of MUMi values)
MUM search:
-a = <int>: min (a)NCHOR length (default = 1.1*Log(S))
-C = <int>: maximal cluster D value? (default=100)
-z = <int>: min LCB si(z)e? (default = 25)
LCB alignment:
-D = <float>: maximal diagonal difference? Either percentage (e.g. 0.2) or bp (e.g. 100bp) (default = 0.12)
-e = <flag>: greedily extend LCBs? experimental! (default = NO)
-n = <string>: alignment program (default: libMUSCLE)
-u = <flag>: output unaligned regions? .unaligned (default: NO)
Recombination filtration:
-x = <flag>: enable filtering of SNPs located in PhiPack identified regions of recombination? (default: NO)
Misc:
-h = <flag>: (h)elp: print this message and exit
-p = <int>: number of threads to use? (default= 1)
-P = <int>: max partition size? limits memory usage (default= 15000000)
-v = <flag>: (v)erbose output? (default = NO)
-V = <flag>: output (V)ersion and exit
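When Parsnp is driven from a script, it can be convenient to assemble the argument list programmatically and enforce the rule that -g and -r are mutually exclusive before anything is launched. The sketch below is a hypothetical helper, not part of Parsnp; it only uses flags listed above and returns a list that could be handed to subprocess.run.

```python
def build_parsnp_command(genome_dir, reference=None, genbank_files=(),
                         query=None, threads=1, curated=False, output_dir=None):
    """Assemble a parsnp argument list from the documented options."""
    if reference and genbank_files:
        raise ValueError("-r and -g are mutually exclusive")
    cmd = ["parsnp", "-d", genome_dir, "-p", str(threads)]
    if reference:
        cmd += ["-r", reference]
    if genbank_files:
        # -g takes a comma-separated list of GenBank files
        cmd += ["-g", ",".join(genbank_files)]
    if query:
        cmd += ["-q", query]
    if curated:
        cmd.append("-c")
    if output_dir:
        cmd += ["-o", output_dir]
    return cmd


# Paths below mirror the MERS tutorial; adjust them to your own data.
print(" ".join(build_parsnp_command("./mers49",
                                    genbank_files=["./ref/EMC_2012.gbk"],
                                    curated=True)))
```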
For your convenience, precompiled binaries are provided at:
http://github.com/marbl/harvest
Otherwise, to install from source:
git clone https://github.com/marbl/parsnp.git parsnp_src
cd parsnp_src
Before you start: on OSX Mavericks, Clang does not support OpenMP, so the source will not build with the default compiler. You will need to install gcc with OpenMP support. This can be accomplished in a couple of ways:
Install Macports, then:
- sudo port install gcc49
- sudo port select gcc mp-gcc49
(or) Install Homebrew, then:
- brew install gcc49
(or) Build & install gcc from source with OpenMP
Download & install gcc 4.9
(or) Download & install gcc prebuilt binaries with OpenMP support
Final suggestion: If issues persist, we recommend using the precompiled binary until OpenMP is natively supported by Clang/OSX (likely to be so in Yosemite)
Once OpenMP support is added, the first (required!) step is to build libMUSCLE:
cd muscle
./autogen.sh
./configure --prefix=`pwd` CXXFLAGS='-fopenmp'
make install
Then, build Parsnp:
cd ..
./autogen.sh
./configure
make install
Once both are installed (by default they install into the current working directory):
export PARSNPDIR=/path/to/parsnp/install
First (important!), build libMUSCLE:
cd muscle
./autogen.sh
./configure --prefix=/usr/local/
sudo make install
Then, build Parsnp:
cd ..
./autogen.sh
./configure --prefix=/usr/local/
sudo make install
To further demonstrate the functionality of Parsnp we have prepared two small tutorial datasets. The first dataset is a MERS coronavirus outbreak dataset involving 49 isolates. The second dataset is a selected set of 31 Streptococcus pneumoniae genomes. Both of these datasets should run on modestly equipped laptops in a few minutes.
49 MERS Coronavirus genomes
Download genomes:
Run parsnp with default parameters
./parsnp -g ./ref/EMC_2012.gbk -d ./mers49 -c
Command-line output
- Visualize with Gingr GGR
Configure parameters
95% of the reference is covered by the alignment. This is <100% mainly due to a 1kbp unaligned region from 26kbp to 27kbp.
To force alignment across large collinear regions, use the -C option, which sets the maximum distance between two collinear MUMs:
./parsnp -g ./ref/EMC_2012.gbk -d ./mers49 -C 1000 -c
Visualize again with Gingr GGR
Zoom in with Gingr for nucleotide view of region
Inspect Output:
31 Streptococcus pneumoniae genomes
Download genomes:
Run parsnp
./parsnp -r ./strep31/NC_011900.fna -d ./strep31 -p <num threads>
Command-line output:
Force inclusion of all genomes (-c)
./parsnp -r ./strep31/NC_011900.fna -d ./strep31 -p <num threads> -c
Command-line output:
- Parsnp takes both draft and finished genomes of closely related strains as input, performs conservative core genome alignment and as output returns multi-alignments (XMFA), variants (VCF), core genome phylogeny (Newick) and Gingr input format (GGR).
- The main advantages of Parsnp over alternative approaches are robust filtration of variant (SNP) calls, multiple alignments as output and superior speed. Parsnp can align 200-300 bacterial strains in <30 minutes on a 16-core server and ~1000 in a couple of hours.
- If you are interested in pan genome/whole genome alignment, existing tools that perform well include Mauve and Mugsy, among others. In addition, Parsnp is tailored for intraspecific genome analysis (outbreak analysis of a pathogen, etc.). One main limitation of Parsnp is that it cannot handle subsets (core genome only) and is not as sensitive as existing methods.
- Gingr (http://github.com/marbl/gingr) can open Parsnp output and provide an interactive display of multi-alignments, variants and the phylogenetic tree estimated from the core genome alignment.
- Within the log output, there are coverage values listed that individually indicate the percentage of a given genome that is included in the core genome alignment. Note, this includes the Muscle aligned-regions plus the maximal unique matches (MUMs). The core genome alignment size can then be calculated by multiplying the coverage value, for a given genome, by its length.
- Parsnp is a conservative core genome alignment method that necessarily requires that all genomes are present in each aligned region. The focus is on aligning 1000s of closely related bacterial strains quickly while maintaining sensitivity comparable to existing WGA methods. In addition, the core genome has been shown to contain as few as 30-40% of the gene content (even in very closely-related clades) due to reductive genome evolution and/or a large accessory genome (with plenty of IS/phage elements). However, for increased sensitivity w.r.t. aligned regions, and alignments containing subsets, both Mugsy and Mauve are terrific tools for the job.
- Since the core necessarily includes all genomes, the choice of reference does not matter (with a couple of important exceptions, continue reading). Feel free to use the parameter '-r !' to randomly select a reference if you are feeling indecisive or if all of the genomes contained in the genome directory are of similar quality. Typically, finished/closed genomes are used as the reference strain to ensure they are high-quality and do not contain assembly artifacts or contaminants.
- By default, parsnp calculates the MUMi distance between the reference and each of the genomes in the genome directory. All genomes with MUMi distance <= 0.01 are included; all others are discarded. To force all genomes present in the genome directory to be included, add '-c' as a command-line parameter.
- The goal of parsnp is to capture all informative signals found in the core genome of the specified clade of interest. Any SNPs in regions not shared by all genomes will not be reported. Additionally, any SNP found in a likely poorly aligned region is discarded. Finally, parsnp does not perform LCB extension and therefore may miss SNPs appearing at the end of locally conserved blocks or clusters.
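The coverage arithmetic mentioned above (coverage value multiplied by genome length) can be written out as a short Python sketch. It assumes the coverage value is reported as a percentage; the numbers are illustrative and not taken from a real log.

```python
def core_alignment_size(coverage_percent, genome_length_bp):
    """Estimate the aligned core size (bp) for one genome from the
    coverage value in the log and the genome length."""
    if not 0 <= coverage_percent <= 100:
        raise ValueError("coverage must be a percentage between 0 and 100")
    return round(coverage_percent * genome_length_bp / 100)


# Illustrative values only: 95% coverage of a 30,000 bp genome.
print(core_alignment_size(95.0, 30_000))
```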
Interactive visualization of alignments, trees and variants
Gingr is an interactive tool for exploring large-scale phylogenies in tandem with their corresponding multi-alignments. Gingr can display informative overviews for hundreds or thousands of genomes, while allowing researchers to move quickly to more detailed views of specific subclades and genomic regions, even down to the nucleotide level of their multi-alignments. Additionally, its dynamic display of variants allows interactive selection of various filters, such as indels, poorly aligned regions and suspected sites of recombination. Gingr works chiefly in tandem with Parsnp, an efficient tool for core-genome multi-alignment and phylogenetic reconstruction. It is also applicable, however, to other analytical tools, accepting standard file formats such as multi-Fasta, XMFA, Newick and VCF.
Download (v1.2)
Documentation
- The Gingr binary should work with most recent (within ~5 years) versions of common Linux distributions, e.g.:
- CentOS (6+)
- Ubuntu (9+)
- Fedora (10+)
- ...and many others
- If the Gingr binary does not work on a particular distribution, it may be possible to build from source
- gcc 4.8+ is required for building
- Right click on Gingr.app
- Select “Open” from the menu
- Click the “Open” button at the next prompt
- Click on the “gingr” binary
- Navigate to the folder with the “gingr” binary
- Run "./gingr"
The flowchart below describes the various file formats that can be imported or exported to/from Gingr (or the harvesttools command line utility).
Resources
Archiving and postprocessing
HarvestTools is a utility for creating and interfacing with Gingr files, which are efficient archives that the Harvest Suite uses to store reference-compressed multi-alignments, phylogenetic trees, filtered variants and annotations. Though designed for use with Parsnp and Gingr, HarvestTools can also be used for generic conversion between standard bioinformatics file formats.
Download (v1.2)
Documentation
- harvest-tools VCF output represents indels in a non-standard format:
- Indels are currently column based, not row based. Excluding indel rows (the default behavior) converts the file into valid VCF format.
- This will be updated in a future version.
- Genbank annotation input:
- Multiple Genbank files must be specified with multiple -g parameters
- In addition, genome identifier (gi) must match the header of the reference fasta file
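The rule that each GenBank file needs its own -g parameter is easy to get wrong in scripts. The sketch below is a hypothetical Python helper that assembles a harvesttools argument list, repeating -g once per file; the file names are placeholders and the function is not part of harvest-tools.

```python
def build_harvesttools_command(xmfa, reference_fasta, genbank_files=(),
                               newick=None, output="out.ggr"):
    """Assemble a harvesttools argument list for creating a Gingr file."""
    cmd = ["harvesttools", "-x", xmfa, "-f", reference_fasta]
    for gbk in genbank_files:
        # one -g per GenBank file, rather than a comma-separated list
        cmd += ["-g", gbk]
    if newick:
        cmd += ["-n", newick]
    cmd += ["-o", output]
    return cmd


# Placeholder file names, for illustration only.
print(" ".join(build_harvesttools_command(
    "parsnp.xmfa", "ref.fna",
    genbank_files=["replicon1.gbk", "replicon2.gbk"],
    newick="parsnp.tree")))
```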
harvest-tools is distributed as a precompiled binary. The three steps below represent the fastest way to start using the software:
- wget https://github.com/marbl/harvest-tools/releases/download/v1.2/harvesttools-OSX64-v1.2.tar.gz
- tar -xvf harvesttools-OSX64-v1.2.tar.gz
- wget https://github.com/marbl/harvest-tools/releases/download/v1.2/harvesttools-Linux64-v1.2.tar.gz
- tar -xvf harvesttools-Linux64-v1.2.tar.gz
From command-line:
harvesttools -x <input xmfa> -f <input reference fasta> -g <reference genbank formatted annotations> -n <newick formatted tree>
harvest-tools quick start for three example scenarios.
With reference & genbank file as input:
harvesttools -g <reference_genbank_file1> -f <reference fasta file> -x <XMFA file> -o hvt.ggr
With harvest-tools file as input, XMFA output:
harvesttools -i input.ggr -X output.xmfa
With harvest-tools file as input, fasta formatted SNP file as output:
harvesttools -i input.ggr -S output.snps
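A common follow-up to exporting SNPs is a pairwise SNP distance table. The Python sketch below assumes the -S output is an ordinary multi-FASTA file with one equal-length record per genome; the sequences are made up and the helper names are hypothetical.

```python
from itertools import combinations


def read_fasta(text):
    """Parse FASTA text into a dict of name -> sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def pairwise_snp_distances(seqs):
    """Count differing columns for every pair of genomes.

    Every mismatching character counts, including N or gap symbols.
    """
    distances = {}
    for a, b in combinations(sorted(seqs), 2):
        if len(seqs[a]) != len(seqs[b]):
            raise ValueError(f"{a} and {b} differ in length")
        distances[(a, b)] = sum(x != y for x, y in zip(seqs[a], seqs[b]))
    return distances


# Made-up SNP columns for three genomes.
example = ">g1\nACGT\n>g2\nACGA\n>g3\nTCGA\n"
for pair, dist in pairwise_snp_distances(read_fasta(example)).items():
    print(pair, dist)
```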
- -i <Gingr input>
- -b <bed filter intervals>,<filter name>,"<description>"
- -B <output backbone intervals>
- -f <reference fasta>
- -F <reference fasta out>
- -g <reference genbank>
- -a <MAF alignment input>
- -m <multi-fasta alignment input>
- -M <multi-fasta alignment output (concatenated LCBs)>
- -n <Newick tree input>
- -N <Newick tree output>
- --midpoint-reroot (reroot the tree at its midpoint after loading)
- -o <Gingr output>
- -S <output for multi-fasta SNPs>
- -u 0/1 (update the branch values to reflect genome length)
- -v <VCF input>
- -V <VCF output>
- -x <xmfa alignment file>
- -X <output xmfa alignment file>
- -h (show this help)
- -q (quiet mode)
Example VCF output file:
##FILTER=<ID=IND,Description="Column contains indel">
##FILTER=<ID=N,Description="Column contains N">
##FILTER=<ID=LCB,Description="LCB smaller than 200bp">
##FILTER=<ID=CID,Description="SNP in aligned 100bp window with < 50% column % ID">
##FILTER=<ID=ALN,Description="SNP in aligned 100b window with > 20 indels">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT England1.fna.ref Al-Hasa_18_2013.fna Al-Hasa_1_2013.fna England-Qatar_2012.fna KSA-CAMEL-376.fna NC_019843.2.fna
gi|471258596|gb|KC164505.2| 1603 GTACTATGTA.CTTTGTGCCT C T 40 PASS NA GT 0 0 0 0 1 0
gi|471258596|gb|KC164505.2| 1684 GGAACAAGGT.CACTCAAATT C T 40 PASS NA GT 0 0 0 0 1 0
gi|471258596|gb|KC164505.2| 2502 ATATTCCCAT.CGGGAACCTA C T 40 PASS NA GT 0 0 0 0 1 0
gi|471258596|gb|KC164505.2| 3275 TTCTCATGAG.ATTTCTGACG A G 40 PASS NA GT 0 1 1 0 1 0
gi|471258596|gb|KC164505.2| 4396 TTCAAGCAGG.GAGTGTCGTG G T 40 PASS NA GT 0 1 1 0 0 0
. . .
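To show how the columns in this output relate to the genomes, here is a minimal Python sketch that reads two of the records above and lists which genomes carry the alternate allele. It splits on whitespace, which works for this excerpt but is an assumption (VCF files are normally tab-delimited); the function names are hypothetical.

```python
SAMPLE_VCF = """\
##FILTER=<ID=IND,Description="Column contains indel">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT England1.fna.ref Al-Hasa_18_2013.fna Al-Hasa_1_2013.fna England-Qatar_2012.fna KSA-CAMEL-376.fna NC_019843.2.fna
gi|471258596|gb|KC164505.2| 1603 GTACTATGTA.CTTTGTGCCT C T 40 PASS NA GT 0 0 0 0 1 0
gi|471258596|gb|KC164505.2| 3275 TTCTCATGAG.ATTTCTGACG A G 40 PASS NA GT 0 1 1 0 1 0
"""


def parse_vcf(text):
    """Return one dict per variant record, keyed by column meaning."""
    samples, records = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split()
        if fields[0] == "#CHROM":
            samples = fields[9:]  # sample names follow the nine fixed columns
            continue
        if len(fields) != 9 + len(samples):
            continue  # skip anything that is not a complete record
        calls = [int(c) if c.isdigit() else None for c in fields[9:]]
        records.append({
            "chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[3],
            "alt": fields[4].split(","),
            "filter": fields[6],
            "genotypes": dict(zip(samples, calls)),
        })
    return records


def samples_with_variant(record):
    """Names of the genomes whose genotype is a non-reference allele."""
    return [name for name, call in record["genotypes"].items() if call]


for rec in parse_vcf(SAMPLE_VCF):
    print(rec["pos"], rec["ref"], rec["alt"], samples_with_variant(rec))
```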
- Autoconf ( http://www.gnu.org/software/autoconf/ )
- Protocol Buffers ( https://code.google.com/p/protobuf/ )
- Zlib ( http://www.zlib.net/ )
For your convenience, precompiled binaries are provided at:
http://github.com/marbl/harvest
Otherwise, to install from source:
git clone https://github.com/marbl/harvest-tools.git hvt_src
cd hvt_src
./bootstrap.sh
./configure [--prefix=...] [--with-protobuf=...]
make
[sudo] make install
command line tool ( <prefix>/bin/harvest )
static library ( <prefix>/lib/libharvest.a )
- includes
- Harvest, Protocol Buffer message class ( <prefix>/include/harvest.pb.h )
- HarvestIO, Harvest wrapper with file IO ( <prefix>/include/HarvestIO.h )
When running ./configure, use --prefix to install somewhere other than /usr/local. Use --with-protobuf if the Protocol Buffer libraries are not in their default location (/usr/local). Zlib should be installed in a standard system location. Sudo will be necessary for 'make install' if write permission is lacking in the install location.
Resources