Virus AsseMbly Pipeline (VAMP) documentation¶
Contents:
Introduction¶
The Virus AsseMbly Pipeline (VAMP) is a set of tools designed to assist with reference guided assembly of viral genomes from paired-end Illumina sequence data.
The pipeline portion of VAMP has been replaced by VirGA (paper in progress, Moriah Szpara and Lance Parsons).
The main tools used by VirGA are maf_net.py and compare_genomes.py. There are a number of other tools which may be useful.
Installation¶
Install Prequisites¶
- Cython
- pip install cython
Install VAMP¶
Download VAMP from https://bitbucket.org/szparalab/vamp/downloads
- Unarchive into directory
- tar xzvf vamp-x.x.tar.gz.
- Install VAMP (plus python dependencies):
- cd vamp-x.x
- python setup.py install
- (optional) Build documentation
- cd docs
- make html
Notes¶
- OSX
- XCode and Command Line Utilities must be installed prior to installing many of the required tools for VAMP. See the MacPorts XCode installtion instructions for more information.
Non-root users¶
If you are not root or just want to install this locally, one option is to use the --user parameter when installing. e.g.:
pip install --user cython python setup.py install --user
Usage¶
Align Contigs to Reference¶
The first step is to align contigs to a reference genome and output the result in a MAF formatted file. There are many options for alignment tools, however, we have had success with Mugsy, a very fast multiple whole genome alignment tool.
Stitch Together Scaffold¶
Once you have aligned the contigs to the reference, the next step is to stitch together the various alignment blocks into a scaffold. The maf_net.py utility does this by reassembling the reference sequence from the MAF blocks and using the highest scoring block for each location in the genome to assemble a scaffold genome.
Genome Annotation and Comparison¶
Once a draft of a genome has been completed, it can be useful to migrate annotations from an annotated reference to the new genome. In addition, this step generates a summary of the changes at the nucleic acid as well as amino acid level.
Run compare_genomes.py to migrate annotations and generate a list of differences between two species. The script requires an aligned fasta file (typically use the one generated from the previous scaffold stitching step) and a GFF file of features (genes, exons, etc.) to migrate.
The coding sequences can be checked by translating them to protein sequences using translate_cds.py. Translation errors such as missing start or stop codons, extra stop codons, etc. will be printed to STDERR.
Commands¶
compare_genomes.py¶
Compares genomes using an aligned fasta file and migrates annotations from a reference to the other sequences in the alignment
Usage:
compare_genomes.py [-r REFERENCE] [--align_format FORMAT] [-o PREFIX]
[--gff_feature_types GFF_FEATURE_TYPES]
[--gff_attributes GFF_ATTRIBUTES] [-v] [--version]
[-h]
aligned_fasta gene_gff
Required Arguments:
aligned_fasta An aligned fasta file
gene_gff An gff file with features to be migrated
-r REFERENCE, --reference REFERENCE
Sequence id of reference sequence in aligned fasta
file
Optional Arguments:
--align_format FORMAT
Alignment format (default: fasta)
-o PREFIX, --output PREFIX
Output prefix (default: compare_genomes_output/)
--gff_feature_types GFF_FEATURE_TYPES
Comma separated list of gff feature types toparse
(default: CDS,exon,gene,mRNA,stem_loop)
--gff_attributes GFF_ATTRIBUTES
Comma separated list of feature attributes tocarry
over (default: ID,Parent,Note,gene,function,product)
-v, --verbose verbose output
--version show program's version number and exit
-h, --help show this help message and exit
fastq_to_fasta.py¶
Convert a FASTQ file to a FASTA file
Usage:
fastq_to_fasta.py [-h] [-w WRAP] [-v] [--version] fastq_file fasta_file
Required Arguments:
fastq_file
fasta_file
Optional Arguments:
-h, --help show this help message and exit
-w WRAP, --wrap WRAP Maximum length of lines, 0 means do not wrap (default:
0)
-v, --verbose verbose output
--version show program's version number and exit
find_contig_deletions.py¶
Find contigs with deletions from the contig composition file output from compare_genomes.py
Usage:
find_contig_deletions.py [-h] [-o OUTPUT_DIR] [-q] [-v] [--version]
contig_composition aligned_fasta contigs_fasta
Find contigs with deletions from the contig composition file output from compare_genomes.py
Required Arguments:
contig_composition Contig composition file output from compare_genomes.py
aligned_fasta Aligned FASTA file
contigs_fasta Contigs FASTA file
Optional Arguments:
-h, --help show this help message and exit
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Directory to store output files, default is
aligned_fasta directory
-q, --quiet Quiet, replace all deletions found, no prompts
-v, --verbose verbose output
--version show program's version number and exit
gff2gtf_simple.py¶
Simple conversion of GFF files to GTF files.
Usage:
gff2gtf_simple.py [-h] [-v] [--version] gff_file
Required Arguments:
gff_file GFF file to convert
Optional Arguments:
-h, --help show this help message and exit
-v, --verbose verbose output
--version show program's version number and exit
maf_net.py¶
Output an aligned fasta file by stitching together a specified reference sequence in the MAF file and using the highest scoring block for each section.
Usage:
maf_net.py [-r REFERENCE] [-c CHROMOSOME] [-s SPECIES] [-o OUTPUT_DIR]
[--consensus_sequence] [--reference_fasta REFERENCE_FASTA]
[-v] [--version] [-h]
maf_file
Required Arguments:
maf_file MAF file to stitch together
-r REFERENCE, --reference REFERENCE
Reference species (e.g. scerevisiae)
-c CHROMOSOME, --chromosome CHROMOSOME
Sequence ID of the chromosome for which to generate
the alignment net (e.g. chrI)
-s SPECIES, --species SPECIES
List of species to include, comma separated (e.g.
scerevisiae,sbayanus)
Optional Arguments:
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Directory to store output file, default is maf file
directory
--consensus_sequence Output "consensus sequence" for each species in files
named [species].[chromosome].consensus.fasta
--reference_fasta REFERENCE_FASTA
Check MAF file against this fasta (for
troubleshooting, debugging)
-v, --verbose verbose output
--version show program's version number and exit
-h, --help show this help message and exit
Output:
Aligned Fasta File: BASENAME.net.afa
This file contains an aligned fasta file created by stitching together MAF blocks based on the reference sequence. Where two blocks overlap, the higher scoring block is used.
Optional Output (one per species):
Consensus Sequence: SPECIES.consensus.fasta
A FASTA file containing the consensus sequence for this species. N’s in the sequence represent sections where no contigs mapped to a section of the reference (i.e. potential gaps in the scaffold).
Consensus Contig Composition GFF: SPECIES.consensus_contig_composition.gff
GFF formatted file describes intervals in the SPECIES genome. The attributes contain information about the contigs used to determine the sequence in this interval. The attributes are:
- src_seq
- src_seq_start
- src_seq_end
- src_strand
- src_size
- maf_block
- block_start
- block_end
- ref_src
- ref_start
- ref_end
- ref_strand
Consensus Contig Composition Summary: SPECIES.consensus_contig_composition_summary.txt
Tab delimited file with the following columns that describes intervals in the SPECIES genome and the contigs that were used for the sequence.
- seq - sequence id of the interval in the SPECIES genome
- start - start position of the interval
- end - end position of the interval
- contig - contig id that was used to “build” this interval. If None, that means no contig was found for the analogous region in the reference.
- contig_start - the start position of the contig that aligned to this start interval
- contig_end - the end position of the contig that aligned to the end position of this interval
- contig_strand - the direction that the contig aligned to the reference (if ‘-‘, the reverse complement of the contig aligned to the reference in this interval)
- contig_size - the full size of the contig (including those bases that did not aligned to this interval)
makePairedOutput2EQUALfiles_vamp.pl¶
Modified versions of scripts provided by SSAKE. They are used to prepare two separate paired end fastq files for use by SSAKE. The modifications made were to accommodate new Illumina style sequence identifiers introduced with CASAVA 1.8.:
Usage: makePairedOutput2EQUALfiles_vamp.pl <fasta file 1> <fasta file 2> <library insert size>
--- ** Both files must have the same number of records & arranged in the same order
makePairedOutput2UNEQUALfiles_vamp.pl¶
See makePairedOutput2EQUALfiles_vamp.pl:
Usage: makePairedOutput2UNEQUALfiles_vamp.pl <fasta file 1> <fasta file 2> <library insert size>
--- files could have different # of records & arranged in different order but template ids must match
TQSfastq_vamp.py¶
Preforms quality trimming as per the original SSAKE script. It was modified to accommodate larger, zipped fastq files.
Usage:
TQSfastq_vamp.py [options]
Optional Arguments:
-h, --help show this help message and exit
-f FASTQFILE, --fastq file=FASTQFILE
Sanger encoded fastq file - PHRED quality scores,
ASCII+33
-t THRESHOLD, --Phred quality threshold=THRESHOLD
Base intensity threshold value (Phred quality scores 0
to 40, default=10)
-c CONSEC, --consec=CONSEC
Minimum number of consecutive bases passing threshold
values (default=20)
-v, --verbose Runs in Verbose mode.
-q, --qualities Outputs Qualities to FASTQ file (default is FASTA)
-z, --zip Compress output with gzip
-o OUTPUT_BASE, --output=OUTPUT_BASE
Output filename base
translate_cds.py¶
Extracts the coding sequences (CDS) regions from a fasta reference and gff file and translates them into amino acid sequences, output in FASTA format to STDOUT
Usage:
translate_cds.py [--notrans] [-i IDATTR] [-t FEATURETYPE]
[--table TABLE] [-v] [--version] [-h]
gff_file fasta_file
Required Arguments:
gff_file GFF file containing CDS records to be translated
fasta_file FASTA file containing the nucleotide sequences
referenced in the GFF file
Optional Arguments:
--notrans Do not translate to amino acid sequence, output DNA
-i IDATTR, --idattr IDATTR
GFF attribute to use as gene ID. Features with the
same ID will be considered parts of the same gene. The
default "gene_id" is suitable for GTF files.
-t FEATURETYPE, --featuretype FEATURETYPE
GFF feature type(s) (3rd column) to be used. Specify
the option multiple times for multiple feature types.
The default is "CDS" for GFF files and "CDS" and
"stop_codon" for GTF files.
--table TABLE NCBI Translation table to use when translating DNA
(see http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprint
gc.cgi). Default: 1.
-v, --verbose verbose output
--version show program's version number and exit
-h, --help show this help message and exit
Modules¶
vamp.utils¶
Utilities for working with multiple sequence alignments and MAF objects
Copyright 2012, 2013, 2014 Lance Parsons <lparsons@princeton.edu> All rights reserved.
BSD 2-Clause License http://www.opensource.org/licenses/BSD-2-Clause
- class vamp.utils.ContigComposition¶
Represents the composition of one interval by another interval.
Association of two genomic intervals, used to represent the composition of one interval by another.
- seq str¶
Sequence id of the interval being described
- start str¶
The start position of the interval being described (1-based)
- end str¶
The end position of the interval being described (1-based)
- contig str¶
The sequence id of the second interval
- contig_start str¶
The start position of the second interval (1-based)
- contig_end str¶
The end position of the second interval (1-based)
- strand str¶
The strand (‘+’ or ‘-‘) of the interval being described
- contig_size str¶
The complete length of the sequence of the second interval.
- static tab_headings()¶
Static method that retuns a tab delimited string of header names
Returns: A tab delimited string representing the headers in the order used by the to_tab() method. Return type: string
- to_tab()¶
Return a tab delimited string of the ContigComposition
Returns: A tab delimited string representing the ContigComposition. Return type: string
- vamp.utils.find_deletions(contig_composition_list, verbose=False)¶
Find contigs with deletions.
From a contig composition list, find contigs that have deleted sections. When a contig has deleted sections, the pieces of the contig may be replaced by a contiguous section. This function returns tuples containing the indices of the contigs pieces to be replaced along with a replacement ContigComposition object consisting of the contiguous section.
e.g.:
([2, 3], {'seq': 'chr', 'start': 3, 'end': 5, 'contig': 'contig1', 'contig_start': 5, 'contig_end': 10, 'strand': '+', 'contig_size': 20})
indicates that we may wish to replace contig_composition_list[2:3] with the new ContigComposition specified.
Parameters: - contig_composition_list (list) – A list of ContigComposition objects.
- verbose (bool, optional) – If true, output additional debug info (default is False).
Returns: A list of tuples with the indices of ContigComposition objects in the list to be replaced along with replacement ContigComposition objects.
Return type: list
- vamp.utils.get_block_by_label(maf_filename, label)¶
Return the MAF block with the specified label
Parameters: - maf_filename (string) – The name of the MAF file.
- label (string) – The label of the block in the MAF file to search for.
Returns: The first block found in the MAF file that has the given label
Return type: block
- vamp.utils.get_sequence_length_from_maf(maf_file, reference_species, sequence_id)¶
Return length of the reference_species.sequence_id
Parameters: - maf_filename – The filename of the MAF file.
- reference_species – The name species used as the reference.
- sequence_id – The sequence_id used as the reference. The format of sequence names in the MAF file is assumed to be ‘species.sequence_id’ (e.g. ‘scerevisiae.chrI’)
Returns: The length of the specified sequence in the first component containing that sequence in the MAF file, or None if no matching componenets were found in the MAF file.
Return type: integer
- vamp.utils.get_sequence_net_alignment(maf_filename, reference_species, sequence_id, species, verbose=False)¶
Return the alignment created by stitching MAF blocks together
Stitches MAF blocks together along an entire reference sequence (including gaps). For regions covered by more than one block, the highest scoring block is used.
Parameters: - maf_filename – The filename of the MAF file.
- reference_species – The name species used as the reference.
- sequence_id – The sequence_id used as the reference. The format of sequence names in the MAF file is assumed to be ‘species.sequence_id’ (e.g. ‘scerevisiae.chrI’)
- species – A list of the species names to be returned
- verbose (bool, optional) – If True, print debug information (default: False)
Returns: A tuple containing a Bio.Align.MutlipleSeqAlignment object and a list of intervals. The multiple sequence alignments contains each the alignment of each species from the MAF file created by stitching blocks together based on the specified reference sequence The list of intervals is relative to the alignment that indicate the MAF block, block start, and block end of the source of that piece of the alignment.
Return type: tuple
- vamp.utils.get_vamp_home()¶
Return the directory where the VAMP module is installed
- vamp.utils.read_contig_composition_summary(filename)¶
- Generator that reads a contig composition summary file and returns
- attributes.
Parameters: filename (string) – The name of contig composition summary file as output by compare_genomes.py. Yields: ContigComposition – A ContigComposition object for each line in the contig composition summary file.
- vamp.utils.replace_alignment_with_block(alignment, block, reference_species, sequence_id, verbose=False)¶
Update the multiple sequence alignment with the specified MAF block
Use the MAF block alignment to replace the appropriate section of the given multiple sequence alignment by using the specified reference species and sequence as guide
Parameters: - block (maf block) – MAF block
- reference_species (str) – The name species used as the reference.
- sequence_id (str) – The sequence_id used as the reference. The format of sequence names in the MAF file is assumed to be ‘species.sequence_id’ (e.g. ‘scerevisiae.chrI’)
- verbose (bool, optional) – If True, print debug information (default: False)
Returns: The updated alignment and a Pybedtools interval of the section of the alignment that was replaced. The interval contains the following attributes: maf_block; block_start; block_end which indicate the MAF block label and start and end position on the block used in the replacement
Return type: tuple
- vamp.utils.subtract_intervals(interval1, interval2)¶
Subtract two pybedtools intervals, return list of resulting intervals
Parameters: - interval1 – A pybedtools interval
- interval2 – A pybedtools interval to subtract from interval1
Returns: A list of pybedtools intervals that contain the region(s) of interval1 that are not overlapped by interval2
Return type: list
- vamp.utils.summarize_contig_composition(interval_list, src_tag, start_tag, end_tag, strand_tag, source_size_tag)¶
Summarize the contig composition of a stitched MAF file.
Parameters: - interval_list (list) – A list of Pybedtools interval objects
- src_tag (string) – The attribute containing the contig name
- start_tag (string) – The attribute containing the start position in the contig
- end_tag (string) – The attribute containing the end position in the contig
- strand_tag (string) – The attribute containing the strand
- source_size_tag (string) – The attribute containing the contig size
Returns: A list of dictionaries with the following keys: (seq, start, end, contig, contig_start, contig_end, strand, contig_size)
Return type: list
- vamp.utils.update_contig_composition_summary(contig_composition_summary, replacements)¶
Update list of ContigComposition objects with replacements.
Replacements are a list of tuples containing a list of indices of contigs to be replaced along with replacements. The replacements must be non-overlapping and sorted.
Parameters: - contig_compostion_summary (list) – List of dictionaries as returned by summarize_contig_composition()
- replacements (list) – A list of ContigCompostion objects
- Retuns:
- list: A updated contig composition summary
- vamp.utils.update_sequence_with_replacements(seq, replacements, replacement_seq_dict)¶
Update Seq object with replacements.
The replacements specified by ContigComposition objects and must be non-overlapping and sorted.
Parameters: - seq (Bio.Seq) – The sequence object to be updated.
- replacements (list) – A list ContigComposition objects
- replacement_seq_dict (dict) – A dictionary to the replacement sequences.
Returns: Bio.Seq: An updated sequence object with replacements made
- vamp.utils.verify_maf_fasta(maf_filename, reference_species, fasta_filename, verbose=False)¶
Verify the consistency between the sequence in a MAF and a FASTA file
Checks all compenents in all blocks of the MAF file for the specified species and checks that the sequence matches that in the FASTA file.
Parameters: - maf_filename (string) – The name of the MAF file.
- reference_species (string) – The species to select from the MAF file.
- fasta_filename (string) – The name of the FASTA file to check against.
- verbose (bool, optional) – If true, output additional debugging info (default is False).
Returns: Prints to STDOUT if there is a mismatch.
Return type: None
seq_utils.convert_coordinates¶
Convert coordinates from GFF or BED file using multi-fasta alignments
- seq_utils.convert_coordinates.find_aligned_position(gap_positions, pos)¶
Update position by adding preceeding gaps
Parameters: - gap_positions (list) – list of gaps (must include start and end methods to return the start and end of a gap, typically they are re.MatchObjects)
- pos (int) – the position to adjust by adding preceding gaps
Returns: The new position, accounting for preceeding gaps
Return type: int
seq_utils.fasta_from_gff¶
Extract fasta sequences from regions defined in GFF/BED file and output fasta to stdout
seq_utils.summarize_alignments¶
Summarize the differences between sequences in an aligned FASTA file.
This script will output summarize the differences between sequences in an aligned FASTA file.
Usage:
summarize_alignments.py aligned_fasta reference_sequence [-h,--help]
[-v,--verbose] [--version]
- seq_utils.summarize_alignments.main()¶
Runs summary_of_alignment function on input files from the command line.
- seq_utils.summarize_alignments.mismatch_string(mismatches)¶
Generate a string from a list of mismatches.
Parameters: mismatches (list) – A list of mismatches. A mismatch is a dictionary with a position (pos), reference genotype (ref), and alternate genotype (alt). Returns: A comma separated string of the mismatches Return type: string
- seq_utils.summarize_alignments.parse_event(event, reference_sequence, alternate_sequence)¶
Parse an event (sequence of differences) for VCF output.
Parse a simple event with reference_position, reference_base, and new_base and determine the type and add padding if necessary (for VCF compatibility)
Parameters: - event (dictonary) – An event has at least a position (pos), reference genotype (ref), and aternate genotype (alt). May also have a flag indicating if it is a snp (snp).
- reference_sequence (str) – The complete reference sequence
- alternate_sequence (str) – The complete alternate sequence
Returns: An event with additional padding to the start of the variant and an added type attribute, for VCF compatibility
Return type: dictonary
- seq_utils.summarize_alignments.summary_of_alignment(alignment, reference_sequence_id)¶
Summarizes changes in given alignment
Parameters: - alignment (Bio.AlignIO object) – Alignment object
- reference_index (int) – index of the reference sequence in alignment (default is 1)
Returns: - A dctionary with a key for each non-reference sequence
in the alignment
Each entry is another dictionary with the following keys:
match_count: The number of matching bases
mismatch_count: The number of mismatching bases, including indels
mismatches: list of mismatches by base: RefBase(RefPos)NewBase
- contiguous_change_count: the number of contiguous change
“events”
Return type: dictionary
seq_utils.utils¶
Utility classes and methods for working with sequence data
- seq_utils.utils.convert_interval_gapped_to_nongapped(seq, start, end)¶
Take position with gaps and return position without gaps
Uses 0-based positions
Parameters: - seq (str) – sequence string (with gaps included)
- start (int) – starting position of interval (including gaps)
- end (int) – ending postion of interval (including gaps)
Returns: (start, end) the start and end positions after removing gaps in the sequence
Return type: tuple
- seq_utils.utils.convert_interval_nongapped_to_gapped(seq, start, end, include_end_gaps=False)¶
Take position without gaps and return position with gaps
Uses 0-based positions
Parameters: - seq (str) – sequence string (with gaps added)
- start (int) – starting position of interval (excluding gaps)
- end (int) – ending postion of interval (excluding gaps)
- include_end_gaps (bool, optional) – if true, include gap positions that directly follow the end positions in the new interval, default is False and such end positions are not included
Returns: (start, end) the start and end positions after accouting for gaps in the sequence
Return type: tuple