micca - MICrobial Community Analysis¶
micca (MICrobial Community Analysis) is a software pipeline for the processing of amplicon sequencing data, from raw sequences to OTU tables, taxonomy classification and phylogenetic tree inference. The pipeline can be applied to a range of highly conserved genes/spacers, such as 16S rRNA gene, Internal Transcribed Spacer (ITS) 18S and 28S rRNA. micca is an open-source, GPLv3-licensed software.
Key features:
- supports single-end (Roche 454, Illumina MiSeq/HiSeq ,Ion Torrent) and overlapping paired-end reads (Illumina MiSeq/HiSeq);
- multithread de novo greedy, closed-reference, open-reference and swarm OTU picking protocols;
- denoising of Illumina reads;
- state-of-the-art taxonomic classification algorithms (RDP and consensus-based classifier);
- fast and and memory efficient NAST multiple sequence alignment (MSA);
- filters low quality sequences according to the maximum allowed expected error (EE) rate %;
- runs on Linux, Mac OS X and MS Windows (through Docker containers)
- simple, easy to use.
Docker images are available (compmetagen/micca) starting from version 1.2.2, see the documentation (>=1.3.0) to learn how to use them. Docker hub page.
How to cite: Davide Albanese, Paolo Fontana, Carlotta De Filippo, Duccio Cavalieri and Claudio Donati. MICCA: a complete and accurate software for taxonomic profiling of metagenomic data. Scientific Reports 5, Article number: 9743 (2015), doi:10.1038/srep09743, Link. Dataset download: ftp://ftp.fmach.it/metagenomics/micca/scirep/.
micca wraps third party software packages and these should be cited if they are used:
- VSEARCH (doi: 10.7717/peerj.2584) used in
classify
,filter
,mergepairs
,otu
andmsa
commands - MUSCLE (doi: 10.1093/nar/gkh340) used in
msa
andtree
commands - FastTree (doi: 10.1371/journal.pone.0009490) used in the
tree
command - Cutadapt (doi: 10.14806/ej.17.1.200) used in the
trim
command - RDP classifier (doi: 10.1128/AEM.00062-07) used in the
classify
command - swarm (doi: 10.7717/peerj.1420) used in the
otu
command
Install¶
Using Docker (that is, on MS Windows, Mac OS X and Linux!)¶
The easiest way to run micca is through Docker. Docker works similarly to a virtual machine image, providing a container in which all the software has already been installed, configured and tested.
Run the
Docker Quickstart Terminal
(Mac OS X, Windows) or thedocker
daemon (Linux,sudo service docker start
).Download the latest version:
docker pull compmetagen/miccaRun an instance of the image, mounting the host working directory (e.g.
/Users/davide/micca
) on to the container working directory/micca
:docker run --rm -ti --user $(id -u):$(id -g) -v /Users/davide/micca:/micca -w /micca compmetagen/micca /bin/bashYou need to write something like
-v //c/Users/davide/micca:/micca
if you are in Windows or-v /home/davide/micca:/micca
in Linux. The--rm
option automatically removes the container when it exits.Now you can use micca:
name@68f6784e1101:/micca# micca -h
Note
The RDP classifier is preinstalled in the Docker image, so you can check the
software version by typing echo $RDPPATH
Using pip¶
At the moment, only Python 2.7 is supported.
On Ubuntu >= 12.04 and Debian >=7¶
We suggest to install the following packages through the package manager:
sudo apt-get update
sudo apt-get install build-essential python-numpy gcc gfortran python-dev libblas-dev liblapack-dev cython pkg-config libfreetype6 libfreetype6-dev libpng-dev
Then, upgrade pip and install setuptools:
pip install --upgrade pip
pip install 'setuptools >=14.0'
Finally, install micca:
sudo pip install micca
On Mac OS X¶
In Mac OS X, we recommend to install Python from Homebrew:
Install the GNU Fortran and the NumPy package:
brew install gcc
pip install numpy
Finally, install micca:
sudo pip install micca
Installation problems¶
BIOM fatal error: ‘numpy/arrayobject.h’. If the installation process returns a message like this:
biom/_filter.c:258:10: fatal error: 'numpy/arrayobject.h' file not found #include "numpy/arrayobject.h" ^ 1 error generated. error: command 'clang' failed with exit status 1
then you need to run:
pip install --global-option=build_ext --global-option="-I/usr/local/lib/python2.7/site-packages/numpy/core/include/" biom-format
After that you can install the micca package.
Install micca from source¶
In order to install micca from sources (with the standard procedure
python setup.py install
), in addition to Python (>=2.7) and NumPy
(>=1.8.0), the following Python packages must be installed:
- SciPy >=0.13.0
- Pandas >=0.17.0
- matplotlib >=1.3.0
- Biopython >=1.50
- cutadapt >=1.9
- biom-format >=1.3.1
The easiest way to install these packages is to is using pip:
sudo pip install 'scipy >=0.13.0' 'pandas >=0.17.0' 'matplotlib >=1.3.0' 'biopython >= 1.50' 'cutadapt >=1.9' 'biom-format >=1.3.1'
Download the latest version from https://github.com/compmetagen/micca/releases and complete the installation:
tar -zxvf micca-X.Y.Z.tar.gz
sudo python setup.py install
If you don’t have root access¶
Install micca in a local directory by specifying the --prefix
argument. Then
you need to set the environment variable PYTHONPATH
:
python setup.py install --prefix=/path/to/modules
export PYTHONPATH=$PYTHONPATH:/path/to/modules/lib/python{version}/site-packages
Note
In order to export the variable permanently add the command
at the bottom of your ~/.bash_profile
or ~/.bashrc
file.
Testing the installation¶
micca -h
Install RDP classifier (optional)¶
The RDP Classifier is a naive bayesian classifier for taxonomic assignments
(http://sourceforge.net/projects/rdp-classifier/). The RDP classifier can be
used in the classify command (option -m/--method rdp
).
Warning
Only RDP Classifier version >2.8 is supported. Install the standard Java or
Java compatible runtime (sudo apt-get install default-jre
in
Ubuntu/Debian or go to the Oracle Java homepage for OS X)
Download and unzip the file (RDP classifier 2.11 2015-09-14):
wget https://sourceforge.net/projects/rdp-classifier/files/rdp-classifier/rdp_classifier_2.11.zip
unzip rdp_classifier_2.11.zip
Now you must set the environment variable RDPPATH
by typing:
$ export RDPPATH=/path-to-rdp-classifier/rdp_classifier_2.11/
e.g. export RDPPATH=/Users/David/rdp_classifier_2.11
.
Note
In order to export the variable permanently add the latest command
at the bottom of your .bashrc
file.
Supported databases¶
16S rRNA¶
- Greengenes ftp://greengenes.microbio.me/greengenes_release/
- Greengenes Core Set (useful for NAST) http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/core_set_aligned.fasta.imputed
- QIIME-formatted SILVA https://www.arb-silva.de/download/archive/qiime/
ITS¶
UNITE https://unite.ut.ee/repository.php (QIIME releases)
Note
In the QIIME releases of the UNITE database the
_s
in the filename (e.g.sh_qiime_release_s_DD.MM.YYY.zip
) specifies that the database includes singletons.
Paired-end sequencing - 97% OTU¶
This tutorial describes a standard micca pipeline for the analysis of overlapping paired-end sequences. The main products of this pipeline are:
- the Operational Taxonomic Units (OTUs), defined clustering the processed sequences at a similarity threshold of 97%;
- an OTU table, containining the number of times each OTU is observed in each sample;
- a taxonomic classification for each OTU;
- an OTU phylogenetic tree.
This pipeline is intended for different platforms, such as Illumina MiSeq and Illumina HiSeq. Although this tutorial explains how to apply the pipeline to 16S rRNA amplicons, it can be adapted to others markers gene/spacers, e.g. Internal Transcribed Spacer (ITS), 18S or 28S.
Table of Contents
Dataset download¶
The following paired-end 16S rRNA dataset contains 34 samples in FASTQ format (V3-V4 region, 341F 5’-CCTACGGGNGGCWGCAG-3’, 806Rmod 5’-GACTACNVGGGTWTCTAATCC-3’).
Samples comes from the paper “Diversity and Cyclical Seasonal Transitions in the Bacterial Community in a Large and Deep Perialpine Lake” were seasonal variations in the bacterioplankton community composition in the lake Garda were analized. Sampling was carried out at monthly intervals in three layers representative of the epilimnetic and euphotic zones of the lake, 1, 10, and 20 m. The dataset contains only a subset of the entire study (2014 samples only) and raw data were randomly subsampled at 3000 sequences per sample.
The 2x300-bp paired-end sequencing was carried out on an Illumina MiSeq.
Open a terminal, download the data and prepare the working directory:
wget ftp://ftp.fmach.it/metagenomics/micca/examples/garda.tar.gz
tar -zxvf garda.tar.gz
cd garda
Merge paired-end sequences¶
Overlapping paired-end sequences must be merged to obtain consensus sequences (sometimes called assembly). This operation can be performed with the mergepairs command.

Moreover, the command merges the samples in a single file where sample names are appended to the sequence identifier, as in merge and split commands.
Since the sequenced region is about of 465-bp (806-341) and the reads are of 300-bp, the overlap region is quite large ((2x300)-465=135 bp), as rule of thumb we set a minimum overlap length of 100 and maximum number of allowed mismatches of about 1/3, say 30:
micca mergepairs -i fastq/*_R1*.fastq -o merged.fastq -l 100 -d 30
Note
Starting from micca 1.6.0 staggered read pairs (staggered pairs are pairs
where the 3’ end of the reverse read has an overhang to the left of the 5’
end of the forward read) will be merged by default. To override this feature
(and therefore to discard staggered alignments) set the -n/--nostagger
option.
Note
mergepairs works with FASTQ files only.
Note
Reverse file names will be constructed by replacing the string _R1
in
the forward file name with _R2
(typical in Illumina file names, see
options -p/--pattern
and -e/--repl
).
Compute reads statistics¶
We can report sequences statistics computed on the file merged.fastq
. Run
the command stats:
micca stats -i merged.fastq -o stats_merged
The command reports in 3 text files and in the relative plots (in PNG format) the length distribution, the Q score distribution and a quality summary. The quality summary plot (stats_merged/stats_qualsumm_plot.png) is reported below:

Primer trimming¶
Segments which match PCR primers should be now removed. For paired-end (already merged) reads, we recommend to trim both forward and reverse primers and discard reads that do not contain the forward OR the reverse primer.
These operations can be performed with the trim command:
micca trim -i merged.fastq -o trimmed.fastq -w CCTACGGGNGGCWGCAG -r GACTACNVGGGTWTCTAATCC -W -R -c
The option -W/--duforward
and -R/--dureverse
ensures that reads that do
not contain the forward or the reverse primer will be discarded. With the option
-c/--searchrc
the command searches reverse complement primers too.
Quality filtering¶
Producing high-quality OTUs requires high-quality reads. filter filters sequences according to the maximum allowed expected error (EE) rate %. We recommend values <=1%.
For paired-end reads, we recommend to merge pairs first, then quality filter using a maximum EE threshold with no length truncation.
Warning
Parameters for the filter command should be chosen using the tool filterstats.
Choosing parameters for filtering¶
The command filterstats reports the fraction of reads that would pass for each specified maximum expected error (EE) rate %:
micca filterstats -i trimmed.fastq -o filterstats
Open the PNG file filterstats/stats_plot.png
:

In this case (overlapping paired paired-end reads) we are interested in the plot
on top (minimum length filtering only). A minimum read length (L) of 400 and a
maximum error rate of 0.75% seems to be a good compromise between the
expected error rate and the number of reads remaining. Inspecting the file
filterstats/filterstats_minlen.txt
, you can see that more than 85% reads will
pass the filter:
L 0.25 0.5 0.75 1.0 1.25 1.5
...
399 63.856 77.766 85.664 90.844 94.484 96.853
400 63.856 77.765 85.661 90.842 94.481 96.850
401 63.842 77.747 85.643 90.822 94.459 96.827
...
Note
To obtain general sequencing statistics, run the micca command
stats on the file trimmed.fastq
.
Filter sequences¶
Now we can run the filter command with the selected parameters:
micca filter -i trimmed.fastq -o filtered.fasta -e 0.75 -m 400
Note
The maximum number of allowed Ns after truncation can be also specified in filterstats and in filter.
OTU picking¶
To characterize the taxonomic structure of the samples, the sequences are now organized into Operational Taxonomic Units (OTUs) at varying levels of identity. An identity of 97% represent the common working definition of bacterial species. The otu command implements several state-of-the-art approaches for OTU clustering, but in this tutorial we will focus on the de novo greedy clustering (see OTU picking and Denoising):
micca otu -m denovo_greedy -i filtered.fasta -o denovo_greedy_otus -t 4 -c
The otu command returns several files in the output directory,
including the SV table (otutable.txt
) and a FASTA file containing the
representative sequences (otus.fasta
).
Note
See OTU picking and Denoising to see how to apply the de novo swarm, closed-reference and the open-reference OTU picking strategies to these data.
Assign taxonomy¶
Now we can assign taxonomy to each representative sequence using the classify command. In this tutorial we use the RDP (https://doi.org/10.1128/AEM.00062-07) classifier.
Note
See Install on how to install the RDP classifier on your system.
micca classify -m rdp -i denovo_greedy_otus/otus.fasta -o denovo_greedy_otus/taxa.txt
classify returns a taxonomy file like this:
DENOVO1 Bacteria;Cyanobacteria/Chloroplast;Cyanobacteria
DENOVO2 Bacteria;Cyanobacteria/Chloroplast;Cyanobacteria;Family II;Family II;GpIIa
DENOVO3 Bacteria;Chloroflexi;Anaerolineae;Anaerolineales;Anaerolineaceae
DENOVO4 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Limnohabitans
...
Infer the phylogenetic tree¶
These steps are necessary if you want to use phylogenetic-based metrics such as the UniFrac distance (https://doi.org/10.1128/AEM.01996-06) in the downstream analysis.
Multiple Sequence Alignment (MSA)¶
The msa command provides two approaches for MSA: MUSCLE (https://doi.org/10.1093/nar/gkh340) (de novo alignment) and Nearest Alignment Space Termination (NAST) (https://doi.org/10.1093/nar/gkl244) (which uses a template alignment). In this tutorial we will use the NAST alignment method. For 16S rRNA sequences, a good template alignment is the Greengenes Core Set:
wget ftp://ftp.fmach.it/metagenomics/micca/dbs/core_set.tar.gz
tar -zxvf core_set.tar.gz
At this point we can run the msa command:
micca msa -m nast -i denovo_greedy_otus/otus.fasta -o denovo_greedy_otus/msa.fasta \
--nast-template core_set_aligned.fasta.imputed --nast-threads 4
Build the phylogenetic tree¶
At this point we can build the phylogenetic tree from the MSA using tree:
micca tree -i denovo_greedy_otus/msa.fasta -o denovo_greedy_otus/tree.tree
Note
The output tree is in Newick format.
Midpoint rooting¶
UniFrac metrics require phylogenetic trees to be rooted. The tree can be rooted (in this case at midpoint between the two most distant tips of the tree) using the root command:
micca root -i denovo_greedy_otus/tree.tree -o denovo_greedy_otus/tree_rooted.tree
Note
Tree can also be rooted with the outgroup clade containing selected targets, see root.
Build the BIOM file¶
The Biological Observation Matrix (BIOM) is a common format for representing OTU tables and metadata and is the core data type for downstream analyses in QIIME and in phyloseq. tobiom converts the OTU table and the taxonomy table produced by the previous steps to the BIOM format. In addition, the Sample data can be added:
micca tobiom -i denovo_greedy_otus/otutable.txt -o denovo_greedy_otus/tables.biom \
-t denovo_greedy_otus/taxa.txt -s sampledata.txt
Denoising (Illumina only)¶
Usually, amplicon sequences are clustered into Operational Taxonomic Units (OTUs) using a similarity threshold of 97%, which represents the common working definition of bacterial species.
Another approach consists to identify the Sequence Variants (SVs, see OTU picking and Denoising for details). This approach avoids clustering sequences at a predefined similarity threshold and usually includes a denoising algorithm in order to identify SVs.
In this tutorial we show how to perform the denoising of Illumina overlapping paired-end sequences in order to detect the SVs. Athough this tutorial explains how to apply the pipeline to 16S paired-end Illumina reads, it can be adapted to Illumina single-end sequening or to others markers gene/spacers, e.g. Internal Transcribed Spacer (ITS), 18S or 28S.
Table of Contents
Data download and preprocessing¶
In this tutorial we analyze the same dataset used in Paired-end sequencing - 97% OTU. Reads merging, primer trimming and quality filtering are the same as in Paired-end sequencing - 97% OTU:
wget ftp://ftp.fmach.it/metagenomics/micca/examples/garda.tar.gz
tar -zxvf garda.tar.gz
cd garda
micca mergepairs -i fastq/*_R1*.fastq -o merged.fastq -l 100 -d 30
micca trim -i merged.fastq -o trimmed.fastq -w CCTACGGGNGGCWGCAG -r GACTACNVGGGTWTCTAATCC -W -R -c
micca filter -i trimmed.fastq -o filtered.fasta -e 0.75 -m 400
Denoising - Sequence Variants identification¶
The otu command implements the UNOISE3 protocol
(denovo_unoise
) which includes dereplication, denoising and chimera
filtering:
micca otu -m denovo_unoise -i filtered.fasta -o denovo_unoise_otus -t 4 -c
The otu command returns several files in the output directory,
including the SV table (otutable.txt
) and a FASTA file containing the
representative sequences (otus.fasta
).
Note
See OTU picking and Denoising to see how to apply the de novo swarm, closed-reference and the open-reference OTU picking strategies to these data.
Single-end sequencing¶
This tutorial describes a standard micca pipeline for the analysis of single-end amplicon data. This pipeline is intended for different platforms, such as Roche 454, Illumina MiSeq/HiSeq and Ion Torrent. Although this tutorial explains how to apply the pipeline to 16S rRNA amplicons, it can be adapted to others markers gene/spacers, e.g. Internal Transcribed Spacer (ITS), 18S or 28S.
Table of Contents
Dataset download¶
The dataset used in this tutorial is taken from the Barelli et al. paper Habitat fragmentation is associated to gut microbiota diversity of an endangered primate: implications for conservation (https://doi.org/doi:10.1038/srep14862). The dataset contains only a subset of the entire study (Mwanihana samples only) for a total of 15 samples (in FASTQ format) and 235179 16S rRNA amplicon reads (V1-V3 hypervariable regions, 27F 5’-AGAGTTTGATCMTGGCTCAG, 533R 5’-TTACCGCGGCTGCTGGCAC). The 454 pyrosequencing was carried out on the GS FLX+ system using the XL+ chemistry.
Open a terminal, download the data and prepare the working directory:
wget ftp://ftp.fmach.it/metagenomics/micca/examples/mwanihana.tar.gz
tar -zxvf mwanihana.tar.gz
cd mwanihana
Merge files¶
Now the FASTQ files must be merged in a single file. This operation can be performed with the merge command. Sample names will be included into the sequence indentifiers.
micca merge -i fastq/*.fastq -o merged.fastq
Note
The merge command works with FASTQ or FASTA files. If your sequences are in a different format (e.g. SFF or FASTA+QUAL) use convert to convert them.
Warning
In the case of multiplexed reads (with 5’ barcode sequences) use split instead of merge. This command will perform demultiplexing and merging at the same time.
Note
In the case of overlapping paired-end reads go to Paired-end sequencing - 97% OTU or Denoising (Illumina only).
Primer trimming¶
Segments which match PCR primers should be now removed. Typical Roche 454 reads start with a sequence key (e.g. TCAG) followed by the barcode (if it was not previously removed) and the forward primer. For these types of data (and in general, for single-end sequencing) we recommend to trim both forward reverse primers and discard reads that do not contain the forward primer. Moreover, sequence preceding (for the forward) or succeding (for the reverse, if found) primers should be removed:

These operations can be performed with the trim command:
micca trim -i merged.fastq -o trimmed.fastq -w AGAGTTTGATCMTGGCTCAG -r GTGCCAGCAGCCGCGGTAA -W
The option -W/--duforward
ensures that reads that do not contain
the forward primer will be discarded.
Warning
Do not use the -R/--dureverse
with single-end reads.
Quality filtering¶
Producing high-quality OTUs requires high-quality reads. The filter command filters sequences according to the maximum allowed expected error (EE) rate %. We recommend values <=1%. Moreover, to obtain good results in clustering (see otu), reads should be truncated at the same length when they cover partial amplicons or if quality deteriorates towards the end (common when you have long amplicons in 454 or Illumina single-end sequencing).
Warning
Parameters for the filter command should be chosen using the command filterstats.
Choosing parameters for filtering¶
The command filterstats reports the fraction of reads that would pass for each specified maximum expected error (EE) rate %:
micca filterstats -i trimmed.fastq -o filterstats
Open the PNG file filterstats/stats_plot.png
:

In this case we are interested in the plot below (minimum length filtering +
truncation). A truncation length of 350 and a maximum error rate of 0.5%
seems to be a good compromise between read read length, expected error rate and
number of reads remaining. Inspecting the file
filterstats/trunclen_stats.txt
, you can see that more than 92% reads
will pass the filter:
L 0.25 0.5 0.75 1.0 1.25 1.5
...
349 78.905 92.472 97.425 99.135 99.705 99.897
350 78.639 92.385 97.389 99.126 99.704 99.896
351 78.369 92.300 97.357 99.116 99.700 99.892
...
Note
To obtain general sequencing statistics, run stats.
Filter sequences¶
Now we can run the filter command with the selected parameters:
micca filter -i trimmed.fastq -o filtered.fasta -e 0.5 -m 350 -t
Note
The maximum number of allowed Ns after truncation can be also specified in filterstats and in filter.
OTU picking¶
To characterize the taxonomic structure of the samples, the sequences are now organized into Operational Taxonomic Units (OTUs) at varying levels of identity. An identity of 97% represent the common working definition of bacterial species. The otu command implements several state-of-the-art approaches for OTU clustering, but in this tutorial we will focus on the de novo greedy clustering (see OTU picking and Denoising):
micca otu -i filtered.fasta -o denovo_greedy_otus -d 0.97 -c -t 4
The otu command returns several files in the output directory,
including the OTU table (otutable.txt
) and a FASTA file containing the
representative sequences (otus.fasta
).
An introduction to the downstream analysis with R and phyloseq¶
In this tutorial we describe a R pipeline for the downstream analysis starting from the output of micca. In particular, we will discuss the following topics:
- rarefaction;
- taxonomy and relative abundances;
- alpha diversity and non-parametric tests;
- beta diversity and PERMANOVA;
- differential abundance testing with DESeq2.
You can download the script here
.
Warning
- This tutorial requires Paired-end sequencing - 97% OTU to be done.
- The tutorial is tested on R 3.5.3, phyloseq 1.26.1, ggplot2 3.1.0, vegan 2.5-4 and DESeq2 1.22.2.
(Optional) Using the Dockerized RStudio environment¶
The tutorial can be run using a Docker image with
the required packages installed. Run the following command line mounting the
host working directory (i.e. the directory containing the micca output files, in
this case /Users/davide/micca
) into the /home/rstudio/micca
folder:
docker run --rm -d -p 8787:8787 -e DISABLE_AUTH=true --name rstudio-micca \
-v /Users/davide/micca:/home/rstudio/micca \
compmetagen/rstudio-micca
Open a browser and go to 127.0.0.1:8787.
Warning
Files stored outside the micca
directory will be lost when you stop the
container.
You can stop (and destroy) the container using the following line:
docker stop rstudio-micca
Import data and preparation¶
Import the micca processed data (the BIOM file, the phylogenetic tree and the
representative sequences) into the R environment using the import_biom()
function available in phyloseq library.
> library("phyloseq")
> library("ggplot2")
> library("vegan")
> library("DESeq2")
> setwd("/home/rstudio/micca/garda/denovo_greedy_otus") # set the working directory
> ps = import_biom("tables.biom", treefilename="tree_rooted.tree", refseqfilename="otus.fasta")
> sample_data(ps)$Month <- as.numeric(sample_data(ps)$Month)
> ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 529 taxa and 34 samples ]
sample_data() Sample Data: [ 34 samples by 4 sample variables ]
tax_table() Taxonomy Table: [ 529 taxa by 6 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 529 tips and 528 internal nodes ]
refseq() DNAStringSet: [ 529 reference sequences ]
The import_biom()
function returns a phyloseq object which includes the OTU
table (which contains the OTU counts for each sample), the sample data matrix
(containing the metadata for each sample), the taxonomy table (the predicted
taxonomy for each OTU), the phylogenetic tree, and the OTU representative
sequences.
Print the metadata using the phyloseq function sample_data()
:
> sample_data(ps)
Sample Data: [34 samples by 4 sample variables]:
Season Depth Month Year
B0214D1-PL1-D1 Winter 1 2 14
B0214D2-PL1-E1 Winter 10 2 14
B0214D3-PL1-F1 Winter 20 2 14
B0314D1-PL1-G1 Spring 1 3 14
B0314D2-PL1-H1 Spring 10 3 14
B0314D3-PL1-A2 Spring 20 3 14
B0414D1-PL1-B2 Spring 1 4 14
B0414D2-PL1-C2 Spring 10 4 14
B0414D3-PL1-D2 Spring 20 4 14
B0514D1-PL1-E2 Spring 1 5 14
B0514D2-PL1-F2 Spring 10 5 14
B0514D3-PL1-G2 Spring 20 5 14
B0614D1-PL1-H2 Summer 1 6 14
B0614D2-PL1-A3 Summer 10 6 14
B0714D2-PL1-B3 Summer 10 7 14
B0714D3-PL1-C3 Summer 20 7 14
B0814D1-PL1-D3 Summer 1 8 14
B0814D2-PL1-E3 Summer 10 8 14
B0814D3-PL1-F3 Summer 20 8 14
B0914D1-PL1-G3 Fall 1 9 14
B0914D2-PL1-H3 Fall 10 9 14
B0914D3-PL1-A4 Fall 20 9 14
B1014D1-PL1-B4 Fall 1 10 14
B1014D2-PL1-C4 Fall 10 10 14
B1014D3-PL1-D4 Fall 20 10 14
B1114D1-PL1-E4 Fall 1 11 14
B1114D2-PL1-F4 Fall 10 11 14
B1114D3-PL1-G4 Fall 20 11 14
B1214D1-PL1-H4 Winter 1 12 14
B1214D2-PL1-A5 Winter 10 12 14
B1214D3-PL1-B5 Winter 20 12 14
Bar0114D1-PL1-A1 Winter 1 1 14
Bar0114D2-PL1-B1 Winter 10 1 14
Bar0114D3-PL1-C1 Winter 20 1 14
The sample data contains 4 features for each sample: the season of sampling, the sampling depth (in m), the month and the year of sampling .
Plot the rarefaction curves using vegan function rarecurve()
:
> rarecurve(t(otu_table(ps)), step=50, cex=0.5)

otu_table()
is a phyloseq function which extract the OTU table from the
phyloseq object.
Rarefy the samples without replacement. Rarefaction is used to simulate even number of reads per sample. In this example, the rarefaction depth chosen is the 90% of the minimum sample depth in the dataset (in this case 459 reads per sample).
> # rarefy without replacement
> ps.rarefied = rarefy_even_depth(ps, rngseed=1, sample.size=0.9*min(sample_sums(ps)), replace=F)
Warning
- Rarefaction can waste a lot of data and would not be necessary. See https://doi.org/10.1371/journal.pcbi.1003531.
- Remember to set the random seed (
rngseed
) for repeatable experiments.
Exercise
Plot the samples depths before and after the rarefaction using the
phyloseq function sample_sums()
.
Plot abundances¶
Using the rarefied dataset, make a stacked barplot of the abundances (read
counts) and color each OTU (i.e. each bar) according its classified phylum (in
this case Rank2
):
> plot_bar(ps.rarefied, fill="Rank2")

The plot_bar()
function returns a ggplot2 object that can be customized
with additional options, in this case we separate the samples in 4 panels
according to the season:
> plot_bar(ps.rarefied, fill="Rank2") + facet_wrap(~Season, scales="free_x", nrow=1)

Alternatively, we can merge the OTUs at the phylum level and build a new phyloseq
object. Given a taxonomic rank (in this case the phylum), the phyloseq function
tax_glom
merges the OTUs with the same taxonomy, summing the abundances:
> ps.phylum = tax_glom(ps.rarefied, taxrank="Rank2", NArm=FALSE)
> ps.phylum
phyloseq-class experiment-level object
otu_table() OTU Table: [ 35 taxa and 34 samples ]
sample_data() Sample Data: [ 34 samples by 4 sample variables ]
tax_table() Taxonomy Table: [ 35 taxa by 6 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 35 tips and 34 internal nodes ]
refseq() DNAStringSet: [ 35 reference sequences ]
The option NArm
set to FALSE
forces the function to keep the
unclassified OTUs at the phylum level. Now we can make a cleaner bar plot:
> plot_bar(ps.phylum, fill="Rank2") + facet_wrap(~Season, scales= "free_x", nrow=1)

Exercise
Make a stacked barplot at class level (Rank3
).
Alpha diversity¶
Plot the number of OTUs at each month coloring the points according to the sampling depth:
> plot_richness(ps.rarefied, x="Month", color="Depth", measures=c("Observed"))

Make a boxplot of the number of OTUs and the Shannon entropy grouping the different months by season:
> plot_richness(ps.rarefied, x="Season", measures=c("Observed", "Shannon")) + geom_boxplot()

We can export a data.frame
containig a number of standard alpha diversity
estimates using the phyloseq function estimate_richness()
> rich = estimate_richness(ps.rarefied)
> rich
Observed Chao1 se.chao1 ACE se.ACE Shannon Simpson InvSimpson Fisher
B0214D1.PL1.D1 106 197.8667 35.57985 188.3066 8.170040 3.687611 0.9299652 14.27862 43.21532
B0214D2.PL1.E1 102 143.1304 16.39579 161.0871 6.968287 3.689071 0.9314271 14.58303 40.65808
B0214D3.PL1.F1 103 184.0588 30.82336 190.4337 7.690088 3.611560 0.9227125 12.93871 41.28956
B0314D1.PL1.G1 88 137.4000 21.40127 142.2737 6.479689 3.534831 0.9325188 14.81895 32.34465
B0314D2.PL1.H1 100 222.7692 47.63464 203.5988 7.938369 3.504056 0.9304873 14.38587 39.41058
B0314D3.PL1.A2 103 178.2000 30.13564 160.8535 6.547177 3.787005 0.9486475 19.47324 41.28956
B0414D1.PL1.B2 98 143.0000 20.26436 136.2743 5.823351 4.086749 0.9750571 40.09153 38.18345
B0414D2.PL1.C2 109 224.9091 47.96245 172.8367 7.082246 4.000190 0.9664754 29.82883 45.18882
B0414D3.PL1.D2 114 186.5455 26.01395 211.5217 8.993286 3.932662 0.9602954 25.18601 48.58680
B0514D1.PL1.E2 72 99.1875 13.13050 109.1346 6.234068 3.124113 0.9126215 11.44446 23.97705
B0514D2.PL1.F2 78 109.1667 14.13628 122.0444 6.234465 3.223947 0.9125835 11.43949 26.97943
B0514D3.PL1.G2 91 128.0588 16.43157 126.6355 5.731954 3.524923 0.9258547 13.48704 34.04531
B0614D1.PL1.H2 90 123.0000 15.00832 128.4771 5.792422 3.816668 0.9577323 23.65873 33.47364
B0614D2.PL1.A3 102 151.2857 19.37303 167.6238 7.074761 3.757622 0.9423821 17.35571 40.65808
B0714D2.PL1.B3 110 172.6364 23.12117 187.8670 8.028159 3.709128 0.9258547 13.48704 45.85743
B0714D3.PL1.C3 96 141.5556 19.02400 151.4630 6.818280 3.850288 0.9634946 27.39319 36.97645
B0814D1.PL1.D3 96 178.5000 34.94025 155.8289 6.447118 3.654719 0.9326233 14.84192 36.97645
B0814D2.PL1.E3 106 155.5000 19.71247 162.2091 6.744816 4.022988 0.9689815 32.23887 43.21532
B0814D3.PL1.F3 116 216.6471 36.87625 215.7956 8.770340 3.911931 0.9456952 18.41456 49.98502
B0914D1.PL1.G3 108 168.2727 22.42221 201.5552 9.294336 3.891102 0.9617763 26.16180 44.52562
B0914D2.PL1.H3 103 162.3684 23.12990 180.7485 8.445071 3.886107 0.9643964 28.08706 41.28956
B0914D3.PL1.A4 123 178.0000 19.47167 199.4132 8.292517 4.090999 0.9670545 30.35312 55.06042
B1014D1.PL1.B4 101 173.5263 27.19010 193.4237 8.337151 3.469170 0.9060428 10.64314 40.03176
B1014D2.PL1.C4 97 251.0000 63.34083 207.5726 8.807031 3.352156 0.8968440 9.69406 37.57745
B1014D3.PL1.D4 108 180.0588 27.98694 171.2683 6.839082 3.851583 0.9479830 19.22447 44.52562
B1114D1.PL1.E4 138 244.6364 35.62005 235.2076 8.598060 4.349086 0.9764620 42.48457 66.94886
B1114D2.PL1.F4 142 217.6774 24.31684 250.3584 9.765194 4.391405 0.9794808 48.73491 70.36907
B1114D3.PL1.G4 129 206.5385 26.10650 225.4320 8.773816 4.210509 0.9742881 38.89256 59.64440
B1214D1.PL1.H4 118 240.0625 44.22653 241.1003 9.310808 4.091076 0.9714972 35.08426 51.40601
B1214D2.PL1.A5 121 185.5652 23.38079 199.4590 8.499590 4.159264 0.9720763 35.81183 53.58096
B1214D3.PL1.B5 130 256.1364 40.94272 298.4156 10.584524 4.162425 0.9733673 37.54785 60.43014
Bar0114D1.PL1.A1 123 190.7778 23.19105 215.1598 8.974270 4.021200 0.9614251 25.92359 55.06042
Bar0114D2.PL1.B1 120 216.3158 34.30966 222.7492 9.064837 4.028745 0.9586721 24.19674 52.85012
Bar0114D3.PL1.C1 116 187.8696 25.47702 221.1842 8.864324 3.932334 0.9560141 22.73454 49.98502
Test whether the observed number of OTUs differs significantly between seasons. We make a non-parametric test, the Wilcoxon rank-sum test (Mann-Whitney):
> pairwise.wilcox.test(rich$Observed, sample_data(ps.rarefied)$Season)
Pairwise comparisons using Wilcoxon rank sum test
data: rich$Observed and metasample_data(ps.rarefied)data$Season
Fall Spring Summer
Spring 0.112 - -
Summer 0.270 0.681 -
Winter 1.000 0.025 0.112
P value adjustment method: holm
By default, the function pairwise.wilcox.test()
reports the pairwise
adjusted (Holm) p-values.
Exercise
Repeat the test on the Shannon indexes.
Beta diversity¶
Plot the PCoA using the unweighted UniFrac as distance:
> # PCoA plot using the unweighted UniFrac as distance
> wunifrac_dist = phyloseq::distance(ps.rarefied, method="unifrac", weighted=F)
> ordination = ordinate(ps.rarefied, method="PCoA", distance=wunifrac_dist)
> plot_ordination(ps.rarefied, ordination, color="Season") + theme(aspect.ratio=1)

Test whether the seasons differ significantly from each other using the permutational ANOVA (PERMANOVA) analysis:
> adonis(wunifrac_dist ~ sample_data(ps.rarefied)$Season)
Call:
adonis(formula = wunifrac_dist ~ sample_data(ps.rarefied)$Season)
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
sample_data(ps.rarefied)$Season 3 1.3011 0.43372 4.1604 0.29381 0.001 ***
Residuals 30 3.1274 0.10425 0.70619
Total 33 4.4286 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Exercise
Make the PCoA and the PERMANOVA using the Bray-Curtis dissimilarity instead.
OTU differential abundance testing with DESeq2¶
To test the differences at OTU level between seasons using DESeq2, we need to
convert the Season
column into factor. Note that we use the data without
rarefaction (i.e. ps
object):
> sample_data(ps)$Season <- as.factor(sample_data(ps)$Season)
Convert the phyloseq object to a DESeqDataSet and run DESeq2:
> ds = phyloseq_to_deseq2(ps, ~ Season)
> ds = DESeq(ds)
Extract the result table from the ds
object usind the DESeq2 function
results
and filter the OTUs using a False Discovery Rate (FDR) cutoff of
0.01. In this example we return the significantly differentially abundant OTU
between the seasons “Spring” and “Fall”:
> alpha = 0.01
> res = results(ds, contrast=c("Season", "Spring", "Fall"), alpha=alpha)
> res = res[order(res$padj, na.last=NA), ]
> res_sig = res[(res$padj < alpha), ]
> res_sig
log2 fold change (MLE): Season Spring vs Fall
Wald test p-value: Season Spring vs Fall
DataFrame with 62 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
DENOVO17 22.7436598625802 -4.1529844728879 0.552035702386233 -7.52303601911288 5.35186717121325e-14 1.24163318372147e-11
DENOVO35 10.6015033917283 -7.36751901929925 1.01933372324247 -7.22777913779147 4.90956301343594e-13 5.6950930955857e-11
DENOVO91 5.31287448011852 -6.51255526618412 0.947998700432628 -6.86979345352695 6.42949270405053e-12 4.97214102446574e-10
DENOVO2 82.4704545010533 -4.14259840011034 0.673404296938788 -6.15172552201119 7.66444402875036e-10 4.44537753667521e-08
DENOVO7 15.6311735008548 5.91263059667889 0.979789881740526 6.0345903819455 1.59366414316775e-09 7.39460162429838e-08
... ... ... ... ... ... ...
DENOVO83 3.63662006180492 1.92505847356698 0.617438877584007 3.11781221341228 0.00182198852945677 0.00728795411782707
DENOVO89 2.68296393708501 2.84137889985046 0.912892035548744 3.11250267195342 0.00185508334637251 0.00729456502302411
DENOVO72 4.86241695816352 2.71763740147229 0.895564240058129 3.03455327927775 0.00240892202480818 0.00931449849592497
DENOVO21 17.208142677795 -1.1266184329166 0.373108760004578 -3.01954430901804 0.002531552600065 0.00962820005270621
DENOVO55 6.24723247307275 2.09415598552554 0.695335908667259 3.01171845063975 0.00259773414843998 0.00972055358771089
The result table reports base means across samples, log2 fold changes, standard errors, test statistics, p-values and adjusted p-values.
Make a genus vs log2FC plot of the significant OTUs:
> res_sig = cbind(as(res_sig, "data.frame"), as(tax_table(ps)[rownames(res_sig), ], "matrix"))
> ggplot(res_sig, aes(x=Rank6, y=log2FoldChange, color=Rank2)) +
geom_jitter(size=3, width = 0.2) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))

Exercise
Test the differences between summer and fall and compare the results with those above.
Compute basic statistics, rarefy and summarize OTU/SV tables using micca¶
Note
This tutorial requires Paired-end sequencing - 97% OTU to be done.
The command tablestats reports a sample summary, an OTU summary and the rarefaction curves for the input OTU/SV table:
micca tablestats -i denovo_greedy_otus/otutable.txt -o tablestats
Inspecting the file tablestats/tablestats_samplesumm.txt
you can see that
the less abundant sample contains 512 reads:
Sample Depth NOTU NSingle
B1114D1-PL1-E4 512 145 68
B1014D2-PL1-C4 1356 152 57
B0214D3-PL1-F1 1665 192 74
... ... ... ...
Note
Rarefaction curves can be inspected through
tablestats/tablestats_rarecurve.txt
and
tablestats/tablestats_rarecurve_plot.png
.
To compare different samples, the OTU/SV table must be subsampled (rarefied) using the command
tablerare. In this case we are interested in rarefy the table
with the depth of the less abundant sample (B1114D1-PL1-E4
):
micca tablerare -i denovo_greedy_otus/otutable.txt -o denovo_greedy_otus/otutable_rare.txt -d 500
Now we can summarize communities by their taxonomic composition. The
tabletotax creates in the output directory a table for each
taxonomic level (taxtable1.txt
, …, taxtableN.txt
). OTU counts are
summed together if they have the same taxonomy at the considered level.
micca tabletotax -i denovo_greedy_otus/otutable_rare.txt -t denovo_greedy_otus/taxa.txt -o taxtables
Finally, we can generate a relative abundance bar plot from generated taxa tables, using the command tablebar. In this case only the bar plot relative to the taxonomy level 2 (phylum) will be generated:
micca tablebar -i taxtables/taxtable2.txt -o taxtables/taxtable2.png

Picking OTUs for use in PICRUSt¶
PICRUSt (doi: 10.1038/nbt.2676) is a software designed to predict metagenome functional content from marker gene (e.g., 16S rRNA) surveys and full genomes. This tutorial covers how to pick OTUs from 16S rRNA sequences data to use with PICRUSt.
Note
Requires Quality filtering in Single-end sequencing to
be done and the PICRUSt software to be installed in your system.
Warning: PICRUSt 1.0.0 requires the biom-format package v1.3.1 to
be installed in your system (from the command line run: pip
install biom-format==1.3.1
, for more information see
http://biom-format.org/).
PICRUSt requires an Closed-reference OTU table computed against the Greengenes reference (clustered at 97% identity). Download the reference database (Greengenes, version 2013/05), clustered at 97% identity:
wget ftp://ftp.fmach.it/metagenomics/micca/dbs/gg_2013_05.tar.gz
tar -zxvf gg_2013_05.tar.gz
Run the micca closed-reference protocol:
micca otu -m closed_ref -i filtered.fasta -o closed_ref_otus -r 97_otus.fasta -d 0.97 -t 4
cd closed_ref_otus
Report the sample summary:
micca tablestats -i otutable.txt -o tablestats
head tablestats/tablestats_samplesumm.txt
Sample Depth NOTU NSingle
Mw_03 1084 132 39
Mw_06 1387 122 27
Mw_11 1485 155 44
Mw_07 1528 150 36
Mw_01 1537 143 35
Mw_15 1565 144 35
Mw_14 1610 149 42
Mw_02 1670 143 43
Mw_12 1710 153 54
Rarefy the OTU table for the PICRUSt analysis is always a good idea (see https://groups.google.com/forum/#!topic/picrust-users/ev5uZGUIPrQ), so we will rarefy the table at 1084 sequences per sample using tablerare:
micca tablerare -i otutable.txt -o otutable_rare.txt -d 1084
Convert the rarefied OTU table into the BIOM format replacing the OTU IDs with the original sequence IDs using the tobiom command:
micca tobiom -i otutable_rare.txt -o tables.biom -u otuids.txt
Normalize the OTU table by dividing each OTU by the known/predicted 16S copy
number abundance using the PICRUSt script normalize_by_copy_number.py
:
normalize_by_copy_number.py -i tables.biom -o normalized_otus.biom
Create the final metagenome functional predictions using the PICRUSt script
predict_metagenomes.py
:
predict_metagenomes.py -i normalized_otus.biom -o metagenome_predictions.biom
Now you can analyze the PICRUSt predicted metagenome as described in http://picrust.github.io/picrust/tutorials/downstream_analysis.html#downstream-analysis-guide.
OTU picking and Denoising¶
Usually, amplicon sequences are clustered into Operational Taxonomic Units (OTUs) using a similarity threshold of 97%, which represents the common working definition of bacterial species.
Another approach consists to define the so-called Sequence Variants (SVs, a.k.a Amplicon Sequence Variants - ASVs, Exact Sequence Variants ESVs, zero-radius OTUs - ZOTU, unique sequence variants or sub-OTUs). This approach avoids clustering sequences at a predefined similarity threshold and usually includes a denoising algorithm in order to identify SVs (see UNOISE, DADA2, Deblur, oligotyping and swarm).
The otu command assigns similar sequences (marker genes such as 16S rRNA and the fungal ITS region) to operational taxonomic units or sequence variants (OTUs or SVs).
Methods¶
De novo greedy¶
In denovo greedy clustering (parameter --method denovo_greedy
), sequences
are clustered without relying on an external reference database, using an
approach similar to the UPARSE pipeline (https://doi.org/10.1038/nmeth.2604) and
tested in https://doi.org/10.7287/peerj.preprints.1466v1. This protocol
includes in a single command dereplication, clustering and chimera filtering:
- Dereplication. Predict sequence abundances of each sequence by dereplication, order by abundance and discard sequences with abundance value smaller than MINSIZE (option
-s/--minsize
, default value 2);- Greedy clustering. Distance (DGC) and abundance-based (AGC) strategies are supported (option
--greedy
, see https://doi.org/10.1186/s40168-015-0081-x and https://doi.org/10.7287/peerj.preprints.1466v1 ). Therefore, the candidate representative sequences are obtained;- Chimera filtering (optional). Remove chimeric sequences from the representatives performing a de novo chimera detection (option
--rmchim
);- Map sequences. Map sequences to the representatives.
Example:
micca otu -m denovo_greedy -i filtered.fasta -o denovo_greedy_otus -d 0.97 -c -t 4
De novo UNOISE¶
Denoise amplicon sequences using the UNOISE3 protocol. The method is designed for Illumina (paired or unpaired) reads. This protocol includes in a single command dereplication, denoising and chimera filtering:
- Dereplication; Predict sequence abundances of each sequence by dereplication, order by abundance and discard sequences with abundance value smaller than MINSIZE (option
-s/--minsize
, default value 8);- Denoising;
- Chimera filtering (optional);
- Map sequences. Map sequences to the representatives.
Example:
micca otu -m denovo_unoise -i filtered.fasta -o denovo_unoise_otus -c -t 4
Closed-reference¶
Sequences are clustered against an external reference database and reads that
could not be matched are discarded (method closed_ref
). Example:
Download the reference database (Greengenes), clustered at 97% identity:
wget ftp://ftp.fmach.it/metagenomics/micca/dbs/gg_2013_05.tar.gz
tar -zxvf gg_2013_05.tar.gz
Run the closed-reference protocol:
micca otu -m closed_ref -i filtered.fasta -o closed_ref_otus -r 97_otus.fasta -d 0.97 -t 4
Simply perform a sequence ID matching with the reference taxonomy file (see classify):
cd closed_ref_otus
micca classify -m otuid -i otuids.txt -o taxa.txt -x ../97_otu_taxonomy.txt
Open-reference¶
Open-reference clustering (open_ref
): sequences are clustered against an
external reference database (as in Closed-reference) and reads that
could not be matched are clustered with the De novo greedy protocol.
Example:
Download the reference database (Greengenes), clustered at 97% identity:
wget ftp://ftp.fmach.it/metagenomics/micca/dbs/gg_2013_05.tar.gz
tar -zxvf gg_2013_05.tar.gz
Run the open-reference protocol:
micca otu -m open_ref -i filtered.fasta -o open_ref_otus -r 97_otus.fasta -d 0.97 -t 4 -c
Run the VSEARCH-based consensus classifier or the RDP classifier (see classify):
cd open_ref_otus
micca classify -m cons -i otus.fasta -o taxa.txt -r ../97_otus.fasta -x ../97_otu_taxonomy.txt -t 4
De novo swarm¶
In denovo swarm clustering (doi: 10.7717/peerj.593, doi: 10.7717/peerj.1420,
https://github.com/torognes/swarm, parameter --method denovo_swarm
),
sequences are clustered without relying on an external reference database. From
https://github.com/torognes/swarm:
The purpose of swarm is to provide a novel clustering algorithm that handles massive sets of amplicons. Results of traditional clustering algorithms are strongly input-order dependent, and rely on an arbitrary global clustering threshold. swarm results are resilient to input-order changes and rely on a small local linking threshold d, representing the maximum number of differences between two amplicons. swarm forms stable, high-resolution clusters, with a high yield of biological information.
otu includes in a single command dereplication, clustering and de novo chimera filtering:
- Dereplication. Predict sequence abundances of each sequence by dereplication, order by abundance and discard sequences with abundance value smaller than MINSIZE (option
--minsize
default value is 1, i.e. no filtering);- Swarm clustering. Fastidious option is recommended (
--swarm-fastidious
);- Chimera filtering (optional).
Warning
Removing ambiguous nucleotides (N
) (with the option --maxns 0
in
filter) is mandatory if you use the de novo swarm clustering
method.
Example:
micca filter -i trimmed.fastq -o filtered.fasta -e 0.5 -m 350 -t --maxns 0
micca otu -m denovo_swarm -i filtered.fasta -o otus_denovo_swarm -c --minsize 1 --swarm-fastidious -t 4
Definition of identity¶
In micca, the pairwise identity (except for ‘de novo swarm’ and ‘denovo unoise’) is defined as the edit distance excluding terminal gaps (same as in USEARCH and BLAST):
Output files¶
The otu command returns in a single directory 5 files:
- otutable.txt
TAB-delimited file, containing the number of times an OTU is found in each sample (OTU x sample, see Supported file formats):
OTU Mw_01 Mw_02 Mw_03 ... DENOVO1 151 178 177 ... DENOVO2 339 181 142 ... DENOVO3 533 305 63 ... ... ... ... ... ...
- otus.fasta
FASTA containing the representative sequences (OTUs):
>DENOVO1 GACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAACGGGG... >DENOVO2 GATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCA... >DENOVO3 AGTGAACGCTGGCGACGTGGTTAAGACATGCAAGTCGAGCGGTA... ...
- otuids.txt
TAB-delimited file which maps the OTU ids to original sequence ids:
DENOVO1 IS0AYJS04JQKIS;sample=Mw_01 DENOVO2 IS0AYJS04JL6RS;sample=Mw_01 DENOVO3 IS0AYJS04H4XNN;sample=Mw_01 ...
- hits.txt
TAB-separated file, three-columns, where each column contains: the matching sequence, the representative (seed) and the identity (if available, see Definition of identity):
IS0AYJS04JE658;sample=Mw_01; IS0AYJS04I4XYN;sample=Mw_01 99.4 IS0AYJS04JPH34;sample=Mw_01; IS0AYJS04JVUBC;sample=Mw_01 98.0 IS0AYJS04I67XN;sample=Mw_01; IS0AYJS04JVUBC;sample=Mw_01 99.7 ...
- otuschim.fasta
- (only for ‘denovo_greedy’, ‘denovo_swarm’ and ‘open_ref’ mathods, when
-c/--rmchim
is specified) FASTA file containing the chimeric otus.
Warning
Trimming the sequences to a fixed position before clustering is strongly recommended when they cover partial amplicons or if quality deteriorates towards the end (common when you have long amplicons and single-end sequencing), see Quality filtering.
Note
De novo OTUs are renamed to DENOVO[N]
and reference OTUs to REF[N]
.
Supported file formats¶
Sequence files¶
FASTA and FASTQ Sanger/Illumina 1.8+ format (phred+33) formats are supported. micca provides the convert command to convert between sequence file formats.
Taxonomy files¶
Taxonomy files map sequence IDs to taxonomy. Input taxonomy files must be TAB-delimited files where rows are either in the form:
SEQID[TAB]k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;
SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales;;;
SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
SEQID[TAB]D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__;D_5__;
Compatible taxonomy files are:
- Greengenes (http://greengenes.secondgenome.com/downloads);
- QIIME-formatted SILVA (https://www.arb-silva.de/download/archive/qiime/);
- UNITE (https://unite.ut.ee/repository.php);
- Human Oral Microbiome Database (HOMD) (http://www.homd.org/).
The output taxonomy file returned by classify is a TAB-delimited file where each row is in the format:
SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
OTU/SV tables and taxonomy tables¶
The OTU table returned by otu is an OTU x sample, TAB-delimited text file, containing the number of times an OTU is found in each sample:
OTU Mw_01 Mw_02 Mw_03 ...
DENOVO1 151 178 177 ...
DENOVO2 339 181 142 ...
DENOVO3 533 305 63 ...
DENOVO4 166 299 115 ...
... ... ... ... ...
The tabletotax command returns the “taxonomy tables” for each taxonomic level, e.g.:
OTU Mw_01 Mw_02 Mw_03 ...
Bacteria;Bacteroidetes 1363 1543 1168 ...
Bacteria;Cyanobacteria/Chloroplast 0 0 0 ...
Bacteria;Firmicutes 6257 5780 6761 ...
Bacteria;Lentisphaerae 0 1 0 ...
... ... ... ... ...
Sample data¶
The sample data file contains all of the information about the samples. In QIIME this file is called Mapping File. In micca, the sample data file must be a TAB-delimited text file (a row for each sample). The first column must be the sample identifier (assigned in merge, split or mergepairs):
ID Group Altitude
Mw_01 Mw1 492
Mw_02 Mw1 492
Mw_09 Mw1 492
Mw_12 Mw1 492
... ... ...
Phylogenetic tree¶
Only the Newick format is supported.
BIOM file¶
The tobiom command generates OTU/SV tables in the biom version 1.0 JSON file format (http://biom-format.org/documentation/format_versions/biom-1.0.html).
Changes¶
Version 1.7.0¶
- VSEARCH updated to version 2.7.1;
- Denoising protocol added to the otu command for Illumina reads (UNOISE);
- -S/–chim-abskew option added to micca otu command;
- Documentation updated;
- multithread option added to the mergepairs command.
Version 1.6.2 (bug fix release)¶
- Definitely fix the “new-line error” in classify;
- Fix bar plots in “stats”.
Version 1.6.1 (bug fix release)¶
- Fix the “new-line error” in classify.
Version 1.6.0¶
- swarm updated to version 2.1.12;
- Now the mergepairs command allows merging staggered reads by default.
With the new option
-n/--nostagger
the command produces the same results of the previous versions (<=1.5); - classify and tabletotax commands now strip the
D_X__
prefix from the Silva taxonomy files; - Documentation updated;
- Fix: remove duplicate file closing in micca.api.merge().
Version 1.5.0¶
- Now the NAST algorithm trims candidate sequences to that which is bound by the
beginning and end points of the alignment span; with the new option
--nast-notrim
inmicca msa
produces the same results of the previous version (<=1.4.0); - Y-scale in
micca tablebar
when--raw
fixed; - “text file busy error” in
micca msa
(nast) fixed; - pandas deprecation warnings fixed;
- documentation updated.
Version 1.4.0¶
- No-filter option added to
micca.api.msa.nast()
(do not remove positions which are gaps in every sequenceces) and to themsa
command (--nast-nofilter
option); - documentation improved.
Version 1.3.0¶
- Swarm clustering algorithm added to
micca otu
; micca.api.otu.denovo_swarm()
function added;- micca v1.3.0 includes: VSEARCH v1.9.5, MUSCLE v3.8.31, FastTree v2.1.8, swarm v2.1.8;
- documentation updated.
Version 1.2.2¶
- Mow micca can generate plots with matplotlib when the
DISPLAY
environment variable is undefined; MANIFEST.in
, Dockerfile updated.
Version 1.2.1¶
- Dockerfile added;
- Documentation improved.
Version 1.2.0¶
- Hits output file option added to
micca.api.msa.nast()
(hits_fn
option) and to themsa
command (--nast-hits
option); - setup.py improved.
Version 1.1.0¶
- Strand option added in
classify
(consensus-based classifier),msa
(NAST) andotu
(closed-reference and open reference OTU picking protolcols) commands. Now these commands search both strand (default) instead the plus strand only.
Version 1.0.0¶
- micca 1.0.0 includes: VSEARCH v1.9.5, MUSCLE v3.8.31, FastTree v2.1.8.
classify¶
usage: micca classify [-h] -i FILE -o FILE [-m {cons,rdp,otuid}] [-r FILE]
[-x FILE] [--cons-id CONS_ID]
[--cons-maxhits CONS_MAXHITS]
[--cons-minfrac CONS_MINFRAC]
[--cons-mincov CONS_MINCOV] [--cons-strand {both,plus}]
[--cons-threads THREADS]
[--rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}]
[--rdp-maxmem GB] [--rdp-minconf RDP_MINCONF]
micca classify assigns taxonomy for each sequence in the input file
and provides three methods for classification:
* VSEARCH-based consensus classifier (cons): input sequences are
searched in the reference database with VSEARCH
(https://github.com/torognes/vsearch). For each query sequence the
method retrives up to 'cons-maxhits' hits (i.e. identity >=
'cons-id'). Then, the most specific taxonomic label that is
associated with at least 'cons-minfrac' of the hits is
assigned. The method is similar to the UCLUST-based consensus
taxonomy assigner presented in doi: 10.7287/peerj.preprints.934v2
and available in QIIME.
* RDP classifier (rdp): only RDP classifier version >= 2.8 is
supported (doi:10.1128/AEM.00062-07). In order to use this
classifier RDP must be installed (download at
http://sourceforge.net/projects/rdp-classifier/files/rdp-classifier/)
and the RDPPATH environment variable setted. The available
databases (--rdp-gene) are:
- 16S (16srrna)
- Fungal LSU (28S) (fungallsu)
- Warcup ITS (fungalits_warcup, doi: 10.3852/14-293)
- UNITE ITS (fungalits_unite)
For more information about the RDP classifier go to
http://rdp.cme.msu.edu/classifier/classifier.jsp
* OTU ID classifier (otuid): simply perform a sequence ID matching
with the reference taxonomy file. Recommended strategy when the
closed reference clustering (--method closedref in micca-otu) was
performed. OTU ID classifier requires a tab-delimited file where
the first column contains the current OTU ids and the second column
the reference taxonomy ids (see otuids.txt in micca-otu), e.g.:
REF1[TAB]1110191
REF2[TAB]1104777
REF3[TAB]1078527
...
The input reference taxonomy file (--ref-tax) should be a
tab-delimited file where rows are either in the form:
1. SEQID[TAB]k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;
2. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales;;;
3. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
4. SEQID[TAB]D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__;D_5__;
Compatible reference database are Greengenes
(http://greengenes.secondgenome.com/downloads), QIIME-formatted SILVA
(https://www.arb-silva.de/download/archive/qiime/) and UNITE
(https://unite.ut.ee/repository.php).
The output file is a tab-delimited file where each row is in the
format:
SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTA file (for 'cons' and 'rdp') or a tab-
delimited OTU ids file (for 'otuid') (required).
-o FILE, --output FILE
output taxonomy file (required).
-m {cons,rdp,otuid}, --method {cons,rdp,otuid}
classification method (default cons)
-r FILE, --ref FILE reference sequences in FASTA format, required for
'cons' classifier.
-x FILE, --ref-tax FILE
tab-separated reference taxonomy file, required for
'cons' and 'otuid' classifiers.
VSEARCH-based consensus classifierspecific options:
--cons-id CONS_ID sequence identity threshold (0.0 to 1.0, default 0.9).
--cons-maxhits CONS_MAXHITS
number of hits to consider (>=1, default 3).
--cons-minfrac CONS_MINFRAC
for each taxonomic rank, a specific taxa will be
assigned if it is present in at least MINFRAC of the
hits (0.0 to 1.0, default 0.5).
--cons-mincov CONS_MINCOV
reject sequence if the fraction of alignment to the
reference sequence is lower than MINCOV. This
parameter prevents low-coverage alignments at the end
of the sequences (default 0.75).
--cons-strand {both,plus}
search both strands or the plus strand only (default
both).
--cons-threads THREADS
number of threads to use (1 to 256, default 1).
RDP Classifier/Database specific options:
--rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}
marker gene/region
--rdp-maxmem GB maximum memory size for the java virtual machine in GB
(default 2)
--rdp-minconf RDP_MINCONF
minimum confidence value to assign taxonomy to a
sequence (default 0.8)
Examples
Classification of 16S sequences using the consensus classifier and
Greengenes:
micca classify -m cons -i input.fasta -o tax.txt \
--ref greengenes_2013_05/rep_set/97_otus.fasta \
--ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt
Classification of ITS sequences using the RDP classifier and the
UNITE database:
micca classify -m rdp --rdp-gene fungalits_unite -i input.fasta \
-o tax.txt
OTU ID matching after the closed reference OTU picking protocol:
micca classify -m otuid -i otuids.txt -o tax.txt \
--ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt
convert¶
usage: micca convert [-h] -i FILE -o FILE [-q FILE] [-d DEFAULTQ]
[-f INPUT_FORMAT] [-F OUTPUT_FORMAT]
micca convert converts between sequence file formats. See
http://biopython.org/wiki/SeqIO#File_Formats for a comprehnsive list
of the supported file formats.
Supported input formats:
abi, abi-trim, ace, embl, embl-cds, fasta, fasta-qual, fastq, fastq-illumina,
fastq-sanger, fastq-solexa, gb, genbank, genbank-cds, ig, imgt, pdb-atom,
pdb-seqres, phd, pir, qual, seqxml, sff, sff-trim, swiss, tab, uniprot-xml
Supported output formats:
embl, fasta, fastq, fastq-illumina, fastq-sanger, fastq-solexa, gb, genbank,
imgt, phd, qual, seqxml, sff, tab
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input sequence file (required).
-o FILE, --output FILE
output sequence file (required).
-q FILE, --qual FILE input quality file (required for 'fasta-qual' input
format.
-d DEFAULTQ, --defaultq DEFAULTQ
default phred quality score for format-without-quality
to format-with-quality conversion (default 40).
-f INPUT_FORMAT, --input-format INPUT_FORMAT
input file format (default fastq).
-F OUTPUT_FORMAT, --output-format OUTPUT_FORMAT
input file format (default fasta).
Examples
Convert FASTA+QUAL files into a FASTQ (Sanger/Illumina 1.8+) file:
micca convert -i input.fasta -q input.qual -o output.fastq \
-f fasta-qual -F fastq
Convert a SFF file into a FASTQ (Sanger/Illumina 1.8+) file:
micca convert -i input.sff -o output.fastq -f sff -F fastq
filter¶
usage: micca filter [-h] -i FILE -o FILE [-e MAXEERATE] [-m MINLEN] [-t]
[-n MAXNS] [-f {fastq,fasta}]
micca filter filters sequences according to the maximum allowed
expected error (EE) rate %%. Optionally, you can:
* discard sequences that are shorter than the specified length
(suggested for Illumina overlapping paired-end (already merged)
reads) (option --minlen MINLEN);
* discard sequences that are shorter than the specified length AND
truncate sequences that are longer (suggested for Illumina and 454
unpaired reads) (options --minlen MINLEN --trunc);
* discard sequences that contain more than a specified number of Ns
(--maxns).
Sequences are first shortened and then filtered. Overlapping paired
reads with should be merged first (using micca-mergepairs) and then
filtered.
The expected error (EE) rate %% in a sequence of length L is defined
as (doi: 10.1093/bioinformatics/btv401):
sum(error probabilities)
EE rate %% = ------------------------ * 100
L
Before filtering, run 'micca filterstats' to see how many reads will
pass the filter at different minimum lengths with or without
truncation, given a maximum allowed expected error rate %% and maximum
allowed number of Ns.
micca-filter is based on VSEARCH (https://github.com/torognes/vsearch).
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTQ file, Sanger/Illumina 1.8+ format
(phred+33) (required).
-o FILE, --output FILE
output FASTA/FASTQ file (required).
-e MAXEERATE, --maxeerate MAXEERATE
discard sequences with more than the specified expeced
error rate % (values <=1%, i.e. less or equal than one
error per 100 bases, are highly recommended).
Sequences are discarded after truncation (if enabled)
(default 1).
-m MINLEN, --minlen MINLEN
discard sequences that are shorter than MINLEN
(default 1).
-t, --trunc truncate sequences that are longer than MINLEN
(disabled by default).
-n MAXNS, --maxns MAXNS
discard sequences with more than the specified number
of Ns. Sequences are discarded after truncation
(disabled by default).
-f {fastq,fasta}, --output-format {fastq,fasta}
file format (default fasta).
Examples
Truncate reads at 300 bp, discard low quality sequences
(with EE rate > 0.5%%) and write a FASTA file:
micca filter -i reads.fastq -o filtered.fasta -m 300 -t -e 0.5
filterstats¶
usage: micca filterstats [-h] -i FILE [-o DIR] [-t TOPN]
[-e MAXEERATES [MAXEERATES ...]] [-n MAXNS]
micca filterstats reports the fraction of reads that would pass for each
specified maximum expected error (EE) rate %% and the maximum number of
allowed Ns after:
* discarding sequences that are shorter than the specified length
(suggested for Illumina overlapping paired-end (already merged)
reads);
* discarding sequences that are shorter than the specified length AND
truncating sequences that are longer (suggested for Illumina and 454
unpaired reads);
Parameters for the 'micca filter' command should be chosen for each
sequencing run using this tool.
micca filterstats returns in the output directory 3 files:
* filterstats_minlen.txt: fraction of reads that would pass the filter after
the minimum length filtering;
* filterstats_trunclen.txt: fraction of reads that would pass the filter after
the minimum length filtering + truncation;
* filterstats_plot.png: plot in PNG format.
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTQ file, Sanger/Illumina 1.8+ format
(phred+33) (required).
-o DIR, --output DIR output directory (default .).
-t TOPN, --topn TOPN perform statistics on the first TOPN sequences
(disabled by default)
-e MAXEERATES [MAXEERATES ...], --maxeerates MAXEERATES [MAXEERATES ...]
max expected error rates (%). (default [0.25, 0.5,
0.75, 1, 1.25, 1.5])
-n MAXNS, --maxns MAXNS
max number of Ns. (disabled by default).
Examples
Compute filter statistics on the top 10000 sequences, predicting
the fraction of reads that would pass for each maximum EE error
rate (default values):
micca filterstats -i input.fastq -o stats -t 10000
merge¶
usage: micca merge [-h] -i FILE [FILE ...] -o FILE [-s SEP] [-f {fastq,fasta}]
micca merge merges several FASTQ or FASTA files in a single file.
Different samples will be merged in a single file and sample names
will be appended to the sequence identifier
(e.g. >SEQID;sample=SAMPLENAME). Sample names are defined as the
leftmost part of the file name splitted by the first occurence of '.'
(-s/--sep option). Whitespace characters in names will be replaced
with a single character underscore ('_').
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE [FILE ...], --input FILE [FILE ...]
input FASTQ/FASTA file(s) (required).
-o FILE, --output FILE
output FASTQ/FASTA file (required).
-s SEP, --sep SEP Sample names are defined as the leftmost part of the
file name splitted by the first occurence of 'SEP'
(default .)
-f {fastq,fasta}, --format {fastq,fasta}
file format (default fastq).
Examples
Merge files in FASTA format:
micca merge -i in1.fasta in2.fasta in3.fasta -o merged.fasta \
-f fasta
mergepairs¶
usage: micca mergepairs [-h] -i FILE [FILE ...] -o FILE [-r FILE]
[-l MINOVLEN] [-d MAXDIFFS] [-p PATTERN] [-e REPL]
[-s SEP] [-n] [--notmerged-fwd FILE]
[--notmerged-rev FILE] [-t THREADS]
micca mergepairs merges paired-end sequence reads into one sequence.
A single merging of a pair of FASTQ files can be simply performed
using both -i/--input and -r/--reverse options.
When the option -r/--reverse is not specified:
1. you can indicate several forward files (with the option -i/--input);
2. the reverse file name will be constructed by replacing the string
'_R1' in the forward file name with '_R2' (typical in Illumina
file names, see options -p/--pattern and -e/--repl);
3. after the merging of the paired reads, different samples will be
merged in a single file and sample names will be appended to the
sequence identifier (e.g. >SEQID;sample=SAMPLENAME), as in 'micca
merge' and 'micca split'. Sample names are defined as the leftmost
part of the file name splitted by the first occurence of '_'
(-s/--sep option). Whitespace characters in names will be replaced
with a single character underscore ('_').
micca mergepairs wraps VSEARCH (https://github.com/torognes/vsearch).
Statistical testing of significance is performed in a way similar to
PEAR (doi: 10.1093/bioinformatics/btt593). The quality of merged bases
is computed as in USEARCH (doi: 10.1093/bioinformatics/btv401).
By default staggered read pairs (staggered pairs are pairs where the 3'
end of the reverse read has an overhang to the left of the 5’ end
of the forward read) will be merged. To override this feature (and
therefore to discard staggered alignments) set the -n/--nostagger
option.
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE [FILE ...], --input FILE [FILE ...]
forward FASTQ file(s), Sanger/Illumina 1.8+ format
(phred+33) (required).
-o FILE, --output FILE
output FASTQ file (required).
-r FILE, --reverse FILE
reverse FASTQ file, Sanger/Illumina 1.8+ format
(phred+33).
-l MINOVLEN, --minovlen MINOVLEN
minimum overlap length (default 32).
-d MAXDIFFS, --maxdiffs MAXDIFFS
maximum number of allowed mismatches in the overlap
region (default 8).
-p PATTERN, --pattern PATTERN
when the reverse filename is not specified, it will be
constructed by replacing 'PATTERN' in the forward file
name with 'REPL' (default _R1).
-e REPL, --repl REPL when the reverse filename is not specified, it will be
constructed by replacing 'PATTERN' in the forward file
name with 'REPL' (default _R2).
-s SEP, --sep SEP when the reverse file name is not specified, sample
names are appended to the sequence identifier (e.g.
>SEQID;sample=SAMPLENAME). Sample names are defined as
the leftmost part of the file name splitted by the
first occurence of 'SEP' (default _)
-n, --nostagger forbid the merging of staggered read pairs. Without
this option the command will merge staggered read
pairs and the 3' overhang of the reverse read will be
not included in the merged sequence.
--notmerged-fwd FILE write not merged forward reads.
--notmerged-rev FILE write not merged reverse reads.
-t THREADS, --threads THREADS
number of threads to use (1 to 256, default 1).
Examples
Merge reads with a minimum overlap length of 50 and maximum number
of allowed mismatches of 3:
micca mergepairs -i reads1.fastq -r reads2.fastq -o merged.fastq \
-l 50 -d 3
Merge several illumina paired reads (typically named *_R1*.fastq and
*_R2*.fastq):
micca mergepairs -i *_R1*.fastq -o merged.fastq --notmerged-fwd \
notmerged_fwd.fastq --notmerged-rev notmerged_rev.fastq
msa¶
usage: micca msa [-h] -i FILE -o FILE [-m {muscle,nast}]
[--muscle-maxiters MUSCLE_MAXITERS] [--nast-template FILE]
[--nast-id NAST_ID] [--nast-threads NAST_THREADS]
[--nast-mincov NAST_MINCOV] [--nast-strand {both,plus}]
[--nast-notaligned FILE] [--nast-hits FILE] [--nast-nofilter]
[--nast-notrim]
micca msa performs a multiple sequence alignment (MSA) on the input
file in FASTA format. micca msa provides two approaches for MSA:
* MUSCLE (doi: 10.1093/nar/gkh340). It is one of the most widely-used
multiple sequence alignment software;
* Nearest Alignment Space Termination (NAST) (doi:
10.1093/nar/gkl244). MICCA provides a very fast and memory
efficient implementation of the NAST algorithm. The algorithm is
based on VSEARCH (https://github.com/torognes/vsearch). It requires
a pre-aligned database of sequences (--nast-template). For 16S
data, a good template file is the Greengenes Core Set
(http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/
core_set_aligned.fasta.imputed).
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTA file (required).
-o FILE, --output FILE
output MSA file in FASTA format (required).
-m {muscle,nast}, --method {muscle,nast}
multiple sequence alignment method (default muscle).
MUSCLE specific options:
--muscle-maxiters MUSCLE_MAXITERS
maximum number of MUSCLE iterations. Set to 2 for a
good compromise between speed and accuracy (>=1
default 16).
NAST specific options:
--nast-template FILE multiple sequence alignment template file in FASTA
format.
--nast-id NAST_ID sequence identity threshold to consider a sequence a
match (0.0 to 1.0, default 0.75).
--nast-threads NAST_THREADS
number of threads to use (1 to 256, default 1).
--nast-mincov NAST_MINCOV
reject sequence if the fraction of alignment to the
template sequence is lower than MINCOV. This parameter
prevents low-coverage alignments at the end of the
sequences (default 0.75).
--nast-strand {both,plus}
search both strands or the plus strand only (default
both).
--nast-notaligned FILE
write not aligned sequences in FASTA format.
--nast-hits FILE write hits on a TAB delimited file with the query
sequence id, the template sequence id and the
identity.
--nast-nofilter do not remove positions which are gaps in every
sequenceces (useful if you want to apply a Lane mask
filter before the tree inference).
--nast-notrim force to align the entire candidate sequence (i.e. do
not trim the candidate sequence to that which is bound
by the beginning and end points of of the alignment
span
Examples
De novo MSA using MUSCLE:
micca msa -i input.fasta -o msa.fasta
Template-based MSA using NAST, the Greengenes alignment as
template (clustered at 97% similarity) 4 threads and a sequence
identity threshold of 75%:
micca msa -i input.fasta -o msa.fasta -m nast --nast-threads 4 \
--nast-template greengenes_2013_05/rep_set_aligned/97_otus.fasta
otu¶
See OTU picking and Denoising for details.
usage: micca otu [-h] -i FILE [-o DIR] [-r FILE]
[-m {denovo_greedy,denovo_unoise,denovo_swarm,closed_ref,open_ref}]
[-d ID] [-n MINCOV] [-t THREADS] [-g {dgc,agc}] [-s MINSIZE]
[-a {both,plus}] [-c] [-S CHIM_ABSKEW]
[--swarm-differences SWARM_DIFFERENCES] [--swarm-fastidious]
[--unoise-alpha UNOISE_ALPHA]
micca otu assigns similar sequences (marker genes such as 16S rRNA and
the fungal ITS region) to operational taxonomic units (OTUs) or sequence
variants (SVs).
Trimming the sequences to a fixed position before clustering is
*strongly recommended* when they cover partial amplicons or if quality
deteriorates towards the end (common when you have long amplicons and
single-end sequencing).
Removing ambiguous nucleotides 'N' (with the option --maxns 0 in micca
filter) is mandatory if you use the de novo swarm clustering method.
micca otu provides the following protocols:
* de novo greedy clustering (denovo_greedy): useful for for the
identification of 97% OTUs;
* de novo unoise (denovo_unoise): denoise Illumina sequences using
the UNOISE3 protocol;
* de novo swarm (denovo_swarm): a robust and fast clustering method
(deprecated, it will be removed in version 1.8.0);
* closed-reference clustering (closed_ref): sequences are clustered
against an external reference database and reads that could not be
matched are discarded.
* open-reference clustering (open_ref): sequences are clustered
against an external reference database and reads that could not be
matched are clustered with the 'de novo greedy' protocol.
Outputs:
* otutable.txt: OTU x sample, TAB-separated OTU table file,
containing the number of times an OTU is found in each sample.
* otus.fasta: FASTA file containing the representative sequences (OTUs);
* otuids.txt: OTU ids to original sequence ids (tab-delimited text
file);
* hits.txt: three-columns, TAB-separated file with matching sequence,
representative (seed) and identity (if available, else '*');
* otuschim.fasta (only for 'denovo_greedy', 'denovo_swarm' and
'open_ref' when --rmchim is specified): FASTA file containing the
chimeric otus.
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input fasta file (required).
-o DIR, --output DIR output directory (default .).
-r FILE, --ref FILE reference sequences in fasta format, required for
'closed_ref' and 'open_ref' clustering methods.
-m {denovo_greedy,denovo_unoise,denovo_swarm,closed_ref,open_ref}, --method {denovo_greedy,denovo_unoise,denovo_swarm,closed_ref,open_ref}
clustering method (default denovo_greedy)
-d ID, --id ID sequence identity threshold (for 'denovo_greedy',
'closed_ref' and 'open_ref', 0.0 to 1.0, default
0.97).
-n MINCOV, --mincov MINCOV
reject sequence if the fraction of alignment to the
reference sequence is lower than MINCOV (for
'closed_ref' and 'open_ref' clustering methods,
default 0.75).
-t THREADS, --threads THREADS
number of threads to use (1 to 256, default 1).
-g {dgc,agc}, --greedy {dgc,agc}
greedy clustering strategy, distance (DGC) or
abundance-based (AGC) (for 'denovo_greedy' and
'open_ref' clustering methods) (default dgc).
-s MINSIZE, --minsize MINSIZE
discard sequences with an abundance value smaller than
MINSIZE after dereplication (>=1, default values are 2
for 'denovo_greedy' and 'open_ref', 1 for
'denovo_swarm' and 8 for 'denovo_unoise').
-a {both,plus}, --strand {both,plus}
search both strands or the plus strand only (for
'closed_ref' and 'open_ref' clustering methods,
default both).
Chimera removal specific options:
-c, --rmchim remove chimeric sequences (ignored in method
'closed_ref'
-S CHIM_ABSKEW, --chim-abskew CHIM_ABSKEW
abundance skew. It is used to distinguish in a three-
way alignment which sequence is the chimera and which
are the parents. If CHIM_ABSKEW=2.0, the parents
should be at least 2 times more abundant than their
chimera (defaults values are 16.0 for 'denovo_unoise',
2.0 otherwise).
Swarm specific options:
--swarm-differences SWARM_DIFFERENCES
maximum number of differences allowed between two
amplicons. Commonly used d values are 1 (linear
complexity algorithm), 2 or 3, rarely higher. (>=0,
default 1).
--swarm-fastidious when working with SWARM_DIFFERENCES=1, perform a
second clustering pass to reduce the number of small
OTUs (recommended option).
UNOISE specific options:
--unoise-alpha UNOISE_ALPHA
specify the alpha parameter (default 2.0).
Examples
De novo clustering with a 97% similarity threshold and remove
chimeric OTUs:
micca otu -i input.fasta --method denovo_greedy --id 0.97 -c
Open-reference OTU picking protocol with a 97% similarity
threshold, without removing chimeras in the de novo protocol step
and using 8 threads:
micca otu -i input.fasta --method open_ref --threads 8 --id 0.97 \
--ref greengenes_2013_05/rep_set/97_otus.fasta
De novo swarm clustering with the protocol using 4 threads:
micca otu -i input.fasta --method denovo_swarm --threads 4 \
--swarm-fastidious --rmchim --minsize 1
root¶
usage: micca root [-h] -i FILE -o FILE [-m {midpoint,outgroup}]
[-t TARGETS [TARGETS ...]]
micca root reroot the input tree:
* at the calculated midpoint between the two most distant tips of the
tree (--method midpoint);
* with the outgroup clade containing the given taxa (leaf nodes),
i.e. the common ancestor of the outgroup (--method outgroup).
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTA file (required).
-o FILE, --output FILE
output MSA file in FASTA format (required).
-m {midpoint,outgroup}, --method {midpoint,outgroup}
rooting method (default midpoint).
-t TARGETS [TARGETS ...], --targets TARGETS [TARGETS ...]
list of targets defining the outgroup (required for
the outgroup method).
Examples
Midpoint rooting:
micca root -i input.tree -o input_rooted.tree
Rooting with outgroup:
micca root -i input.tree -o input_rooted.tree -m outgroup DENOVO1 DENOVO2
split¶
usage: micca split [-h] -i FILE -o FILE -b FILE [-n FILE] [-c FILE] [-s N]
[-e MAXE] [-t] [-f {fastq,fasta}]
micca split assign the multiplexed reads to samples based on their 5'
nucleotide barcode (demultiplexing) provided by the FASTA file
(--barcode). micca split creates a single FASTQ or FASTA file with
sample information (e.g. >SEQID;sample=SAMPLENAME) appended to the
sequence identifier. Barcode and the sequence preceding it is removed
by default, e.g.:
Barcode file: Input file:
>SAMPLE1 >SEQ1
TCAGTCAG TCAGTCAGGCCACGGCTAACTAC...
... ...
the output will be:
>SEQ1;sample=SAMPLE1
GCCACGGCTAACTAC...
...
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTQ/FASTA file (required).
-o FILE, --output FILE
output FASTQ/FASTA file (required).
-b FILE, --barcode FILE
barcode file in FASTA format (required).
-n FILE, --notmatched FILE
write reads in which no barcode was found.
-c FILE, --counts FILE
write barcode counts in a tab-delimited file.
-s N, --skip N skip N bases before barcode matching (e.g. if your
sequences start with the control sequence 'TCAG'
followed by the barcode, set to 4) (>=0, default 0).
-e MAXE, --maxe MAXE maximum number of allowed errors (>=0, default 1).
-t, --notrim do not trim barcodes and the sequence preceding it
from sequences.
-f {fastq,fasta}, --format {fastq,fasta}
file format (default fastq).
Examples
Split 'reads.fastq' and write the notmatched sequences in the
file 'notmatched.fastq':
micca split -i input.fastq -o splitted.fastq -b barcode.fasta \
-n notmatched.fastq
stats¶
usage: micca stats [-h] -i FILE [-o DIR] [-n TOPN]
micca stats reports statistics on reads in a FASTQ file. micca stats
returns in the output directory 3 tab-delimited text files:
* stats_lendist.txt: length distribution;
* stats_qualdist.txt: Q score distribution;
* stats_qualsumm.txt: quality summary. For each read position, the
following statistics are reported:
- L: read position;
- NPctCum: percent of reads with at least L bases;
- QAv: average Q score at position L;
- EERatePctAv: average expected error (EE) rate %.
Moreover, micca stats returns the respective plots in PNG format,
stats_lendist_plot.png, stats_qualdist_plot.png, and
stats_qualsumm_plot.png.
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTQ file, Sanger/Illumina 1.8+ format
(phred+33) (required).
-o DIR, --output DIR output directory (default .).
-n TOPN, --topn TOPN perform statistics only on the first TOPN sequences
(disabled by default).
Examples
Compute statistics on the top 10000 sequences of input.fastq:
micca stats -i input.fastq -o stats -n 10000
tablebar¶
usage: micca tablebar [-h] -i FILE -o FILE [-r] [-t N] [--xticklabelsize SIZE]
[-f {pgf,ps,svg,rgba,raw,svgz,pdf,eps,png}]
micca tablebar generates a relative abundance bar plot from
OTU or taxa tables. The table must be an OTU/taxon x sample,
TAB-separated file (see 'micca otu').
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input OTU table file (required).
-o FILE, --output FILE
output image file (required).
-r, --raw plot raw values (i.e. counts) instead of relative
abundances.
-t N, --topn N plot the top N abundant taxa (default 12).
--xticklabelsize SIZE
x tick label size (default 8).
-f {pgf,ps,svg,rgba,raw,svgz,pdf,eps,png}, --format {pgf,ps,svg,rgba,raw,svgz,pdf,eps,png}
output file format (default png).
Example
micca tablebar -i otutable.txt -o otutable_plot.png
tablerare¶
usage: micca tablerare [-h] -i FILE -o FILE -d DEPTH [-r] [-s SEED]
Rarefy an OTU table by subsampling, with or without
replacement. Samples that have fewer counts then the depth are
omitted from the output table. OTUs that are not present in at
least one sample are omitted from the output table.
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input OTU table file (required).
-o FILE, --output FILE
output rarefied OTU table file (required).
-d DEPTH, --depth DEPTH
sample depth (>0, required).
-r, --replace subsample with replacement.
-s SEED, --seed SEED random seed (default 0).
Examples
Rarefy an OTU table at a depth of 1000 sequences/sample:
micca tablerare -i otutable.txt -o otutable_rare.txt -d 1000
tablestats¶
usage: micca tablestats [-h] -i FILE [-o DIR] [-t STEP] [-r] [-s SEED]
micca tablestats reports a sample summary, an OTU summary and
the rarefaction curves for the input OTU table. The
rarefaction curves are evaluated using the interval of 'step'
(-t/--step) sample depths, always including 1 and the total
sample size.
micca filterstats returns in the output directory 4 files:
* tablestats_samplesumm.txt: samples summary;
* tablestats_otusumm.txt: OTUs summary;
* tablestats_rarecurve.txt: rarefaction curves in text format.
* tablestats_rarecurve_plot.txt: rarefaction curves in png format.
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTQ file, Sanger/Illumina 1.8+ format
(phred+33) (required).
-o DIR, --output DIR output directory (default .).
-t STEP, --step STEP sample depth interval (for rarefaction curves, default
500).
-r, --replace subsample with replacement (for rarefaction curves).
-s SEED, --seed SEED random seed (for rarefaction curves, default 0).
Examples
Compute OTU table statistics on otutable.txt:
micca tablestats -i otutable.txt -o tablestats
tabletotax¶
usage: micca tabletotax [-h] -i FILE -t FILE [-o DIR]
Given an OTU table and a taxonomy file, micca tabletotax
creates in the output directory a table for each taxonomic
level (taxtable1.txt, ..., taxtableN.txt). OTU counts are
summed together if they have the same taxonomy at the
considered level.
The OTU table must be an OTU x sample, TAB-separated OTU table
file (see 'micca otu'). The taxonomy file must be a
tab-delimited file where where rows are either in the form
(see 'micca classify'):
1. SEQID[TAB]k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;
2. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales;;;
3. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
4. SEQID[TAB]D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__;D_5__;
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input OTU table file (required).
-t FILE, --tax FILE input taxonomy file (required).
-o DIR, --output DIR output directory (default .).
Examples
micca tabletotax -i otutable.txt -t tax.txt -o taxtables
tobiom¶
usage: micca tobiom [-h] -i FILE -o FILE [-t FILE] [-s FILE] [-u FILE]
micca tobiom converts the micca OTU table into BIOM Version
1.0 (JSON) format. Optionally, taxonomy and/or sample
information can be added. When you convert on
(closed-reference) OTU table for PICRUSt, replace OTU IDs with
the original sequence IDs use the option -u/--otuids.
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input OTU table file (required).
-o FILE, --output FILE
output BIOM file Version 1.0 (JSON) (required).
-t FILE, --tax FILE add taxonomy information from a taxonomy file.
-s FILE, --sampledata FILE
add sample information from a sample data file.
-u FILE, --otuids FILE
replace OTU IDs with the original sequence IDs. Useful
when the closed-reference OTU picking protocol was
performed for PICRUSt
Example
micca tobiom -i otutable.txt -o output.biom -t tax.txt -s sampledata.txt
tree¶
usage: micca tree [-h] -i FILE -o FILE [-m {fasttree,muscle}] [--fasttree-gtr]
[--fasttree-fastest]
[--muscle-cluster {upgmb,upgma,neighborjoining}]
micca tree infers phylogenetic trees from alignments. It provides two
methods:
* FastTree (doi: 10.1371/journal.pone.0009490);
* MUSCLE (doi: 10.1093/nar/gkl244).
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTA file (required).
-o FILE, --output FILE
output tree in Newick format (required).
-m {fasttree,muscle}, --method {fasttree,muscle}
tree inference method (default fasttree).
FastTree specific options:
--fasttree-gtr use the generalized time-reversible (GTR)+CAT model
instead of Jukes-Cantor+CAT (default False).
--fasttree-fastest speed up the neighbor joining phase and reduce memory
usage recommended for >50,000 sequences) (default
False).
MUSCLE specific options:
--muscle-cluster {upgmb,upgma,neighborjoining}
clustering algorithm (default upgmb).
Examples
Tree inference using FastTree and the generalized time-reversible
(GTR)+CAT model:
micca tree -i input.fasta -o tree.tree --fasttree-gtr
trim¶
usage: micca trim [-h] -i FILE -o FILE [-w FORWARD [FORWARD ...]]
[-r REVERSE [REVERSE ...]] [-e MAXERATE] [-c] [-W] [-R]
[-f {fastq,fasta}]
micca trim trims forward and reverse primers from a FASTQ/FASTA file
using Cutadapt (doi: 10.14806/ej.17.1.200) internally. Primer and the
sequence preceding (for forward) or succeding (for reverse) it are
removed. Optionally, reads that do not contain the primers (untrimmed
reads) can be discarded with the options -W/--duforward and
-R/--dureverse. Recommended options are:
* always discard reads that do not contain the forward primer
(-W/--duforward option);
* for overlapping paired-end (already merged) reads, also discard
reads that do not contain the reverse primer (using both
-W/--duforward and -R/--dureverse options).
IUPAC codes and multiple primers are supported.
optional arguments:
-h, --help show this help message and exit
arguments:
-i FILE, --input FILE
input FASTQ or FASTA file (required).
-o FILE, --output FILE
output FASTQ or FASTA file (required).
-w FORWARD [FORWARD ...], --forward FORWARD [FORWARD ...]
trim forward primer(s). Only the best matching primer
is removed.
-r REVERSE [REVERSE ...], --reverse REVERSE [REVERSE ...]
trim reverse primer(s). Only the best matching primer
is removed.
-e MAXERATE, --maxerate MAXERATE
maximum allowed error rate (default 0.1).
-c, --searchrc search reverse complement primers too (default False).
-W, --duforward discard untrimmed reads (reads that do not contain the
forward primer) (always recommended) (default False).
-R, --dureverse discard untrimmed reads (reads that do not contain the
reverse primer) (suggested option for overlapping
paired-end already merged reads) (default False).
-f {fastq,fasta}, --format {fastq,fasta}
file format (default fastq).
Examples
454 or Illumina single-end reads: trim forward primer and discard reads
that do not contain it. Moreover, trim reverse primer:
micca trim -i input.fastq -o trimmed.fastq -w AGGATTAGATACCCTGGTA \
-r CRRCACGAGCTGACGAC -W
Illumina overlapping paired-end (already merged) reads: trim
forward and reverse primers. Reads that do not contain the forward
or the reverse primer will be discarded:
micca trim -i reads.fastq -o trimmed.fastq -w AGGATTAGATACCCTGGTA \
-r CRRCACGAGCTGACGAC -W -R