MIGEC: Molecular Identifier Guided Error Correction pipeline¶
FEATURES¶
- De-multiplexing, adapter trimming and read overlapping for NGS data. Extraction of UMI sequences
- Assembly of consensuses for original molecules that entered library preparation by grouping reads with identical molecular identifiers
- Mapping of V, D and J segments, extraction of CDR3 regions and clonotype assembly for all human and mouse immune receptors (TRA/TRB/TRG/TRD and IGH/IGL/IGK)
- Additional filtering of hot-spot errors in CDR3 sequence
- Flexible and straightforward batch processing
- Currently all species-gene pairs that have germline sequences (based on IMGT) that allow CDR3 identification are supported
Species | Gene |
---|---|
HomoSapiens | TRA, TRB, TRG, TRD, IGL, IGK, IGH |
MusMusculus | TRB, TRG, TRD, IGL, IGK, IGH |
MacacaMulatta | TRB, IGK, IGH |
OryctolagusCuniculus | IGL, IGK, IGH |
RattusNorvegicus | IGL, IGH |
CanisLupusFamiliaris | TRB, TRG |
SusScrofa | IGL, IGK |
BosTaurus | TRD |
MusSpretus | IGL |
GallusGallus | TRB |
AnasPlatyrhynchos | TRB |
Table of contents:¶
Installing and running¶
The pipeline is written in Groovy (a Java scripting language) and distributed as an executable JAR. To install it get the latest JRE and download the executable from releases section.
To ran a specific script from the pipeline, say Checkout, execute
java -jar MIGEC-$VERSION.jar Checkout [arguments]
Where $VERSION
stands for pipeline version (e.g. 1.2.1), this notation is
omitted in MIGEC routine documentation.
To view the list of available scripts execute:
java -jar MIGEC-$VERSION.jar -h
alternatively you can download the repository and compile it from source using Maven (requires Maven version 3.0)
git clone https://github.com/mikessh/MIGEC.git
cd MIGEC/
mvn clean install
java -jar target/MIGEC-$VERSION.jar
This should show you the list of available MIGEC routines.
Note
The data from 454 platform should be used with caution, as it contains
homopolymer errors which (in present framework) result in reads dropped
during consensus assembly. The 454 platform has a relatively low read
yield, so additional read dropping could result in over-sequencing level
below required threshold. If you still wish to give it a try, we would
recommend filtering off all short reads and repairing indels with
Coral, the latter
should be run with options -mr 2 -mm 1000 -g 3
.
Warning
NCBI-BLAST+ package is required. Could be directly installed on Linux using a command like $sudo apt-get ncbi-blast+ or downloaded and installed directly from here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Warning
Consider providing sufficient memory for the pipeline, i.e. 8Gb for
MiSeq or 36Gb for HiSeq sample, depending on sample sequence diversity
and current script (CdrBlast requires has the highest memory
requirements). To do so, execute the script with -Xmx
argument:
java -Xmx8G -jar MIGEC-$VERSION.jar CdrBlast [arguments]
.
If insufficient amount memory is allocated, the Java Virtual Machine
could drop with a Java Heap Space Out of Memory error.
The pipeline¶
All routines in the pipeline are available in manual and batch variants. Batch variants are designed to automatically handle several input samples with minimal shell scripting glue between analysis steps, this is also the recommended way to use with MIGEC.
List of MIGEC batch routines:
If the “barcodes” file is set properly, the entire pipeline can be written as following:
MIGEC="java -Xmx8G -jar MIGEC-$VERSION.jar"
$MIGEC CheckoutBatch -cu barcodes.txt checkout/
$MIGEC Histogram checkout/ histogram/
$MIGEC AssembleBatch -c checkout/ histogram/ assemble/
$MIGEC CdrBlastBatch -R TRB checkout/ assemble/ cdrblast/
$MIGEC FilterCdrBlastResultsBatch cdrblast/ cdrfinal/
Manual usage¶
List of MIGEC manual routines:
An example for a 300bp paired-end MiSeq run of IGH library on a 16Gb RAM Unix server. Such sequencing read length allows complete IGH sequencing, thus mate pairs overlap. First barcodes.txt should be created containing adapter sequences, see the section below for guidelines. Then, assuming that the corresponding FASTQ files are IGH_SAMPLE_R1.fastq.gz and IGH_SAMPLE_R2.fastq.gz, UMI- and multiplex index-containing adapter is near 5’UTR of V segment (so the CDR3 is in mate#2 after reads are oriented) and NCBI-BLAST+ is installed, run all 5 stages of the pipeline using the following command:
$MIGEC Checkout -cute --overlap barcodes.txt IGH_S1-10_R1.fastq.gz IGH_S1-10_R2.fastq.gz checkout/
$MIGEC Histogram checkout/ histogram/
$MIGEC Assemble -c --mask 0:0 checkout/S1_R12.fastq.gz . assembly/
$MIGEC CdrBlast -R IGH checkout/S1_R12.fastq.gz cdrblast/S1_raw.txt
$MIGEC CdrBlast -a -R IGH assembly/S1_R12.fastq.gz cdrblast/S1_asm.txt
$MIGEC FilterCdrBlastResults cdrblast/S1_asm.txt cdrblast/S1_raw.txt cdrfinal/S1.txt
Note
As dot (.) is reserved by MIGEC to specify empty metadata fields and unused arguments, use ./ in case you want to point to current directory.
Full-length immunoglobulin data analysis¶
MIGEC can be used as pre-processing tool to assemble full-length consensuses
for further post-analysis with HIgBlast
tool. Note that due to typically poor quality of MiSEQ 300+300bp reads, the
--overlap
mode of Checkout routine is not guaranteed to perform well. Instead, we recommend
to assemble consensuses first and then perform overlapping using
external tools. For example, MiTools
merge
action can be used with the --same-strand
option specified, the latter is
critical as assembled consensuses are on the same strand in output in contrast to
normal orientation of Illumina reads.
Consensus quality and overlap efficiency can be greatly improved using
the --only-first-read
option of Histogram and Assemble routines. If set,
this option instructs routines to use only the first read that typically has higher quality
than the second one. This applies to non-oriented reads and works better for
asymmetric sequencing design, e.g. 400+200bp reads.
De-multiplexing¶
Checkout-batch¶
Description
A script to perform de-multiplexing and UMI tag extraction for a set of FASTQ files that were previously split using Illumina sample indices.
Usage
General:
java -jar migec.jar CheckoutBatch [options] barcodes_file output_dir
The barcodes file specifies sample multiplexing and UMI (NNN.. region) extraction rules. It has the same structure as for “manual” Checkout (see section below), with additional two columns that specify input FASTQ file names.
Sample ID | Master barcode sequence | Slave barcode sequence | Read#1 FASTQ | Read#2 FASTQ |
---|---|---|---|---|
S0 | acgtacgtAGGTTAcadkgag | |||
S1 | acgtacgtGGTTAAcadkgag | ctgkGTTCaat | ILM1_R1_L001.fastq.gz | ILM1_R2_L001.fastq.gz |
S1 | acgtacgtAAGGTTcadkgagNNNNNN | ILM2_R1_L001.fastq.gz | ILM2_R2_L001.fastq.gz | |
S3 | acgtacgtTAAGGTcadkgagNNNNNN | NNNNNNctgkGTTCaat | ILM1_R1_L001.fastq.gz | ILM1_R2_L001.fastq.gz |
The following rules apply:
- All specified FASTQ files are sequentially processed using Checkout
- If no FASTQ file is specified for a given barcode, it will be searched in all FASTQ files
- CheckoutBatch will properly aggregate reads from multiple FASTQ files that have the same sample id
- Still there should not be the case when a FASTQ file has the same barcode specified more than once
Parameters
Same as in manual version of Checkout, see below.
Output format
The Checkout routine produces files in the FASTQ format that have a specific
UMI
field added to the header. Each read successfully matched by Checkout
will be output as follows:
@ILLUMINA_HEADER UMI:NNNN:QQQQ
ATAGATTATGAGTATG
+
##II#IIIIIIIIIII
The original read header (ILLUMINA_HEADER
here) is preserved, the
appended UMI:NNNN:QQQQ
contains the sequence of the UMI tag (NNNN
bases)
and its quality string (QQQQ
).
Checkout-manual¶
Description
A script to perform de-multiplexing and UMI tag extraction
Usage
General:
java -jar migec.jar Checkout [options] barcodes_file R1.fastq[.gz] [. or R2.fastq[.gz]] output_dir
For paired-end data:
java -jar migec.jar Checkout -cute barcodes.txt R1.fastq.gz R2.fastq.gz ./checkout/
For unpaired library:
java -jar migec.jar Checkout -cute barcodes.txt R.fastq.gz . ./checkout/
For overlapping paired reads:
java -jar migec.jar Checkout -cute --overlap barcodes.txt R1.fastq.gz R2.fastq.gz . checkout/
accepted barcodes.txt format is a tab-delimited table with the following structure:
Sample ID | Master barcode sequence | Slave barcode sequence |
---|---|---|
S0 | acgtacgtAGGTTAcadkgag | |
S1 | acgtacgtGGTTAAcadkgag | ctgkGTTCaat |
S2 | acgtacgtAAGGTTcadkgagNNNNNN | |
S3 | acgtacgtTAAGGTcadkgagNNNNNN | NNNNNNctgkGTTCaat |
A sequencing read is scanned for master adapter and then, if found, its mate is reverse-complemented to get on the same strand as master read and scanned for slave adapter.
- Slave adapter sequence could be omitted.
- Adaptor sequence could contain any IUPAC DNA letters.
- Upper and lower case letters mark seed and fuzzy-search region parts respectively.
- N characters mark UMI region to be extracted.
- Multiple rows could correspond to the same sample
- In order to be able to run batch pipeline operations, all samples should contain UMI region of the same size
For example, in case S2 Checkout will search for AAGGTT seed exact match, then for the remaining adapter sequence with two mismatches allowed and output the NNNNNN region to header. In case S3 in addition the slave read is scanned for GTTC seed, fuzzy match to the rest of barcode is performed and NNNNNN region is extracted and concatenated with UMI region of master read.
Parameters
General:
-c
compressed output (gzip compression).
-u
perform UMI region extraction and output it to the header of
de-multiplexed FASTQ files
-t
trim adapter sequence from output.
-e
also remove trails of template-switching (poly-G) for the case
when UMI-containing adapter is added using reverse-transcription (cDNA
libraries).
--overlap
will try to overlap reads (paired-end data only),
non-overlapping and overlapping reads will be placed to *_R1/_R2*
and *_R12* FASTQ files respectively. While overlapping the nucleotide
with higher quality will be taken thus improving overall data quality.
--overlap-max-offset X
controls to which extent overlapping region
is searched. IMPORTANT If the read-through extent is high (reads are
embedded) should be set to ~40.
Barcode search:
-o
speed up by assuming that reads are oriented, i.e. master adapter
should be in R1
-r
will apply a custom RC mask. By default it assumes Illumina reads
with mates on different strands, so it reverse-complements read with
slave adapter so that output reads will be on master strand.
--rc-barcodes
also searches for both adapter sequences in reverse
complement. Use it if unsure of your library structure.
--skip-undef
will not store reads that miss adapter sequence to save
drive space.
Note
When there is a huge number of unassigned/unused reads --skip-undef
option
greatly speeds up de-multiplexing. However, take care to carefully investigate
the reasons behind low barcode extraction rate if it is a case.
Important
The --overlap
option may not perform well for poor quality reads, which is
a typical situation for 300+300bp MiSEQ sequencing. In this case, merging reads
using external software after Assemble stage is recommended.
MIG statistics¶
Description
A script to generate consensus coverage statistics, i.e. molecular identifier group (MIG) size distribution.
Usage
General:
java -jar migec.jar Histogram checkout/ histogram/
Running this script will generate several files in histogram folder, the one important for basic data processing is overseq.txt. The header of table contains MIG sizes (in log2 scale), while each row corresponds to a de-multiplexed sample contains the number of reads in MIGs of a given size (cumulative abundance).
For a decent dataset the plot of cumulative abundance display a small peak at MIG size of 1 that could be attributed to erroneous MIGs and has an exponential decline, and a clear peak at MIG size of 10+ containing amplified MIGs. Those erroneous MIGs could arise as experimental artifacts, however the most common reason for their presence is an error event in UMI sequence itself. Note that the latter is only valid when number of distinct UMIs is far lower than theoretically possible UMI diversity (e.g. 4^12 for 12-letter UMI regions)!
MIG size cutoff in Assemble should be set to dissect erroneous MIGs while retaining amplified ones. If peaks overlap collision filtering should be considered.
A simple plotting routine written in R can facilitate visualization of MIG size distributions, available here.
Important
The --only-first-read
option should be used if it is also
specified in Assemble routine for consensus coverage estimates
to be concordant.
Consensus assembly¶
Assemble-batch¶
Description
A script to perform UMI-guided assembly
Usage
General:
java -jar migec.jar AssembleBatch [options] checkout_output_folder/ histogram_output_folder/ output_folder/
Performs a batch assembly for all FASTQ files produced by checkout, all assembly parameters are set according to Histogram output.
One can specify a default mask telling for paired-end reads which
mate(s) to assemble. The mask is provided by
--default-mask <R1=[0,1]:R2=[0,1]>
argument, i.e. to assemble only
second mate use --default-mask 0:1
. This speeds-up the assembly.
Also, by default the mask is 1:1
, so for each MIG an output
consensus pair is created only if both consensuses are successfully
assembled. In case of 0:0
mask will process only overlapped reads.
Remember that during Checkout reads get re-oriented so they are on
the same strand, corresponding to the strand of Master barcode and the
read with Master barcode is assigned with *_R1* index.
A sample metadata file could also be provided with
--sample-metadata <file_name>
argument to guide the batch assembly.
This file should have the following tab-separated table structure:
Sample ID | File type | Mask |
---|---|---|
S0 | paired | 1:0 |
S0 | overlapped | |
S1 | unpaired | |
S2 | paired | 0:1 |
Note that S0 is present with two file types, as when performing read overlap Checkout stores non-overlapped reads in *_R1/_R2* files, which could be then incorporated into data processing.
The --force-overseq X
and --force-collision-filter
will force a
MIG size threshold of X
and filtering of 1-mm UMI collisions for all
samples being processed.
Warning
In most cases, the automatic MIG size threshold selected by Histogram routine is ok. However we strongly recommend manual inspection of Histogram output files and considering to manually specify an appropriate MIG size threshold for input samples. Our experience also shows that it is a good practice to set an identical size threshold for all samples in a batch.
Output format
The assembled consensuses are also stored in FASTQ format, however the read
quality is now proportional to the relative frequency of the most frequent base
at a given position. I.e. if 100% of bases in the position 5 are A
the quality
score will be Phred40 (I
), while if only 27.1% of the bases are A
and other
bases have a frequency of 24.3% the quality will be Phred2 (#
):
@MIG UMI:NNNN:X
ATAGATTATGAGTATG
+
##II#IIIIIIIIIII
The UMI tag is also added to the header, the number X
in the UMI tag field
is the total number of reads that were assembled into a given consensus.
Assemble-manual¶
Description
A script to perform UMI-guided assembly
Usage
General:
java -jar migec.jar Assemble [options] R1.fastq[.gz] [. or R2.fastq[.gz]] output_folder
Unpaired and overlapped FASTQ:
java -jar migec.jar Assemble -c checkout/S1_R0.fastq.gz . assembly/
Paired FASTQ:
java -jar migec.jar Assemble -c checkout/S1_R1.fastq.gz checkout/S1_R2.fastq.gz ./assembly/
Paired FASTQ with only second read to be assembled:
java -jar migec.jar Assemble -c --mask 0:1 checkout/S1_R1.fastq.gz checkout/S1_R2.fastq.gz assembly/
All reads are grouped by their UMI and then read groups (aka molecular identifier groups, MIGs) with >10 reads (default value, see Histogram section for details on setting it) are assembled. Multiple alignment is performed and consensus sequence is generated. Note that for paired reads both consensuses should be successfully assembled, otherwise the pair is dropped.
Automatic output file naming convention is used for compatibility with batch operations. Output file name will be appended with _R0 for unpaired FASTQ file, with either _R1 and _R2 for the corresponding paired FASTQ file and with _R12 for overlapped FASTQ file. Output file name will also include MIG size threshold used.
Settings
The --mask <R1=[0,1]:R2=[0,1]>
parameter indicates FASTQ
files to be assembled in paired-end data. By default both reads are
assembled. In case of 0:0
mask will process only overlapped reads.
The -c
option indicates compressed output.
The -m
option sets minimum number of reads in MIG. This should be
set according to Histogram script output to separate two peaks:
over-sequenced MIGs and erroneous MIGs that cluster around MIG size of
1.
Note
To inspect the effect of such single-mismatch erroneous UMI sub-variants
see “collisions” output of Histogram script. Such collision events could
interfere with real MIGs when over-sequencing is relatively low. In this
case collisions could be filtered during MIG consensus assembly using
--filter-collisions
option in AssembleBatch routine. When using
Assemble routine use --force-collision-filter
command to
turn collision filter on. The child-to-parent ratio for collision filtering
(max allowed size of larger and smaller UMIs that differ by a single mismatch) is
controlled by the --collision-ratio
parameter (default is --collision-ratio 0.1
).
Important
The --only-first-read
option can greatly improve assembly quality
in case of poor second read quality and allows consensus assembly for
asymmetric reads (e.g. 400+200bp sequencing design). If using this option,
don’t forget to set --only-first-read
in Histogram util to correctly
calculate MIG size threshold.
Summary statistics¶
MIG consensus assembly report will be stored in the assemble.log.txt
file which contains basic information such as the fraction of assembled reads and the final number of assembled consensuses. It summarizes both filtering based on MIG size and the consistency of read sequence within the same MIG:
- The
MIGS_*
counters show the number of MIGs that were processed (MIGS_TOTAL
) and were successfully assigned with consensus sequences (MIGS_GOOD_*
). - In case of paired-end sequencing separate statistic is provided for both R1 and R2 (
MIGS_GOOD_FASTQ1
andMIGS_GOOD_FASTQ2
). The total counter (MIGS_GOOD_TOTAL
) reflects number of UMI tags for which both R1 and R2 MIGs were successfully assembled. - The total number of reads in assembled MIGs and all MIGs is provided in
READS_GOOD_*
andREADS_TOTAL
columns respectively.
MIGs can be dropped from the assembly and marked as bad ones for the following reasons:
- MIGs with a size less then the specified size threshold value will be dropped (see
--force-overseq
and-m
options), as well as MIGs that correspond to erroneous UMI variants (see--filter-collisions
option). - Reads that have too many mismatches when compared to the consensus sequence will be dropped, which is reflected by
READS_DROPPED_WITHIN_MIG
statistic. In case a high percentage of reads within MIG is dropped/final MIG size is less than the threshold the entire MIG will be dropped for the analysis.
Note
Additional pre-filtering of UMI tags identified by Checkout utility is performed by removing UMI tag sequences with a minimum Phred quality score below the one specified by the -q
parameter (default is 15
). Thus, the READS_TOTAL
can be somewhat smaller than the total number of reads for a given sample in the checkout.log.txt
V(D)J junction mapping¶
CdrBlast-batch¶
Description
A script to extract CDR3 sequences. Will properly combine reads coming from paired and overlapped data and perform analysis for both raw and assembled data.
Performs CDR3 extraction and V/J segment determination for both raw
(Checkout output) and assembled-data. Gene parameter -R
is
required unless metadata (--sample-metadata
) is provided that
specifies gene for each sample; supported genes are TRA, TRB, TRG,
TRD, IGH, IGK and IGL. If either of assembly_output_folder
or checkout_output_folder is not specified, the processing will be
done solely for the remaining input, this is useful e.g. if one wants
quickly process the assembled data. Otherwise only samples and file
types (paired, overlapped or single) that are present in both outputs
will be used. Processing both raw and assembled data is required for
second stage error correction (removal of hot-spot errors).
Usage
General:
java -jar migec.jar CdrBlastBatch [options] -R gene [checkout_output_folder/ or .] [assemble_output_folder/ or .] output_folder
Several default CdrBlast parameters could be set,
--default-mask <R1=[0,1]:R2=[0,1]>
- mask which specifies for which
read(s) in paired-end data to perform CDR3 extraction. In case of
0:0
mask will process only overlapped reads --default-species
-
default species to be used for all samples, human (used by default) or
mouse --default-file-types
- default file types (paired,
overlapped or single) to be processed for each sample. If several file
types are specified, the corresponding raw and assembled files will be
combined and used as an input to CdrBlast
--default-quality-threshold <Phred=[2..40],CQS=[2..40]>
- quality
threshold pair, default for all samples. First threshold in pair is used
for raw sequence quality (sequencing quality phred) and the second one
is used for assembled sequence quality (CQS score, the fraction of reads
in MIG that contain dominant letter at a given position) --no-sort
-
no sorting is performed for output files which speeds up processing.
Could be safely used in full pipeline as FilterCdrBlastResults will
provide final clonotype table in sorted format
A sample metadata file could also be provided with
--sample-metadata <file_name>
argument to guide the batch CDR3
extraction. This file should have the following tab-separated table
structure:
Sample ID | Species | Gene | File types | Mask | Quality threshold pair |
---|---|---|---|---|---|
S0 | human | TRA | paired, overlapped | 1:0 | 25,30 |
S1 | human | TRB | unpaired | 25,30 | |
S2 | mouse | TRB,TRA | paired | 0:1 | 20,25 |
See section below for more details.
Output format
The output of V-D-J mapping routines of MIGEC is a standard tab-delimited clonotype table with some information on the number of reads and UMI tags that correspond to a given clonotype.
Each clonotype is specified by count, fraction, V, D and J segment identifier list, CDR3 nucleotide and amino acid sequence.
The positions of last V nucleotide, first and last D nucleotide, and first J nucleotide specify the germline region markup within the hypervariable CDR3 sequence, they are given in 0-based coordinates where 0 marks the first base of CDR3.
The total reads
and good reads
fields contain the number of reads
supporting a given clonotype prior to and after the quality filtering.
The total events
and good events
fields contain the number of UMI tags
supporting a given clonotype prior to and after the quality filtering. For raw
(unassembled) data these are equal to total reads
and good reads
respectively.
CdrBlast-manual¶
Description
A script to map V-(D)-J junctions, extract CDR3 sequences and assemble clonotypes.
Usage
General:
java -jar migec.jar CdrBlast [options] -R gene file1.fastq[.gz] [file2.fastq[.gz] ...] output_file
Standard, assuming an example of a library containing T-cell Receptor Alpha Chain sequences
in case of MIG-assembled data:
java -jar migec.jar CdrBlast -a -R TRA assembly/S1_R2.fastq.gz cdrblast/S1_asm.cdrblast.txt
for raw data:
java -jar migec.jar CdrBlast -R TRA checkout/S1_R2.fastq.gz cdrblast/S1_raw.cdrblast.txt
to concatenate and process two or more FASTQ files at once:
java -jar migec.jar CdrBlast -R TRA checkout/S1_R2.fastq.gz checkout/S2_R2.fastq.gz cdrblast/S12_raw.cdrblast.txt
Gene parameter -R
is required, supported genes are TRA, TRB,
TRG, TRD, IGH, IGK and IGL. Several chains can be specified,
for example -R TRA,TRB
or -R IGH,IGL
. Species could be provided with
-S
parameter, by default uses HomoSapiens, supported species are
HomoSapiens, MusMusculus and others. Assembled data should be passed
to the script with -a
option. --same-sample
option should be
used if several assembled files are provided from the same sample, so
duplicate UMIs will be discarded and not counted twice.
To get a sorted output use -o
option, otherwise sorting will be
performed at FilterCdrBlastResults step. Note that both raw and
assembled data should be processed to apply the last step of filtration.
Note
In order to use all alleles, not just the major (*01 ones), use the
--all-alleles
option. To include non-coding segments (V segment
pseudogenes) use the --all-segments
option.
Result filtering¶
FilterCdrBlastResults-batch¶
Description
A script to filter erroneous CDR3 sequences produced due to hot-spot PCR and NGS errors. It can also use a hybrid error correction method that includes frequency-based filtering of singleton clonotypes (i.e. clonotypes represeted by a single MIG).
Usage
java -jar migec.jar FilterCdrBlastResultsBatch [options] cdrblast_batch_folder/ output_folder/
Perform hot-spot error filtration for data process with CdrBlastBatch. Options are the same as for manual version below.
FilterCdrBlastResults-manual¶
Usage
General:
java -jar migec.jar FilterCdrBlastResults [options] cdrblast_result_assembled_data cdrblast_result_raw_data output_file
Example:
java -jar migec.jar FilterCdrBlastResults cdrblast/S1_asm.cdrblast.txt cdrblast/S1_raw.cdrblast.txt final/S1.cdrblast.txt
The -s
option tells to filter CDR3s represented by single MIGs. The
rationale for this is that the deep repertoire profiling (at least with
our protocol) can generate spurious singletons that are associated with
reverse transcription errors and experimental artifacts. Filtering is a
non-greedy procedure and filters single-MIG clonotypes only if a 1- or
2-mismatch parent clonotype exists at ratio 1:20 and 1:400 respectively.
This is done to preserve diversity for samples with shallow sequencing,
e.g. ran on MiSeq.
Other options:
-n
- output non-coding clonotypes that contain either a stop codon or a frameshift within CDR3 sequence.-c
- include non canonical clonotypes that have a CDR3 region that does not start with conserved C residue, or end with a conserved F/W residue.-r
- sets the read accumulation threshold (default is1.0
) used for hot-spot error correction, see MiGEC paper for details.
Now the file S1.cdrblast.txt contains a filtered and sorted CDR3/V/J clonotype table.
MIGEC log structure¶
Below is the description of log files produced by various MIGEC routines.
Checkout
De-multiplexing, barcode extraction and overlapping:
INPUT_FILE_1
first input file containing R1 readsINPUT_FILE_2
second input file containing R2 readsSAMPLE
sample nameMASTER
number of reads where primary (master) barcode was detectedSLAVE
number of reads where secondary (slave) barcode was detectedMASTER+SLAVE
number of reads where both barcodes wereOVERLAPPED
number of succesfully overlapped reads
Histogram
The routine produces a number of histograms for UMI coverage, i.e. statistics of the number of reads tagged with a given UMI:
overseq.txt
contains sample id and sample type (single/paired/overlapped) in the header, followed by UMI coverage (MIG size). Each row has total read counts for UMIs corresponding to a given UMI coverageoverseq-units.txt
same asoverseq.txt
, but lists numbers of unique UMIs, not total read countsestimates.txt
contains sample id, sample type, total number of reads (TOTAL_READS
) and UMIs (TOTAL_MIGS
) in the sample and selected thresholds:OVERSEQ_THRESHOLD
- UMI coverage threshold,COLLISION_THRESHOLD
- if greater or equal toOVERSEQ_THRESHOLD
will search for UMIs that differ by a single mismatch and have a huge count difference and treat them as being the same UMI,UMI_QUAL_THRESHOLD
- threshold for min UMI sequence quality,UMI_LEN
- UMI lengthcollision1.txt
- same asoverseq.txt
, but lists only UMIs that are likely to be erroneous (i.e. have a 1-mismatch UMI neighbour with a substantially higher count)collision1-units.txt
- same ascollision1.txt
, but lists numbers of unique UMIs, not total read countspwm.txt
andpwm-units.txt
- a position weight matrix (PWM) representation of all UMI sequences
Assemble
Statistics of MIG (group of reads tagged with the same UMI) consensus sequence assembly. Note that it also contains summary of pre-filtering steps, e.g. UMIs with low coverage are filtered at this stage:
SAMPLE_ID
sample nameSAMPLE_TYPE
sample type (single/paired/overlapped)INPUT_FASTQ1
first input file containing R1 readsINPUT_FASTQ2
second input file containing R2 readsOUTPUT_ASSEMBLY1
first output file containing R1 consensusesOUTPUT_ASSEMBLY2
second output file containing R2 consensusesMIG_COUNT_THRESHOLD
UMI coverage threshold used in assemble procedureMIGS_GOOD_FASTQ1
number of succesfully assembled consensuses from R1MIGS_GOOD_FASTQ2
same for R2MIGS_GOOD_TOTAL
number of succesfully assembled consensuses that have both R1 and R2 partsMIGS_TOTAL
total number of input UMIs prior to coverage filteringREADS_GOOD_FASTQ1
number of reads in succesfully assembled consensuses from R1READS_GOOD_FASTQ2
same for R2READS_GOOD_TOTAL
number of paired reads in succesfully assembled consensuses that have both R1 and R2 parts. If a given assembled consensus contains inequal number of reads in R1 and R2, an average number is added to this statisticREADS_TOTAL
total number of input reads prior to coverage filteringREADS_DROPPED_WITHIN_MIG_1
number of reads dropped during consensus assembly as they had high number of mismatches to the consensus in R1READS_DROPPED_WITHIN_MIG_2
same for R2MIGS_DROPPED_OVERSEQ_1
number of UMIs dropped due to insufficient coverage in R1MIGS_DROPPED_OVERSEQ_2
same for R2READS_DROPPED_OVERSEQ_1
number of reads in UMIs dropped due to insufficient coverage in R1READS_DROPPED_OVERSEQ_2
same for R2MIGS_DROPPED_COLLISION_1
number of UMIs dropped due to being an erroneous (1-mismatch) variant of some UMI with higher count in R1MIGS_DROPPED_COLLISION_2
same for R2READS_DROPPED_COLLISION_1
number of reads in UMIs dropped due to being an erroneous (1-mismatch) variant of some UMI with higher count in R1READS_DROPPED_COLLISION_2
same for R2
CdrBlast
Statistics of V(D)J mapping with BLAST algorithm:
SAMPLE_ID
sample nameDATA_TYPE
raw reads (raw) or assembled consensuses (asm)OUTPUT_FILE
output file nameINPUT_FILES
list of input filesEVENTS_GOOD
number of MIGs (group of reads tagged with the same UMI, equals to number of reads for raw data) that were V(D)J mapped and passed the quality thresholdEVENTS_MAPPED
number of MIGs that were V(D)J mappedEVENTS_TOTAL
number of input MIGsREADS_GOOD
number of reads that were V(D)J mapped and passed the quality thresholdREADS_MAPPED
number of reads that were V(D)J mappedREADS_TOTAL
number of input reads
FilterCdrBlastResults
Statistics of the second round of TCR/Ig clonotype filtering that considers the number of supporting reads before and after consensus assembly:
SAMPLE_ID
sample nameOUTPUT_FILE
output file nameINPUT_RAW
input file containing CdrBlast results for raw readsINPUT_ASM
input file containing CdrBlast results for assembled consensusesCLONOTYPES_FILTERED
number of clonotypes (unique TCR/Ig V+CDR3 nucleotide+J combinations) that were filteredCLONOTYPES_TOTAL
number of input clonotypesEVENTS_FILTERED
number of MIGs in filtered clonotypesEVENTS_TOTAL
number of input MIGsREADS_FILTERED
number of reads in filtered clonotypesREADS_TOTAL
number of input readsNON_FUNCTIONAL_CLONOTYPES
number of non-functional clonotypes that contain stop codon/frameshift in CDR3NON_FUNCTIONAL_EVENTS
number of MIGs in non-functional clonotypesNON_FUNCTIONAL_READS
number of reads in non-functional clonotypes
MIGEC analysis report¶
You can run the following command to generate
java -jar migec.jar Report [options] [output_path or .]
Command parameters set the checkout (-c
), histogram (-h
),
assemble (-a
), cdrblast (-b
) and final results (-f
) folders. Missing
results will be excluded from the report.
Warning
Works for batch analysis only.
Note
Running this command requires installing dependencies for R markdown compilation, see http://rmarkdown.rstudio.com/
Alternatively, you can install Rstudio and use this Rmd template to manually knit the report HTML.
Post-processing¶
You can additionally build a graph of hypermutations for the sample using
java -jar MIGEC.jar CreateCdrHypermGraph final/S1.cdr3blast.txt net
which will generate files that allow fast network construction using Cytoscape’s network from table and import table routines for further data exploration.
Note that translated CDR3 sequences are obtained by simultaneously
translating codons in two directions: from V and J segments to the
middle of CDR3. If a frameshift is detected, the incomplete codon is
added in lower case, with missing nucleotides marked as ?
; stop
codons are marked by *
. CDR3 that contain either frameshift or stop
codon are non-functional and are filtered by default. To include them
into your output use -n
option.
We also recommend to try out our new VDJtools software suite to perform post analysis of clonotype tables generated by MIGEC.