ssbio: A Framework for Structural Systems Biology¶
Introduction¶
This Python package provides a collection of tools for people with questions in the realm of structural systems biology. The main goals of this package are to:
- Provide an easy way to map hundreds or thousands of genes to their encoded protein sequences and structures
- Directly link protein structures to genome-scale metabolic models
- Demonstrate fully-featured Python scientific analysis environments in Jupyter notebooks
Example questions you can (start to) answer with this package:
- How can I determine the number of protein structures available for my list of genes?
- What is the best, representative structure for my protein?
- Where, in a metabolic network, do these proteins work?
- Where do popular mutations show up on a protein?
- How can I compare the structural features of entire proteomes?
- How do structural properties correlate with my experimental datasets?
- How can I improve the contents of my metabolic model with structural data?
Try it without installing¶
Note
Binder notebooks are still in beta, but they mostly work! Third-party programs are also preinstalled in the Binder notebooks, except for I-TASSER and TMHMM, which are excluded due to licensing restrictions.
Installation¶
First install NGLview using pip and enable its notebook extension, then install ssbio:
pip install nglview
jupyter-nbextension enable nglview --py --sys-prefix
pip install ssbio
Updating¶
pip install ssbio --upgrade
Uninstalling¶
pip uninstall ssbio
Tutorials¶
Check out the Jupyter notebook tutorials for a single Protein or for many proteins in a GEM-PRO model. See a list of all Tutorials.
Citation¶
The manuscript for the ssbio package can be found and cited at [1].
[1] Mih N, Brunk E, Chen K, Catoiu E, Sastry A, Kavvas E, Monk JM, Zhang Z, Palsson BO. 2018. ssbio: A Python Framework for Structural Systems Biology. Bioinformatics. https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty077/4850940.
Table of Contents¶
Getting Started¶
Introduction¶
This section will give a quick outline of the design of ssbio and the scientific topics behind it. If you would like to read a pre-print version of the manuscript, please see [1].

Overview of the design and functionality of ssbio. Underlined fixed-width text in blue indicates added functionality to COBRApy for a genome-scale model loaded using ssbio. A) A simplified schematic showing the addition of a Protein to the core objects of COBRApy (fixed-width text in gray). A gene is directly associated with a protein, which can act as a monomeric enzyme or form an active complex with itself or other proteins (the asterisk denotes that methods for complexes are currently under development). B) Summary of properties and functions available for a protein sequence and structure. C) Uses of a GEM-PRO, from the bottom-up and the top-down. Once all protein sequences and structures are mapped to a genome-scale model, the resulting GEM-PRO has uses in multiple areas of study.
The basics¶
ssbio was developed with simplicity in mind: we wanted to make it as easy as possible to work with protein sequences and structures. We also didn’t want to reinvent the wheel, so systems models are treated as a direct extension of COBRApy, and Biopython classes and modules are used wherever possible. To best explain the utility of the package, we will outline its features from two different viewpoints: as a systems biologist used to looking at the “big picture”, and as a structural biologist for whom the “devil is in the details”.
From a systems perspective¶
Systems biology is broadly concerned with the modeling and understanding of complex biological systems. What you may be taught in biochemistry 101 at this level will usually be reflected in a kind of interaction map, such as the metabolic map shown here:

A “metabolic metro map”. By Dctrzl, changed work of Chakazul [CC BY-SA 4.0], via Wikimedia Commons
This map details the reactions needed to sustain the metabolic function of a cell. Typically, nodes will represent enzymes, and edges the metabolites they act upon (this is reversed in some graphical representations). There can be hundreds or thousands of reactions being modeled at once, in silico. These models can be stored in a single file, in a format such as the Systems Biology Markup Language (SBML). ssbio can load SBML models, and so far we have mainly used it in the further annotation of genome-scale metabolic models, or GEMs. The goal of GEMs is to provide a comprehensive annotation of all the metabolic enzymes encoded within a genome, along with generating a computable model (such as at a steady state, using constraint-based modeling methods, a.k.a. COBRA). That brings us to our first class: the GEMPRO object.
The objectives of the GEM-PRO pipeline (genome-scale models integrated with protein structures) have previously been detailed [2]. A GEM-PRO directly integrates structural information within a curated GEM, and streamlines identifier mapping, representative object selection, and property calculation for a set of proteins. The pipeline provided in ssbio functions with an input of a GEM (or any other kind of network model that can be loaded with COBRApy), but if this is unfamiliar to you, do not fret! A GEM-PRO can be built simply from a list of gene/protein IDs, and can simply be treated as a way to easily analyze a large number of proteins at once.
See GEMPRO for a detailed explanation of this object, Jupyter notebook tutorials of prospective use cases, and an explanation of select functions.
From a structures perspective¶
Structural biology is broadly concerned with elucidating and understanding the structure and function of proteins and other macromolecules. Ribbons, molecules, and chemical interactions are the name of the game here:

A protein undergoing conformational changes. By Thomas Shafee (Own work) [CC BY 4.0], via Wikimedia Commons
An abundance of information is stored within structural data, and we believe that it should not be ignored even when looking at thousands of proteins at once within a systems model. To that end, the Protein object aims to integrate analyses on the level of a single protein’s sequence (and related sequences) along with its available structures.
A Protein is representative of its associated gene’s translated polypeptide chain (in other words, we are only considering monomers at this point). The object holds related amino acid sequences and structures, allowing for a single representative sequence and structure to be set from these. Multiple available structures such as those from PDB or homology models can be subjected to QC/QA based on set cutoffs such as sequence coverage and X-ray resolution. Proteins with no structures available can be prepared for homology modeling through the I-TASSER platform. Biopython representations of sequences (SeqRecord objects) and structures (Structure objects) are utilized to allow access to analysis functions available for their respective objects.
See Protein for a detailed explanation of this object, Jupyter notebook tutorials of prospective use cases, and an explanation of select functions.
Modules & submodules¶
ssbio is organized into the following submodules for defined purposes. Please see the Python API for function documentation.
- ssbio.databases: modules that heavily depend on the Bioservices package [3] and custom code to enable pulling information from web services such as UniProt, KEGG, and the PDB, and to directly convert that information into sequence and structure objects to load into a protein.
- ssbio.protein.sequence: modules which allow a user to execute and parse sequence-based utilities such as sequence alignment algorithms or structural feature predictors.
- ssbio.protein.structure: modules that mirror the sequence module but instead work with structural information to calculate properties, and also to streamline the generation of homology models as well as to prepare structures for molecular modeling tools such as docking or molecular dynamics.
- ssbio.pipeline.gempro: a pipeline that simplifies the execution of these tools per protein while placing them into the context of a genome-scale model.
References¶
[1] Mih N, Brunk E, Chen K, Catoiu E, Sastry A, Kavvas E, et al. ssbio: A Python Framework for Structural Systems Biology. bioRxiv. 2017. p. 165506. doi:10.1101/165506
[2] Brunk E, Mih N, Monk J, Zhang Z, O’Brien EJ, Bliven SE, et al. Systems biology of the structural proteome. BMC Syst Biol. 2016;10: 26. doi:10.1186/s12918-016-0271-6
[3] Cokelaer T, Pultz D, Harder LM, Serra-Musach J, Saez-Rodriguez J. BioServices: a common Python package to access biological Web Services programmatically. Bioinformatics. 2013;29(24): 3241–2. doi:10.1093/bioinformatics/btt547
The GEM-PRO Pipeline¶
Introduction¶
The GEM-PRO pipeline is focused on annotating genome-scale models with protein structure information. Any SBML model can be used as input to the pipeline, although a model is not required. Here are the possible starting points for using the pipeline:
- A model in SBML (.sbml, .xml) or MATLAB (.mat) format
- A list of gene IDs (['b0001', 'b0002', ...])
- A dictionary of gene IDs and their sequences ({'b0001':'MSAVEVEEAP..', 'b0002':'AERAPLS', ...})
A GEM-PRO object can be thought of at a high level as simply an annotation project. Creating a new project with any of the above starting points will create a new folder into which protein sequences and structures will be downloaded.
Tutorials¶
GEM-PRO - Calculating Protein Properties¶
This notebook gives an example of how to calculate protein properties for a list of proteins. The main features demonstrated are:
- Information retrieval from UniProt and linking residue numbering sites to structure
- Calculating or predicting global protein sequence and structure properties
- Calculating or predicting local protein sequence and structure properties
Imports¶
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.
- CRITICAL - Only really important messages shown
- ERROR - Major errors
- WARNING - Warnings that don’t affect running of the pipeline
- INFO (default) - Info such as the number of structures mapped per gene
- DEBUG - Really detailed information that will print out a lot of stuff
DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
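To see what the level setting changes in practice, here is a minimal standard-library sketch. It does not call ssbio; the logger name and the three messages are made up for illustration, loosely modeled on the pipeline output shown in this tutorial.

```python
import io
import logging


def capture(level):
    """Return the log lines that survive when the logger is set to `level`."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
    demo = logging.getLogger('ssbio_logging_demo')  # hypothetical logger name
    demo.propagate = False       # keep the demo output out of the root logger
    demo.handlers = [handler]    # replace any handlers from a previous call
    demo.setLevel(level)
    demo.debug('per-residue details')           # only shown at DEBUG
    demo.info('2/2: number of genes mapped')    # shown at INFO and below
    demo.warning('Empty dataframe')             # shown at WARNING and below
    return stream.getvalue().splitlines()


info_lines = capture(logging.INFO)
debug_lines = capture(logging.DEBUG)
print(info_lines)
print(debug_lines)
```

At INFO the debug message is dropped, while at DEBUG all three lines are emitted, which is why DEBUG output grows quickly with the number of genes.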
Initialization¶
Set these three things:
- ROOT_DIR - The directory where a folder named after your PROJECT will be created
- PROJECT - Your project name
- LIST_OF_GENES - Your list of gene IDs
A directory will be created in ROOT_DIR with your PROJECT name.
The folders are organized like so:
ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
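If you want to locate a gene's files by hand, the layout above can be turned into paths with a few lines of standard-library code. This is only a sketch of the folder convention described here: `protein_dirs` is a hypothetical helper, not an ssbio function, and it creates nothing on disk.

```python
import tempfile
from pathlib import Path


def protein_dirs(root_dir, project, gene_id):
    """Return the (sequences, structures) folders for one gene in the layout above."""
    base = Path(root_dir) / project / 'genes' / gene_id / 'protein'
    return base / 'sequences', base / 'structures'


# Same ROOT_DIR and PROJECT values as used in this tutorial
seq_dir, struct_dir = protein_dirs(tempfile.gettempdir(), 'ssbio_protein_properties', 'b0118')
print(seq_dir)
print(struct_dir)
```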
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROJECT = 'ssbio_protein_properties'
LIST_OF_GENES = ['b1276', 'b0118']
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type='pdb')
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: /tmp/ssbio_protein_properties: GEM-PRO project location
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2: number of genes
Mapping gene ID –> sequence¶
First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.
GEMPRO.uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)
Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.
Parameters:
- model_gene_source (str) – the database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:
    - Ensembl Genomes - ENSEMBLGENOME_ID (i.e. E. coli b-numbers)
    - Entrez Gene (GeneID) - P_ENTREZGENEID
    - RefSeq Protein - P_REFSEQ_AC
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
In [8]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2018-02-05 16:52] [root] INFO: getUserAgent: Begin
[2018-02-05 16:52] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 16:52] [root] INFO: getUserAgent: End
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2/2: number of genes mapped to UniProt
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Missing UniProt mapping: []
Out[8]:
gene | uniprot | reviewed | gene_name | kegg | refseq | pdbs | pfam | description | entry_date | entry_version | seq_date | seq_version | sequence_file | metadata_file
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
b0118 | P36683 | False | acnB | ecj:JW0114;eco:b0118 | NP_414660.1;WP_001307570.1 | 1L5J | PF00330;PF06434;PF11791 | Aconitate hydratase B | 2018-01-31 | 165 | 1997-11-01 | 3 | P36683.fasta | P36683.xml
b1276 | P25516 | False | acnA | ecj:JW1268;eco:b1276 | NP_415792.1;WP_000099535.1 | NaN | PF00330;PF00694 | Aconitate hydratase A | 2018-01-31 | 153 | 2008-01-15 | 3 | P25516.fasta | P25516.xml
GEMPRO.set_representative_sequence(force_rerun=False)
Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.
Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
Parameters:
- force_rerun (bool) – Set to True to recheck stored sequences
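The precedence rule described for set_representative_sequence can be written out as a small decision function. This is an illustration of the rule as stated in the text, not ssbio's actual implementation; the function name and the dictionary shape (`id`, `pdbs`) are assumptions made for the example.

```python
def pick_representative(manual=None, uniprot=None, kegg=None):
    """Choose one sequence record following the stated precedence.

    Each argument is None or a dict such as {'id': ..., 'pdbs': [...]}.
    """
    # A manually set sequence overrides every other mapping
    if manual is not None:
        return manual
    if uniprot is not None and kegg is not None:
        # KEGG wins only when it has PDBs attached and UniProt has none
        if kegg.get('pdbs') and not uniprot.get('pdbs'):
            return kegg
        return uniprot
    # Otherwise use whichever mapping exists (may be None)
    return uniprot if uniprot is not None else kegg


uniprot_rec = {'id': 'P25516', 'pdbs': []}
kegg_rec = {'id': 'eco:b1276', 'pdbs': []}
choice = pick_representative(uniprot=uniprot_rec, kegg=kegg_rec)
print(choice['id'])
```

With neither record carrying PDBs, the UniProt record is chosen, which matches the b1276 row in the tutorial output.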
In [9]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative sequence
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.
Missing a representative sequence: []
Out[9]:
gene | uniprot | kegg | pdbs | sequence_file | metadata_file
---|---|---|---|---|---
b0118 | P36683 | ecj:JW0114;eco:b0118 | 1L5J | P36683.fasta | P36683.xml
b1276 | P25516 | ecj:JW1268;eco:b1276 | NaN | P25516.fasta | P25516.xml
Mapping representative sequence –> structure¶
These are the ways to map sequence to structure:
1. Use the UniProt ID and its automatic mappings to the PDB
2. BLAST the sequence to the PDB
3. Make homology models or
4. Map to existing homology models
You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.
GEMPRO.map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)
Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.
The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html. It returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the same, resolution.
Parameters:
- seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
- outdir (str) – Output directory to cache JSON results of search
- force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns: A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type: list
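The ordering described for the “Best structures” results (coverage first, then resolution as a tie-breaker) can be sketched with a plain sort. The assumptions here are that higher coverage ranks first and that a lower resolution value (in Angstroms) is better; the function name and the placeholder IDs are hypothetical, and nothing in this block queries PDBe.

```python
def rank_by_coverage_then_resolution(hits):
    """Sort hits by coverage (descending), then resolution (ascending), and number them."""
    ordered = sorted(hits, key=lambda h: (-h['coverage'], h['resolution']))
    return [dict(h, rank=i) for i, h in enumerate(ordered, start=1)]


# Placeholder entries, not real PDB records
example_hits = [
    {'pdb_id': 'aaaa', 'coverage': 0.8, 'resolution': 1.5},
    {'pdb_id': 'bbbb', 'coverage': 1.0, 'resolution': 2.4},
    {'pdb_id': 'cccc', 'coverage': 1.0, 'resolution': 1.9},
]
ranked = rank_by_coverage_then_resolution(example_hits)
for hit in ranked:
    print(hit['rank'], hit['pdb_id'])
```

Note that the sharper structure `aaaa` still ranks last because coverage is compared before resolution.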
In [10]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 16:52] [root] INFO: getUserAgent: Begin
[2018-02-05 16:52] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 16:52] [root] INFO: getUserAgent: End
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 1/2: number of genes with at least one experimental structure
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
Out[10]:
gene | pdb_id | pdb_chain_id | uniprot | experimental_method | resolution | coverage | start | end | unp_start | unp_end | rank
---|---|---|---|---|---|---|---|---|---|---|---
b0118 | 1l5j | A | P36683 | X-ray diffraction | 2.4 | 1 | 1 | 865 | 1 | 865 | 1
b0118 | 1l5j | B | P36683 | X-ray diffraction | 2.4 | 1 | 1 | 865 | 1 | 865 | 2
GEMPRO.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)
BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).
Parameters:
- seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
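To make the effect of the two cutoffs concrete, the sketch below filters a list of hits the way such thresholds are commonly applied: keep a hit only if its identity is at or above the cutoff and its E-value is at or below the cutoff. Treating seq_ident_cutoff as a sequence identity threshold is an assumption, as are the function name, the field names, and the hit values; none of this is taken from ssbio's code.

```python
def filter_blast_hits(hits, seq_ident_cutoff=0.0, evalue=0.0001):
    """Keep hits with identity >= seq_ident_cutoff and E-value <= evalue."""
    return [
        h for h in hits
        if h['seq_ident'] >= seq_ident_cutoff and h['evalue'] <= evalue
    ]


# Placeholder hits, not real BLAST output
example_hits = [
    {'pdb_id': 'hit1', 'seq_ident': 0.95, 'evalue': 1e-50},
    {'pdb_id': 'hit2', 'seq_ident': 0.40, 'evalue': 1e-20},  # fails the identity cutoff
    {'pdb_id': 'hit3', 'seq_ident': 0.80, 'evalue': 1e-3},   # fails the E-value cutoff
]
# Same cutoff values as the tutorial call
kept = filter_blast_hits(example_hits, seq_ident_cutoff=0.7, evalue=0.00001)
print([h['pdb_id'] for h in kept])
```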
In [11]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.7, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2018-02-05 16:53] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 16:53] [ssbio.pipeline.gempro] INFO: 0: number of genes with additional structures added from BLAST
[2018-02-05 16:53] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[11]:
Below, we are mapping to previously generated homology models for E. coli. If you are running this as a tutorial, they won’t exist on your computer, so you can skip these steps.
GEMPRO.get_manual_homology_models(input_dict, outdir=None, clean=True, force_rerun=False)
Copy homology models to the GEM-PRO project.
Requires an input of a dictionary formatted like so:
    {
        model_gene: {
            homology_model_id1: {
                'model_file': '/path/to/homology/model.pdb',
                'file_type': 'pdb',
                'additional_info': info_value
            },
            homology_model_id2: {
                'model_file': '/path/to/homology/model.pdb',
                'file_type': 'pdb'
            }
        }
    }
Parameters:
- input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- clean (bool) – If homology files should be cleaned and saved as a new PDB file
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
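A dictionary in this nested shape can be assembled from plain (gene, model ID, filename) records without any extra libraries. The sketch below is one way to do it; the helper name, the model IDs, and the paths are placeholders. It uses `setdefault` so that several models for the same gene are kept side by side rather than overwriting one another.

```python
import os.path as op


def build_homology_input(rows, models_dir):
    """Build {gene: {model_id: {'model_file': ..., 'file_type': 'pdb'}}} from tuples."""
    input_dict = {}
    for gene, model_id, filename in rows:
        # setdefault keeps earlier models for the same gene instead of replacing them
        input_dict.setdefault(gene, {})[model_id] = {
            'model_file': op.join(models_dir, filename),
            'file_type': 'pdb',
        }
    return input_dict


# Placeholder records; substitute your own model IDs and file names
rows = [
    ('b1276', 'model_A', 'model_A.pdb'),
    ('b1276', 'model_B', 'model_B.pdb'),
    ('b0118', 'model_C', 'model_C.pdb'),
]
homology_input = build_homology_input(rows, op.join('path', 'to', 'homology'))
print(sorted(homology_input['b1276']))
```

The resulting dictionary is what would be passed as input_dict.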
In [15]:
import pandas as pd
import os.path as op
In [16]:
# Creating manual mapping dictionary for ECOLI I-TASSER models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/zhang/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/zhang_data/160804-ZHANG_INFO.csv')
tmp = homology_models_df[['zhang_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]
homology_model_dict = {}
for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['zhang_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                        'file_type':'pdb'}}
my_gempro.get_manual_homology_models(homology_model_dict)
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.
In [17]:
# Creating manual mapping dictionary for ECOLI SUNPRO models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/sunpro/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/sunpro_data/160609-SUNPRO_INFO.csv')
tmp = homology_models_df[['sunpro_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]
homology_model_dict = {}
for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['sunpro_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                         'file_type':'pdb'}}
my_gempro.get_manual_homology_models(homology_model_dict)
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.
Downloading and ranking structures¶
GEMPRO.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)
Download ALL mapped experimental structures to each protein’s structures directory.
Parameters:
- outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
In [18]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Saved 1 structures total
Out[18]:
gene | pdb_id | pdb_title | description | experimental_method | mapped_chains | resolution | chemicals | taxonomy_name | structure_file
---|---|---|---|---|---|---|---|---|---
b0118 | 1l5j | CRYSTAL STRUCTURE OF E. COLI ACONITASE B. | Aconitate hydratase 2 (E.C.4.2.1.3) | X-ray diffraction | A;B | 2.4 | F3S;TRA | Escherichia coli | 1l5j.pdb
GEMPRO.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)
Set the representative structure for each protein from a structure in its structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.
Parameters:
- seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
- clean (bool) – If structures should be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
Todo
- Remedy large structure representative setting
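The allow_missing_on_termini example (0.1 on a 100 AA sequence leaves residues 5 to 95 to be checked) amounts to splitting the ignored fraction evenly between the two termini. A small sketch of that arithmetic follows; the helper name is hypothetical, and truncating the margin to a whole residue is an assumption rather than something the documentation specifies.

```python
def checked_range(seq_length, allow_missing_on_termini):
    """Return (first, last) residue positions still checked for modifications."""
    # Half of the ignored fraction is taken from each terminus
    margin = int(seq_length * allow_missing_on_termini / 2)
    return margin, seq_length - margin


first, last = checked_range(100, 0.1)
print(first, last)
```

For the documented example this gives 5 and 95. With the default of 0.2, a fifth of the sequence length is ignored in total, a tenth at each end.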
In [19]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[19]:
gene | id | is_experimental | file_type | structure_file
---|---|---|---|---
b0118 | REP-1l5j | True | pdb | 1l5j-A_clean.pdb
b1276 | REP-ACON1_ECOLI | False | pdb | ACON1_ECOLI_model1_clean-X_clean.pdb
Computing and storing protein properties¶
GEMPRO.get_sequence_properties(representatives_only=True)
Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of all protein sequences. Results are stored in the protein’s respective SeqProp objects at .annotations
Parameters:
- representatives_only (bool) – If analysis should only be run on the representative sequences
In [20]:
# Requires EMBOSS "pepstats" program
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
# Install using:
# sudo apt-get install emboss
my_gempro.get_sequence_properties()
GEMPRO.get_scratch_predictions(path_to_scratch, results_dir, scratch_basename='scratch', num_cores=1, exposed_buried_cutoff=25, custom_gene_mapping=None)
Run and parse SCRATCH results to predict secondary structure and solvent accessibility. Annotations are stored in the protein’s representative sequence at:
- .annotations
- .letter_annotations
Parameters:
- path_to_scratch (str) – Path to SCRATCH executable
- results_dir (str) – Path to SCRATCH results folder, which will have the files (scratch.ss, scratch.ss8, scratch.acc, scratch.acc20)
- scratch_basename (str) – Basename of the SCRATCH results (‘scratch’ is default)
- num_cores (int) – Number of cores to use to parallelize SCRATCH run
- exposed_buried_cutoff (int) – Cutoff of exposed/buried for the acc20 predictions
- custom_gene_mapping (dict) – Default parsing of SCRATCH output files is to look for the model gene IDs. If your output files contain IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
In [ ]:
# Requires SCRATCH installation, replace path_to_scratch with own path to script
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_scratch_predictions(path_to_scratch='scratch',
results_dir=my_gempro.data_dir,
num_cores=4)
GEMPRO.find_disulfide_bridges(representatives_only=True)
Run Biopython’s disulfide bridge finder and store found bridges.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.annotations['SSBOND-biopython']
Parameters:
- representatives_only (bool) – If analysis should only be run on the representative structure
In [22]:
my_gempro.find_disulfide_bridges(representatives_only=False)
GEMPRO.get_dssp_annotations(representatives_only=True, force_rerun=False)
Run DSSP on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-dssp']
Parameters:
- representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
In [23]:
# Requires DSSP installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_dssp_annotations()
GEMPRO.get_msms_annotations(representatives_only=True, force_rerun=False)
Run MSMS on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-msms']
Parameters:
- representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
In [24]:
# Requires MSMS installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_msms_annotations()
Additional annotations¶
“Features” are currently loaded directly from UniProt, but if another feature file is available for each protein, it can be loaded manually.
In [25]:
# for g in my_gempro.genes_with_a_representative_sequence:
#     g.protein.representative_sequence.feature_path = '/path/to/new/feature/file.gff'
Additional global or local properties can be loaded after loading the saved GEM-PRO.
Make sure to add 'seq_hydrophobicity-kd' to the list of columns to be returned later on!
In [26]:
# Kyte-Doolittle scale for hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }
In [27]:
# Use Biopython to calculate hydrophobicity using a set sliding window length
from Bio.SeqUtils.ProtParam import ProteinAnalysis
window = 7
for g in my_gempro.genes_with_a_representative_sequence:
    # Create a ProteinAnalysis object -- see http://biopython.org/wiki/ProtParam
    my_seq = g.protein.representative_sequence.seq_str
    analysed_seq = ProteinAnalysis(my_seq)
    # Calculate scale
    hydrophobicity = analysed_seq.protein_scale(param_dict=kd, window=window)
    # Correct list length by prepending and appending "inf" (result needs to be same length as sequence)
    for i in range(window//2):
        hydrophobicity.insert(0, float("Inf"))
        hydrophobicity.append(float("Inf"))
    # Add new annotation to the representative sequence's "letter_annotations" dictionary
    g.protein.representative_sequence.letter_annotations['hydrophobicity-kd'] = hydrophobicity
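The length correction in the cell above can be checked without Biopython. The following is a minimal sketch, assuming a plain unweighted window mean; the helper names sliding_window_mean and pad_to_sequence_length are hypothetical and are not part of ssbio or Biopython. A window of length w over a sequence of length n yields n - w + 1 values, so padding w // 2 values on each side restores length n when w is odd.

```python
# Kyte-Doolittle hydrophobicity scale (same values as the cell above)
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
      'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
      'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
      'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2}


def sliding_window_mean(seq, scale, window):
    """Hypothetical helper: unweighted mean of the scale values in each window."""
    values = [scale[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pad_to_sequence_length(profile, window, fill=float("inf")):
    """Pad window // 2 fill values on each side (exact for odd window lengths)."""
    pad = [fill] * (window // 2)
    return pad + list(profile) + pad


seq = "MIDLRSDTVTRPSRAMLEAM"  # arbitrary 20-residue peptide, for illustration only
window = 7
profile = sliding_window_mean(seq, KD, window)
padded = pad_to_sequence_length(profile, window)
print(len(seq), len(profile), len(padded))
```

The padded list has one entry per residue, which is what a letter_annotations entry requires.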
Global protein properties¶
Properties of the entire protein sequence/structure are stored in:
- The representative_sequence annotations field
- The representative_structure’s representative chain SeqRecord
These properties describe aspects of the entire protein, such as its molecular weight, the percentage of amino acids in a particular secondary structure, etc.
In [28]:
# Printing all global protein properties
from pprint import pprint
# Only looking at 2 genes for now, remove [:2] to gather properties for all
for g in my_gempro.genes_with_a_representative_sequence[:2]:
    repseq = g.protein.representative_sequence
    repstruct = g.protein.representative_structure
    repchain = g.protein.representative_chain
    print('Gene: {}'.format(g.id))
    print('Number of structures: {}'.format(g.protein.num_structures))
    print('Representative sequence: {}'.format(repseq.id))
    print('Representative structure: {}'.format(repstruct.id))
    print('----------------------------------------------------------------')
    print('Global properties of the representative sequence:')
    pprint(repseq.annotations)
    print('----------------------------------------------------------------')
    print('Global properties of the representative structure:')
    pprint(repstruct.chains.get_by_id(repchain).seq_record.annotations)
    print('****************************************************************')
    print('****************************************************************')
    print('****************************************************************')
Gene: b1276
Number of structures: 3
Representative sequence: P25516
Representative structure: REP-ACON1_ECOLI
----------------------------------------------------------------
Global properties of the representative sequence:
{'amino_acids_percent-biop': {'A': 0.08641975308641975,
'C': 0.007856341189674524,
'D': 0.06397306397306397,
'E': 0.06172839506172839,
'F': 0.025813692480359147,
'G': 0.08754208754208755,
'H': 0.020202020202020204,
'I': 0.04826038159371493,
'K': 0.04826038159371493,
'L': 0.09427609427609428,
'M': 0.028058361391694726,
'N': 0.037037037037037035,
'P': 0.05611672278338945,
'Q': 0.030303030303030304,
'R': 0.05723905723905724,
'S': 0.05723905723905724,
'T': 0.06060606060606061,
'V': 0.0819304152637486,
'W': 0.014590347923681257,
'Y': 0.03254769921436588},
'aromaticity-biop': 0.07295173961840629,
'instability_index-biop': 36.28239057239071,
'isoelectric_point-biop': 5.59344482421875,
'molecular_weight-biop': 97676.06830000057,
'monoisotopic-biop': False,
'percent_acidic-pepstats': 0.1257,
'percent_aliphatic-pepstats': 0.31089,
'percent_aromatic-pepstats': 0.09315,
'percent_basic-pepstats': 0.1257,
'percent_charged-pepstats': 0.2514,
'percent_helix_naive-biop': 0.29741863075196406,
'percent_non-polar-pepstats': 0.56341,
'percent_polar-pepstats': 0.43659,
'percent_small-pepstats': 0.53872,
'percent_strand_naive-biop': 0.27048260381593714,
'percent_tiny-pepstats': 0.29966000000000004,
'percent_turn_naive-biop': 0.2379349046015713}
----------------------------------------------------------------
Global properties of the representative structure:
{'percent_B-dssp': 0.010101010101010102,
'percent_C-dssp': 0.2222222222222222,
'percent_E-dssp': 0.1739618406285073,
'percent_G-dssp': 0.03928170594837262,
'percent_H-dssp': 0.345679012345679,
'percent_I-dssp': 0.005611672278338945,
'percent_S-dssp': 0.09427609427609428,
'percent_T-dssp': 0.10886644219977554}
****************************************************************
****************************************************************
****************************************************************
Gene: b0118
Number of structures: 4
Representative sequence: P36683
Representative structure: REP-1l5j
----------------------------------------------------------------
Global properties of the representative sequence:
{'amino_acids_percent-biop': {'A': 0.11213872832369942,
'C': 0.011560693641618497,
'D': 0.06358381502890173,
'E': 0.06589595375722543,
'F': 0.03352601156069364,
'G': 0.08786127167630058,
'H': 0.017341040462427744,
'I': 0.05433526011560694,
'K': 0.056647398843930635,
'L': 0.10173410404624278,
'M': 0.026589595375722544,
'N': 0.035838150289017344,
'P': 0.06242774566473988,
'Q': 0.028901734104046242,
'R': 0.04508670520231214,
'S': 0.04161849710982659,
'T': 0.05433526011560694,
'V': 0.06705202312138728,
'W': 0.009248554913294798,
'Y': 0.024277456647398842},
'aromaticity-biop': 0.06705202312138728,
'instability_index-biop': 32.79631213872841,
'isoelectric_point-biop': 5.23931884765625,
'molecular_weight-biop': 93497.01500000065,
'monoisotopic-biop': False,
'percent_acidic-pepstats': 0.12948,
'percent_aliphatic-pepstats': 0.33526000000000006,
'percent_aromatic-pepstats': 0.08439,
'percent_basic-pepstats': 0.11907999999999999,
'percent_charged-pepstats': 0.24855,
'percent_helix_naive-biop': 0.29017341040462424,
'percent_non-polar-pepstats': 0.59075,
'percent_polar-pepstats': 0.40924999999999995,
'percent_small-pepstats': 0.53642,
'percent_strand_naive-biop': 0.3063583815028902,
'percent_tiny-pepstats': 0.30751,
'percent_turn_naive-biop': 0.22774566473988442}
----------------------------------------------------------------
Global properties of the representative structure:
{'percent_B-dssp': 0.016241299303944315,
'percent_C-dssp': 0.20765661252900233,
'percent_E-dssp': 0.14037122969837587,
'percent_G-dssp': 0.03480278422273782,
'percent_H-dssp': 0.3805104408352668,
'percent_I-dssp': 0.0,
'percent_S-dssp': 0.08236658932714618,
'percent_T-dssp': 0.13805104408352667}
****************************************************************
****************************************************************
****************************************************************
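Some of the sequence-level values printed above are simple functions of the amino acid counts. As a minimal sketch in plain Python (not the Biopython code that produces the "-biop" annotations), the per-residue fractions are counts divided by the sequence length, and aromaticity is the summed fraction of F, W, and Y. For b1276 above, 0.0258 + 0.0146 + 0.0325 ≈ 0.0730, which agrees with the reported aromaticity-biop value. The function names and the toy peptide are hypothetical.

```python
from collections import Counter


def amino_acid_fractions(seq):
    """Fraction of each amino acid present in the sequence."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def aromaticity(seq):
    """Summed fraction of the aromatic residues F, W and Y."""
    fractions = amino_acid_fractions(seq)
    return sum(fractions.get(aa, 0.0) for aa in "FWY")


toy_seq = "MFWYAAAAAA"  # hypothetical 10-residue peptide, for illustration only
print(amino_acid_fractions(toy_seq)["A"])  # alanine is 6 of 10 residues
print(aromaticity(toy_seq))                # F, W, Y are 3 of 10 residues
```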
Local protein properties¶
Properties of specific residues are stored in:
- The representative_sequence’s letter_annotations attribute
- The representative_structure’s representative chain SeqRecord
Specific sites, like metal or metabolite binding sites, can be found in the representative_sequence’s features attribute. This information is retrieved from UniProt. The examples below extract features for the metal binding sites.
The properties related to those sites can be retrieved using the function get_residue_annotations.
UniProt contains more information than just “sites”.
In [29]:
# Looking at all features
for g in my_gempro.genes_with_a_representative_sequence[:2]:
    g.id
    # UniProt features
    [x for x in g.protein.representative_sequence.features]
    # Catalytic site atlas features
    for s in g.protein.structures:
        if s.structure_file:
            for c in s.mapped_chains:
                if s.chains.get_by_id(c).seq_record:
                    if s.chains.get_by_id(c).seq_record.features:
                        [x for x in s.chains.get_by_id(c).seq_record.features]
Out[29]:
'b1276'
Out[29]:
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1)), type='initiator methionine'),
SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(891)), type='chain', id='PRO_0000076661'),
SeqFeature(FeatureLocation(ExactPosition(434), ExactPosition(435)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(500), ExactPosition(501)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(503), ExactPosition(504)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(521), ExactPosition(522)), type='sequence conflict')]
Out[29]:
'b0118'
Out[29]:
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(865)), type='chain', id='PRO_0000076675'),
SeqFeature(FeatureLocation(ExactPosition(243), ExactPosition(246)), type='region of interest'),
SeqFeature(FeatureLocation(ExactPosition(413), ExactPosition(416)), type='region of interest'),
SeqFeature(FeatureLocation(ExactPosition(709), ExactPosition(710)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(771), ExactPosition(772)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(190), ExactPosition(191)), type='binding site'),
SeqFeature(FeatureLocation(ExactPosition(497), ExactPosition(498)), type='binding site'),
SeqFeature(FeatureLocation(ExactPosition(790), ExactPosition(791)), type='binding site'),
SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(796)), type='binding site'),
SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='mutagenesis site'),
SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(14)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(35)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(41), ExactPosition(51)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(58), ExactPosition(72)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(82), ExactPosition(90)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(98), ExactPosition(104)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(104), ExactPosition(107)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(108), ExactPosition(111)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(111), ExactPosition(120)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(127), ExactPosition(137)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(140), ExactPosition(151)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(153), ExactPosition(157)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(165), ExactPosition(178)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(178), ExactPosition(182)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(184), ExactPosition(190)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(193), ExactPosition(197)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(197), ExactPosition(200)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(213), ExactPosition(216)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(219), ExactPosition(227)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(232), ExactPosition(243)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(247), ExactPosition(257)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(257), ExactPosition(261)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(263), ExactPosition(269)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(271), ExactPosition(278)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(279), ExactPosition(288)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(291), ExactPosition(295)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(305), ExactPosition(310)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(310), ExactPosition(314)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(314), ExactPosition(318)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(318), ExactPosition(321)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(323), ExactPosition(327)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(333), ExactPosition(341)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(343), ExactPosition(360)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(383), ExactPosition(391)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(391), ExactPosition(394)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(408), ExactPosition(413)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(414), ExactPosition(417)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(417), ExactPosition(427)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(437), ExactPosition(440)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(443), ExactPosition(448)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(450), ExactPosition(465)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(465), ExactPosition(468)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(478), ExactPosition(483)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(483), ExactPosition(486)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(491), ExactPosition(497)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(502), ExactPosition(507)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(511), ExactPosition(521)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(530), ExactPosition(538)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(545), ExactPosition(559)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(572), ExactPosition(576)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(576), ExactPosition(583)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(588), ExactPosition(597)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(597), ExactPosition(600)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(600), ExactPosition(603)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(604), ExactPosition(609)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(612), ExactPosition(632)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(637), ExactPosition(653)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(666), ExactPosition(673)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(673), ExactPosition(676)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(680), ExactPosition(683)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(690), ExactPosition(693)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(693), ExactPosition(696)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(696), ExactPosition(699)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(703), ExactPosition(707)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(713), ExactPosition(726)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(731), ExactPosition(737)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(741), ExactPosition(750)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(752), ExactPosition(760)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(769), ExactPosition(772)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(774), ExactPosition(777)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(783), ExactPosition(790)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(798)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(801), ExactPosition(805)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(807), ExactPosition(817)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(822), ExactPosition(834)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(836), ExactPosition(840)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(845), ExactPosition(848)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(849), ExactPosition(857)), type='helix')]
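The locations printed above follow Biopython's convention of 0-based, half-open intervals, so FeatureLocation(ExactPosition(434), ExactPosition(435)) covers the single residue numbered 435 when counting from 1. A minimal sketch of that conversion, using plain (start, end) tuples in place of FeatureLocation objects (the helper name residue_numbers is hypothetical):

```python
def residue_numbers(start, end):
    """Convert a 0-based, half-open [start, end) span to 1-based residue numbers."""
    return list(range(start + 1, end + 1))


# Metal ion-binding sites of b1276 as (start, end) pairs from the output above
metal_sites = [(434, 435), (500, 501), (503, 504)]
site_residues = [residue_numbers(s, e) for s, e in metal_sites]
print(site_residues)

# A multi-residue "region of interest" from b0118
region = residue_numbers(243, 246)
print(region)
```

For a single-residue feature the 1-based residue number equals the end position, which is why the cells in this section pass f.location.end to get_residue_annotations.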
Protein.get_residue_annotations(seq_resnum, seqprop=None, structprop=None, chain_id=None, use_representatives=False)
Get all residue-level annotations stored in the SeqProp letter_annotations field for a given residue number.
Uses the representative sequence, structure, and chain ID stored by default. If other properties from other structures are desired, input the proper IDs. An alignment for the given sequence to the structure must be present in the sequence_alignments list.
Parameters:
- seq_resnum (int) – Residue number in the sequence
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – ID of the structure’s chain to get annotation from
- use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
Returns: All available letter_annotations for this residue number
Return type: dict
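The requirement for a stored alignment exists because sequence and structure numbering can differ. The sketch below is not ssbio's implementation; it only illustrates, assuming a gapped pairwise alignment given as two equal-length strings, how a sequence residue number can be translated to a structure residue number. The function name map_seq_to_struct and the struct_start parameter are hypothetical.

```python
def map_seq_to_struct(aligned_seq, aligned_struct, struct_start=1):
    """Map 1-based sequence residue numbers to structure residue numbers.

    Both inputs are rows of a pairwise alignment with '-' marking gaps.
    Sequence residues aligned to a gap in the structure are left unmapped.
    """
    if len(aligned_seq) != len(aligned_struct):
        raise ValueError("aligned strings must have equal length")
    mapping = {}
    seq_num = 0
    struct_num = struct_start - 1
    for s, t in zip(aligned_seq, aligned_struct):
        if s != '-':
            seq_num += 1
        if t != '-':
            struct_num += 1
        if s != '-' and t != '-':
            mapping[seq_num] = struct_num
    return mapping


# Toy alignment: the structure lacks the first two residues of the sequence
aligned_seq = "MKCLAC"
aligned_struct = "--CLAC"
mapping = map_seq_to_struct(aligned_seq, aligned_struct)
print(mapping)
print(mapping.get(1))  # residue 1 has no structural counterpart
```

In the tables of this section seq_resnum and struct_resnum happen to be equal, but that is a property of these particular structures rather than a general rule.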
In [30]:
metal_info = []
for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
            res_info['gene_id'] = g.id
            res_info['seq_id'] = g.protein.representative_sequence.id
            res_info['struct_id'] = g.protein.representative_structure.id
            res_info['chain_id'] = g.protein.representative_chain
            metal_info.append(res_info)
cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
        'seq_residue', 'seq_resnum', 'struct_residue','struct_resnum',
        'seq_SS-sspro','seq_SS-sspro8','seq_RSA-accpro','seq_RSA-accpro20',
        'struct_SS-dssp','struct_RSA-dssp', 'struct_ASA-dssp',
        'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']
pd.DataFrame.from_records(metal_info, columns=cols).set_index(['gene_id', 'seq_id', 'struct_id', 'chain_id', 'seq_resnum'])
Out[30]:
seq_residue | struct_residue | struct_resnum | seq_SS-sspro | seq_SS-sspro8 | seq_RSA-accpro | seq_RSA-accpro20 | struct_SS-dssp | struct_RSA-dssp | struct_ASA-dssp | struct_PHI-dssp | struct_PSI-dssp | struct_CA_DEPTH-msms | struct_RES_DEPTH-msms | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene_id | seq_id | struct_id | chain_id | seq_resnum | ||||||||||||||
b1276 | P25516 | REP-ACON1_ECOLI | X | 435 | C | C | 435 | NaN | NaN | NaN | NaN | H | 0.059259 | 8.0 | -61.1 | -26.6 | 2.656722 | 2.813536 |
501 | C | C | 501 | NaN | NaN | NaN | NaN | S | 0.088889 | 12.0 | -61.0 | -50.0 | 1.999713 | 2.409119 | ||||
504 | C | C | 504 | NaN | NaN | NaN | NaN | G | 0.259259 | 35.0 | -56.0 | -45.6 | 1.999634 | 1.961484 | ||||
b0118 | P36683 | REP-1l5j | A | 710 | C | C | 710 | NaN | NaN | NaN | NaN | T | 0.118519 | 16.0 | -67.1 | -7.2 | 10.148960 | 10.009109 |
769 | C | C | 769 | NaN | NaN | NaN | NaN | - | 0.088889 | 12.0 | -67.8 | -28.3 | 8.296585 | 8.049832 | ||||
772 | C | C | 772 | NaN | NaN | NaN | NaN | G | 0.081481 | 11.0 | -50.2 | -38.0 | 8.282292 | 8.239369 |
- gene_id: Gene ID used in GEM-PRO project
- seq_id: Representative protein sequence ID
- struct_id: Representative protein structure ID, with REP- prepended to it. 4 letter structure IDs are experimental structures from the PDB, others are homology models
- chain_id: Representative chain ID in the representative structure
- seq_resnum: Residue number of the amino acid in the representative sequence
- site_name: Name of the feature as defined in UniProt
- seq_residue: Amino acid in the representative sequence at the residue number
- struct_residue: Amino acid in the representative structure at the residue number
- struct_resnum: Residue number of the amino acid in the representative structure
- seq_SS-sspro: Predicted secondary structure, 3 definitions (from the SCRATCH program)
- seq_SS-sspro8: Predicted secondary structure, 8 definitions (SCRATCH)
- seq_RSA-accpro: Predicted exposed (e) or buried (-) residue (SCRATCH)
- seq_RSA-accpro20: Predicted exposed/buried, 0 to 100 scale (SCRATCH)
- struct_SS-dssp: Secondary structure (DSSP program)
- struct_RSA-dssp: Relative solvent accessibility (DSSP)
- struct_ASA-dssp: Solvent accessibility, absolute value (DSSP)
- struct_PHI-dssp: Phi angle measure (DSSP)
- struct_PSI-dssp: Psi angle measure (DSSP)
- struct_RES_DEPTH-msms: Calculated residue depth averaged for all atoms in the residue (MSMS program)
- struct_CA_DEPTH-msms: Calculated residue depth for the carbon alpha atom (MSMS)
StructProp.view_structure(only_chains=None, opacity=1.0, recolor=False, gui=False)
Use NGLviewer to display a structure in a Jupyter notebook.
Parameters:
- only_chains (str, list) – Chain ID or IDs to display
- opacity (float) – Opacity of the structure
- recolor (bool) – If structure should be cleaned and recolored to silver
- gui (bool) – If the NGLview GUI should show up
Returns: NGLviewer object
StructProp.add_residues_highlight_to_nglview(view, structure_resnums, chain=None, res_color='red')
Add a residue number or numbers to an NGLWidget view object.
Parameters:
- view (NGLWidget) – NGLWidget view object
- structure_resnums (int, list) – Residue number(s) to highlight, structure numbering
- chain (str, list) – Chain ID or IDs that the residues are a part of. If not provided, all chains in the mapped_chains attribute will be used. If that is also empty, an exception is raised.
- res_color (str) – Color to highlight residues with
In [31]:
for g in my_gempro.genes:
    # Gather residue numbers
    metal_binding_structure_residues = []
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
            metal_binding_structure_residues.append(res_info['struct_resnum'])
    print(metal_binding_structure_residues)
    # Display structure
    view = g.protein.representative_structure.view_structure()
    g.protein.representative_structure.add_residues_highlight_to_nglview(view=view, structure_resnums=metal_binding_structure_residues)
    view
[435, 501, 504]
[2018-02-05 17:00] [ssbio.protein.structure.structprop] INFO: Selection: ( :X ) and not hydrogen and ( 504 or 435 or 501 )
A Jupyter Widget
[710, 769, 772]
[2018-02-05 17:00] [ssbio.protein.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and ( 769 or 772 or 710 )
A Jupyter Widget
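The INFO lines above show the NGL selection string that was applied to each view. As a sketch only, such a string can be assembled from chain IDs and residue numbers; the helper build_ngl_selection is hypothetical and merely mimics the logged format, it is not taken from the ssbio source.

```python
def build_ngl_selection(chains, resnums):
    """Assemble a selection such as '( :X ) and not hydrogen and ( 435 or 501 )'."""
    chain_part = ' or '.join(':{}'.format(c) for c in chains)
    # Sort and de-duplicate so the result is deterministic
    res_part = ' or '.join(str(r) for r in sorted(set(resnums)))
    return '( {} ) and not hydrogen and ( {} )'.format(chain_part, res_part)


selection = build_ngl_selection(['X'], [435, 501, 504])
print(selection)
```

The residue order differs from the logged string because this sketch sorts the numbers; since the terms are joined with "or", the order does not change which residues are selected.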
In [32]:
# Run all sequence to structure alignments
for g in my_gempro.genes:
    for s in g.protein.structures:
        g.protein.align_seqprop_to_structprop(seqprop=g.protein.representative_sequence, structprop=s)
In [33]:
metal_info_compared = []
for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            for s in g.protein.structures:
                for c in s.mapped_chains:
                    res_info = g.protein.get_residue_annotations(seq_resnum=f.location.end,
                                                                 seqprop=g.protein.representative_sequence,
                                                                 structprop=s, chain_id=c,
                                                                 use_representatives=False)
                    res_info['gene_id'] = g.id
                    res_info['seq_id'] = g.protein.representative_sequence.id
                    res_info['struct_id'] = s.id
                    res_info['chain_id'] = c
                    metal_info_compared.append(res_info)
cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
        'seq_residue', 'seq_resnum', 'struct_residue','struct_resnum',
        'seq_SS-sspro','seq_SS-sspro8','seq_RSA-accpro','seq_RSA-accpro20',
        'struct_SS-dssp','struct_RSA-dssp', 'struct_ASA-dssp',
        'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']
pd.DataFrame.from_records(metal_info_compared, columns=cols).sort_values(by=['seq_resnum','struct_id','chain_id']).set_index(['gene_id','seq_id','seq_resnum','seq_residue','struct_id'])
Out[33]:
chain_id | struct_residue | struct_resnum | seq_SS-sspro | seq_SS-sspro8 | seq_RSA-accpro | seq_RSA-accpro20 | struct_SS-dssp | struct_RSA-dssp | struct_ASA-dssp | struct_PHI-dssp | struct_PSI-dssp | struct_CA_DEPTH-msms | struct_RES_DEPTH-msms | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene_id | seq_id | seq_resnum | seq_residue | struct_id | ||||||||||||||
b1276 | P25516 | 435 | C | ACON1_ECOLI | X | C | 435 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
E01201 | X | C | 435 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
REP-ACON1_ECOLI | X | C | 435 | NaN | NaN | NaN | NaN | H | 0.059259 | 8.0 | -61.1 | -26.6 | 2.656722 | 2.813536 | ||||
501 | C | ACON1_ECOLI | X | C | 501 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||
E01201 | X | C | 501 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
REP-ACON1_ECOLI | X | C | 501 | NaN | NaN | NaN | NaN | S | 0.088889 | 12.0 | -61.0 | -50.0 | 1.999713 | 2.409119 | ||||
504 | C | ACON1_ECOLI | X | C | 504 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||
E01201 | X | C | 504 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
REP-ACON1_ECOLI | X | C | 504 | NaN | NaN | NaN | NaN | G | 0.259259 | 35.0 | -56.0 | -45.6 | 1.999634 | 1.961484 | ||||
b0118 | P36683 | 710 | C | 1l5j | A | C | 710 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1l5j | B | C | 710 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
ACON2_ECOLI | X | C | 710 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
E00113 | X | C | 710 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
REP-1l5j | A | C | 710 | NaN | NaN | NaN | NaN | T | 0.118519 | 16.0 | -67.1 | -7.2 | 10.148960 | 10.009109 | ||||
769 | C | 1l5j | A | C | 769 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||
1l5j | B | C | 769 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
ACON2_ECOLI | X | C | 769 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
E00113 | X | C | 769 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
REP-1l5j | A | C | 769 | NaN | NaN | NaN | NaN | - | 0.088889 | 12.0 | -67.8 | -28.3 | 8.296585 | 8.049832 | ||||
772 | C | 1l5j | A | C | 772 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||
1l5j | B | C | 772 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
ACON2_ECOLI | X | C | 772 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
E00113 | X | C | 772 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||||
REP-1l5j | A | C | 772 | NaN | NaN | NaN | NaN | G | 0.081481 | 11.0 | -50.2 | -38.0 | 8.282292 | 8.239369 |
GEM-PRO - Genes & Sequences¶
This notebook gives an example of how to run the GEM-PRO pipeline with a dictionary of gene IDs and their protein sequences.
Imports¶
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.
- CRITICAL - Only really important messages shown
- ERROR - Major errors
- WARNING - Warnings that don’t affect running of the pipeline
- INFO (default) - Info such as the number of structures mapped per gene
- DEBUG - Really detailed information that will print out a lot of stuff
DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
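To see the effect of the level before running the pipeline, the sketch below configures a separate logger with the same formatter and writes to an in-memory buffer instead of stderr. The logger name demo.gempro is hypothetical and chosen only so that the root logger configured above is left untouched. Messages below the chosen level are dropped.

```python
import io
import logging

buffer = io.StringIO()
demo_handler = logging.StreamHandler(buffer)
demo_handler.setFormatter(logging.Formatter(
    '[%(asctime)s] [%(name)s] %(levelname)s: %(message)s',
    datefmt="%Y-%m-%d %H:%M"))

demo_logger = logging.getLogger('demo.gempro')  # hypothetical logger name
demo_logger.handlers = [demo_handler]
demo_logger.propagate = False  # keep the demo separate from the root logger
demo_logger.setLevel(logging.INFO)

demo_logger.debug('hidden at INFO level')
demo_logger.info('shown at INFO level')
demo_logger.warning('also shown')

output = buffer.getvalue()
print(output)
```

Switching setLevel to logging.DEBUG would let the first message through as well.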
Initialization of the project¶
Set these three things:
- ROOT_DIR - The directory where a folder named after your PROJECT will be created
- PROJECT - Your project name
- GENES_AND_SEQUENCES - Your dictionary of gene IDs and their protein sequences
A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:
ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
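For orientation, the top-level and per-gene parts of this layout can be reproduced with the standard library. This is only a sketch of the folder structure described above; ssbio creates the project directory itself, and the helper name make_project_skeleton and the project name demo_GP are hypothetical.

```python
import tempfile
from pathlib import Path


def make_project_skeleton(root_dir, project, gene_ids):
    """Hypothetical helper that mirrors the documented folder layout."""
    base = Path(root_dir) / project
    for top in ('data', 'model', 'genes', 'reactions', 'metabolites'):
        (base / top).mkdir(parents=True, exist_ok=True)
    for gene_id in gene_ids:
        for sub in ('sequences', 'structures'):
            (base / 'genes' / gene_id / 'protein' / sub).mkdir(
                parents=True, exist_ok=True)
    return base


with tempfile.TemporaryDirectory() as tmp:
    base = make_project_skeleton(tmp, 'demo_GP', ['b0870', 'b3041'])
    # Collect every created directory, relative to the project folder
    created = sorted(p.relative_to(base).as_posix()
                     for p in base.rglob('*') if p.is_dir())
    print(created)
```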
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROJECT = 'genes_and_sequences_GP'
GENES_AND_SEQUENCES = {'b0870': 'MIDLRSDTVTRPSRAMLEAMMAAPVGDDVYGDDPTVNALQDYAAELSGKEAAIFLPTGTQANLVALLSHCERGEEYIVGQAAHNYLFEAGGAAVLGSIQPQPIDAAADGTLPLDKVAMKIKPDDIHFARTKLLSLENTHNGKVLPREYLKEAWEFTRERNLALHVDGARIFNAVVAYGCELKEITQYCDSFTICLSKGLGTPVGSLLVGNRDYIKRAIRWRKMTGGGMRQSGILAAAGIYALKNNVARLQEDHDNAAWMAEQLREAGADVMRQDTNMLFVRVGEENAAALGEYMKARNVLINASPIVRLVTHLDVSREQLAEVAAHWRAFLAR',
'b3041': 'MNQTLLSSFGTPFERVENALAALREGRGVMVLDDEDRENEGDMIFPAETMTVEQMALTIRHGSGIVCLCITEDRRKQLDLPMMVENNTSAYGTGFTVTIEAAEGVTTGVSAADRITTVRAAIADGAKPSDLNRPGHVFPLRAQAGGVLTRGGHTEATIDLMTLAGFKPAGVLCELTNDDGTMARAPECIEFANKHNMALVTIEDLVAYRQAHERKAS'}
PDB_FILE_TYPE = 'mmtf'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_and_sequences=GENES_AND_SEQUENCES, pdb_file_type=PDB_FILE_TYPE)
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: /tmp/genes_and_sequences_GP: GEM-PRO project location
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: 2: number of genes
Mapping sequence –> structure¶
Since the sequences have been provided, we just need to BLAST them to the PDB.
GEMPRO.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)
BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).
Parameters:
- seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
In [8]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: 2: number of genes with additional structures added from BLAST
Out[8]:
pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar | |
---|---|---|---|---|---|---|---|---|
gene | ||||||||
b0870 | 3wlx | A | 1713.0 | 0.0 | 1.0 | 1.0 | 333 | 333 |
b0870 | 3wlx | B | 1713.0 | 0.0 | 1.0 | 1.0 | 333 | 333 |
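One way to think about the seq_ident_cutoff and evalue arguments is as filters on the hit table. The sketch below is an assumption about that behaviour, not the ssbio implementation: it keeps hits whose E-value is at or below the cutoff and whose fractional identity is at or above the cutoff. The field names follow the columns of df_pdb_blast above; the helper filter_hits and the second hit are made up for illustration.

```python
def filter_hits(hits, seq_ident_cutoff=0.0, evalue=0.0001):
    """Keep hits passing both the E-value and the fractional identity cutoff."""
    return [h for h in hits
            if h['hit_evalue'] <= evalue
            and h['hit_percent_ident'] >= seq_ident_cutoff]


hits = [
    # Values taken from the first row of the table above
    {'pdb_id': '3wlx', 'pdb_chain_id': 'A',
     'hit_evalue': 0.0, 'hit_percent_ident': 1.0},
    # Hypothetical weak hit, invented for this example
    {'pdb_id': 'xxxx', 'pdb_chain_id': 'A',
     'hit_evalue': 0.01, 'hit_percent_ident': 0.35},
]
kept = filter_hits(hits, seq_ident_cutoff=0.9, evalue=0.00001)
kept_ids = [h['pdb_id'] for h in kept]
print(kept_ids)
```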
Downloading and ranking structures¶
GEMPRO.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)
Download ALL mapped experimental structures to each protein’s structures directory.
Parameters:
- outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
In [9]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Saved 11 structures total
Out[9]:
pdb_id | pdb_title | description | experimental_method | mapped_chains | resolution | chemicals | taxonomy_name | structure_file | |
---|---|---|---|---|---|---|---|---|---|
gene | |||||||||
b0870 | 3wlx | Crystal structure of low-specificity L-threoni... | Low specificity L-threonine aldolase (E.C.4.1.... | X-RAY DIFFRACTION | A;B | 2.51 | PLG | Escherichia coli | 3wlx.mmtf |
b0870 | 4lnj | Structure of Escherichia coli Threonine Aldola... | Low-specificity L-threonine aldolase (E.C.4.1.... | X-RAY DIFFRACTION | A;B | 2.10 | EPE;MG;PLR | Escherichia coli | 4lnj.mmtf |
GEMPRO.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)
Set a representative structure for each protein, chosen from the structures in its structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.
Parameters: - seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
- clean (bool) – If structures should be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
Todo
- Remedy large structure representative setting
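The allow_missing_on_termini example above can be worked through in a few lines. This is only a sketch of the arithmetic in the parameter description, not ssbio's code: `checked_residue_range` is a hypothetical helper, and it assumes the ignored fraction is split evenly between the two termini.

```python
# Hypothetical helper, not part of ssbio: reproduces the docstring's worked example.
def checked_residue_range(seq_len, allow_missing_on_termini):
    """Return (start, end) of the region checked for modifications.

    Assumes the ignored fraction is split evenly between both termini.
    """
    ignored_per_terminus = int(seq_len * allow_missing_on_termini / 2)
    return ignored_per_terminus, seq_len - ignored_per_terminus


# 10% of a 100 AA reference is ignored, 5 residues at each end
print(checked_residue_range(100, 0.1))  # (5, 95)
```

With the default of 0.2, a 100 AA reference would be checked only between residues 10 and 90 under the same assumption.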
In [10]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[10]:
id | is_experimental | file_type | structure_file | |
---|---|---|---|---|
gene | ||||
b0870 | REP-3wlx | True | pdb | 3wlx-A_clean.pdb |
b3041 | REP-1iez | True | pdb | 1iez-A_clean.pdb |
In [11]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b0870').protein.representative_structure
my_gempro.genes.get_by_id('b0870').protein.representative_structure.get_dict()
Out[11]:
<StructProp REP-3wlx at 0x7f9a0ae345f8>
Out[11]:
{'_structure_dir': '/tmp/genes_and_sequences_GP/genes/b0870/b0870_protein/structures',
'chains': [<ChainProp A at 0x7f99fbd55710>],
'date': None,
'description': 'Low specificity L-threonine aldolase (E.C.4.1.2.48)',
'file_type': 'pdb',
'id': 'REP-3wlx',
'is_experimental': True,
'mapped_chains': ['A'],
'notes': {},
'original_structure_id': '3wlx',
'resolution': 2.51,
'structure_file': '3wlx-A_clean.pdb',
'taxonomy_name': 'Escherichia coli'}
Creating homology models¶
For those proteins with no representative structure, we can create homology models. ssbio contains some built-in functions for easily running I-TASSER locally or on machines with SLURM (e.g. on NERSC) or Torque job scheduling.
Once the I-TASSER models are complete, you can load them in later using the get_itasser_models method.
In [12]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_modeling('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 0 genes in folder /tmp/genes_and_sequences_GP/data/homology_models
Saving your GEM-PRO¶
GEMPRO.save_json(outfile, compression=False)
Save the object as a JSON file using json_tricks
In [13]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2018-02-05 18:12] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:12] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_and_sequences_GP) to /tmp/genes_and_sequences_GP/model/genes_and_sequences_GP.json
GEM-PRO - List of Gene IDs¶
This notebook gives an example of how to run the GEM-PRO pipeline with a list of gene IDs.
Imports¶
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.
- CRITICAL - Only really important messages shown
- ERROR - Major errors
- WARNING - Warnings that don’t affect running of the pipeline
- INFO (default) - Info such as the number of structures mapped per gene
- DEBUG - Really detailed information that will print out a lot of stuff
DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
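As a small illustration of how these levels filter messages, the sketch below uses only the standard library logging module. The logger name `gempro_logging_demo` is an arbitrary choice for this example and has nothing to do with ssbio's own loggers.

```python
import logging

# A throwaway logger name, used only for this illustration
demo_logger = logging.getLogger("gempro_logging_demo")
demo_logger.setLevel(logging.INFO)

# Levels are integers; a message is emitted only if its level is at least
# as high as the level set on the logger
for name in ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"):
    level = getattr(logging, name)
    print(name, level, demo_logger.isEnabledFor(level))
```

With the level set to INFO, everything except DEBUG is reported as enabled, which is why INFO is a reasonable default for large gene lists.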
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Initialization of the project¶
Set these three things:
- ROOT_DIR - The directory where a folder named after your PROJECT will be created
- PROJECT - Your project name
- LIST_OF_GENES - Your list of gene IDs
A directory will be created in ROOT_DIR with your PROJECT name.
The folders are organized like so:
ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
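The pipeline creates these folders for you, but it can help to see how a per-gene path is composed. The sketch below is an assumption-laden illustration: `gene_dirs` is a hypothetical helper, and the `<gene_id>_protein` folder name follows the `_structure_dir` values printed in these notebooks (for example `.../genes/b1187/b1187_protein/structures`), which the tree abbreviates as `protein`.

```python
import os
import tempfile


def gene_dirs(root_dir, project, gene_id):
    """Return the (sequences, structures) directories for one gene (hypothetical helper)."""
    protein_dir = os.path.join(root_dir, project, "genes", gene_id, gene_id + "_protein")
    return os.path.join(protein_dir, "sequences"), os.path.join(protein_dir, "structures")


root = tempfile.mkdtemp()  # scratch directory so nothing real is touched
seq_dir, struct_dir = gene_dirs(root, "genes_GP", "b1187")
for directory in (seq_dir, struct_dir):
    os.makedirs(directory, exist_ok=True)

# Show the path relative to the scratch root
print(os.path.relpath(struct_dir, root))
```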
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROJECT = 'genes_GP'
LIST_OF_GENES = ['b0761', 'b0889', 'b0995', 'b1013', 'b1014', 'b1040', 'b1130', 'b1187', 'b1221', 'b1299']
PDB_FILE_TYPE = 'mmtf'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type=PDB_FILE_TYPE)
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: /tmp/genes_GP: GEM-PRO project location
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10: number of genes
Mapping gene ID –> sequence¶
First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.
GEMPRO.kegg_mapping_and_metadata(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)
Map all genes in the model to KEGG IDs using the KEGG service.
Steps:
- Downloads all metadata and sequence files to the sequences directory
- Creates a KEGGProp object in the protein.sequences attribute
- Returns a Pandas DataFrame of mapping results
Parameters: - kegg_organism_code (str) – The three letter KEGG code of your organism
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
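A custom_gene_mapping is just a dictionary keyed by model gene IDs. The sketch below shows one way such a dictionary could be applied before querying a mapping service. The left-hand IDs are hypothetical, and the fallback to the gene's own ID when no entry exists is an assumption of this example, not a documented ssbio behavior.

```python
# Hypothetical IDs: model gene IDs on the left, IDs to look up on the right
custom_gene_mapping = {"gene_A_variant": "b0761", "gene_B_variant": "b0889"}
model_genes = ["gene_A_variant", "gene_B_variant", "b0995"]

# Assumption for this sketch: genes without an entry are looked up under their own ID
ids_to_query = {gene: custom_gene_mapping.get(gene, gene) for gene in model_genes}
print(ids_to_query)

# Dictionary keys must match model gene IDs, so flag any key that does not
unknown_keys = [key for key in custom_gene_mapping if key not in model_genes]
print(unknown_keys)
```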
In [8]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='eco')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10/10: number of genes mapped to KEGG
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.
Missing KEGG mapping: []
Out[8]:
kegg | refseq | uniprot | pdbs | sequence_file | metadata_file | |
---|---|---|---|---|---|---|
gene | ||||||
b0761 | eco:b0761 | NP_415282 | P0A9G8 | 1B9M;1H9S;1B9N;1O7L;1H9R | eco-b0761.faa | eco-b0761.kegg |
b0889 | eco:b0889 | NP_415409 | P0ACJ0 | 2GQQ;2L4A | eco-b0889.faa | eco-b0889.kegg |
b0995 | eco:b0995 | NP_415515 | P38684 | 1ZGZ | eco-b0995.faa | eco-b0995.kegg |
b1013 | eco:b1013 | NP_415533 | P0ACU2 | 4JYK;4XK4;4X1E;3LOC | eco-b1013.faa | eco-b1013.kegg |
b1014 | eco:b1014 | NP_415534 | P09546 | 3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1... | eco-b1014.faa | eco-b1014.kegg |
GEMPRO.uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)
Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.
Parameters: - model_gene_source (str) – The database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping
Common model gene sources are:
  - Ensembl Genomes - ENSEMBLGENOME_ID (e.g. E. coli b-numbers)
  - Entrez Gene (GeneID) - P_ENTREZGENEID
  - RefSeq Protein - P_REFSEQ_AC
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
In [9]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2018-02-05 18:12] [root] INFO: getUserAgent: Begin
[2018-02-05 18:12] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:12] [root] INFO: getUserAgent: End
[2018-02-05 18:12] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:12] [root] WARNING: Results seems empty...returning empty dictionary.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 0/10: number of genes mapped to UniProt
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Missing UniProt mapping: ['b1130', 'b0889', 'b1221', 'b0761', 'b0995', 'b1187', 'b1013', 'b1299', 'b1014', 'b1040']
[2018-02-05 18:12] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[9]:
uniprot | reviewed | gene_name | kegg | refseq | num_pdbs | pdbs | ec_number | pfam | seq_len | description | entry_date | entry_version | seq_date | seq_version | sequence_file | metadata_file | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene |
GEMPRO.set_representative_sequence(force_rerun=False)
Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.
Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
Parameters: force_rerun (bool) – Set to True to recheck stored sequences
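The precedence rule described above can be written out as a small decision function. This is a sketch of the stated rule only, not the library's code: `pick_representative` and the dictionary shape with a `pdbs` key are hypothetical.

```python
def pick_representative(manual=None, uniprot=None, kegg=None):
    """Return the name of the source that would win (hypothetical sketch).

    Each argument is None or a dict such as {"pdbs": ["1B9M"]}.
    """
    if manual is not None:
        # Manually set sequences override everything else
        return "manual"
    if uniprot is not None and kegg is not None:
        # KEGG wins only when it carries PDBs and UniProt carries none
        if kegg.get("pdbs") and not uniprot.get("pdbs"):
            return "kegg"
        return "uniprot"
    if uniprot is not None:
        return "uniprot"
    if kegg is not None:
        return "kegg"
    return None


print(pick_representative(uniprot={"pdbs": []}, kegg={"pdbs": ["1B9M"]}))  # kegg
print(pick_representative(uniprot={"pdbs": ["1B9M"]}, kegg={"pdbs": ["1B9M"]}))  # uniprot
```

In the run shown here the UniProt mapping returned nothing, so only the KEGG branch would apply and every gene ends up with its KEGG sequence as representative.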
In [10]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10/10: number of genes with a representative sequence
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.
Missing a representative sequence: []
Out[10]:
uniprot | kegg | pdbs | sequence_file | metadata_file | |
---|---|---|---|---|---|
gene | |||||
b0761 | P0A9G8 | eco:b0761 | 1B9M;1H9S;1B9N;1O7L;1H9R | eco-b0761.faa | eco-b0761.kegg |
b0889 | P0ACJ0 | eco:b0889 | 2GQQ;2L4A | eco-b0889.faa | eco-b0889.kegg |
b0995 | P38684 | eco:b0995 | 1ZGZ | eco-b0995.faa | eco-b0995.kegg |
b1013 | P0ACU2 | eco:b1013 | 4JYK;4XK4;4X1E;3LOC | eco-b1013.faa | eco-b1013.kegg |
b1014 | P09546 | eco:b1014 | 3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1... | eco-b1014.faa | eco-b1014.kegg |
Mapping representative sequence –> structure¶
These are the ways to map sequence to structure:
1. Use the UniProt ID and their automatic mappings to the PDB
2. BLAST the sequence to the PDB
3. Make homology models
4. Map to existing homology models
You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.
GEMPRO.map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)
Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.
The “Best structures” API is documented at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html and returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the same, resolution.
Parameters: - seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
- outdir (str) – Output directory to cache JSON results of search
- force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns: A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type: list
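The ranking rule (coverage first, resolution as the tie-breaker) can be illustrated with a plain sort. The records and field names below are hypothetical and are not the API's response schema, and the sketch assumes higher coverage and lower (better) resolution rank first.

```python
# Hypothetical records; the field names are illustrative, not the API's schema
structures = [
    {"pdb_id": "aaaa", "coverage": 0.95, "resolution": 2.5},
    {"pdb_id": "bbbb", "coverage": 1.00, "resolution": 3.0},
    {"pdb_id": "cccc", "coverage": 1.00, "resolution": 1.8},
]

# Sort by descending coverage, then ascending resolution to break ties
ranked = sorted(structures, key=lambda s: (-s["coverage"], s["resolution"]))
print([s["pdb_id"] for s in ranked])  # ['cccc', 'bbbb', 'aaaa']
```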
In [11]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 18:12] [root] INFO: getUserAgent: Begin
[2018-02-05 18:12] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:12] [root] INFO: getUserAgent: End
[2018-02-05 18:12] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:12] [root] WARNING: Results seems empty...returning empty dictionary.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 0/10: number of genes with at least one experimental structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[11]:
GEMPRO.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)
BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).
Parameters: - seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
In [12]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 5: number of genes with additional structures added from BLAST
Out[12]:
pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar | |
---|---|---|---|---|---|---|---|---|
gene | ||||||||
b0761 | 1b9n | B | 1091.0 | 5.530720e-119 | 0.931298 | 0.931298 | 244 | 244 |
b0761 | 1o7l | D | 1089.0 | 1.096280e-118 | 0.931298 | 0.931298 | 244 | 244 |
Downloading and ranking structures¶
GEMPRO.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)
Download ALL mapped experimental structures to each protein’s structures directory.
Parameters: - outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
In [13]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Saved 15 structures total
Out[13]:
chemicals | description | experimental_method | mapped_chains | pdb_id | pdb_title | resolution | structure_file | taxonomy_name | |
---|---|---|---|---|---|---|---|---|---|
gene | |||||||||
b0761 | NI | ModE (MOLYBDATE-DEPENDENT TRANSCRIPTIONAL REGU... | X-RAY DIFFRACTION | A;B | 1b9m | REGULATOR FROM ESCHERICHIA COLI | 1.75 | 1b9m.mmtf | Escherichia coli |
b0761 | NI | MODE (MOLYBDATE DEPENDENT TRANSCRIPTIONAL REGU... | X-RAY DIFFRACTION | A;B | 1b9n | REGULATOR FROM ESCHERICHIA COLI | 2.09 | 1b9n.mmtf | Escherichia coli |
GEMPRO.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)
Set a representative structure for each protein, chosen from the structures in its structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.
Parameters: - seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
- clean (bool) – If structures should be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
Todo
- Remedy large structure representative setting
In [14]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 5/10: number of genes with a representative structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[14]:
id | is_experimental | file_type | structure_file | |
---|---|---|---|---|
gene | ||||
b0761 | REP-1b9n | True | pdb | 1b9n-A_clean.pdb |
b0889 | REP-2gqq | True | pdb | 2gqq-A_clean.pdb |
b1013 | REP-4xk4 | True | pdb | 4xk4-A_clean.pdb |
b1187 | REP-1h9t | True | pdb | 1h9t-A_clean.pdb |
b1221 | REP-1rnl | True | pdb | 1rnl-A_clean.pdb |
In [15]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b1187').protein.representative_structure
my_gempro.genes.get_by_id('b1187').protein.representative_structure.get_dict()
Out[15]:
<StructProp REP-1h9t at 0x7f3880a275c0>
Out[15]:
{'_structure_dir': '/tmp/genes_GP/genes/b1187/b1187_protein/structures',
'chains': [<ChainProp A at 0x7f38830e1c18>],
'date': None,
'description': 'FATTY ACID METABOLISM REGULATOR PROTEIN',
'file_type': 'pdb',
'id': 'REP-1h9t',
'is_experimental': True,
'mapped_chains': ['A'],
'notes': {},
'original_structure_id': '1h9t',
'resolution': 3.25,
'structure_file': '1h9t-A_clean.pdb',
'taxonomy_name': 'ESCHERICHIA COLI'}
Saving your GEM-PRO¶
GEMPRO.save_json(outfile, compression=False)
Save the object as a JSON file using json_tricks
In [16]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2018-02-05 18:12] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:12] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_GP) to /tmp/genes_GP/model/genes_GP.json
GEM-PRO - SBML Model¶
This notebook gives an example of how to run the GEM-PRO pipeline with an SBML model, in this case iNJ661, the metabolic model of M. tuberculosis.
Imports¶
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.
- CRITICAL - Only really important messages shown
- ERROR - Major errors
- WARNING - Warnings that don’t affect running of the pipeline
- INFO (default) - Info such as the number of structures mapped per gene
- DEBUG - Really detailed information that will print out a lot of stuff
DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Initialization of the project¶
Set these three things:
- ROOT_DIR - The directory where a folder named after your PROJECT will be created
- PROJECT - Your project name
- LIST_OF_GENES - Your list of gene IDs
A directory will be created in ROOT_DIR with your PROJECT name.
The folders are organized like so:
ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROJECT = 'mtuberculosis_gp'
GEM_FILE = '../../ssbio/test/test_files/models/iNJ661.json'
GEM_FILE_TYPE = 'json'
PDB_FILE_TYPE = 'mmtf'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, gem_file_path=GEM_FILE, gem_file_type=GEM_FILE_TYPE, pdb_file_type=PDB_FILE_TYPE)
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: /tmp/mtuberculosis_gp: GEM-PRO project location
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: iNJ661: loaded model
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 1025: number of reactions
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 720: number of reactions linked to a gene
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 661: number of genes (excluding spontaneous)
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 826: number of metabolites
[2018-02-05 18:13] [ssbio.pipeline.gempro] WARNING: IMPORTANT: All Gene objects have been transformed into GenePro objects, and will be for any new ones
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 661: number of genes
Mapping gene ID –> sequence¶
First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.
However, you don’t need to map using these services if you already have the amino acid sequences for each protein. You can just manually load in the sequences as shown using the method manual_seq_mapping. Or, if you already have the UniProt IDs, you can load those in using the method manual_uniprot_mapping.
GEMPRO.manual_seq_mapping(gene_to_seq_dict, outdir=None, write_fasta_files=True, set_as_representative=True)
Read a manual input dictionary of model gene IDs –> protein sequences. By default sets them as representative.
Parameters: - gene_to_seq_dict (dict) – Mapping of gene IDs to their protein sequence strings
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- write_fasta_files (bool) – If individual protein FASTA files should be written out
- set_as_representative (bool) – If mapped sequences should be set as representative
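To show what writing individual protein FASTA files from such a dictionary involves, here is a standard-library sketch. It is not how ssbio writes its files: `write_fasta`, the `.faa` naming, the 10-character line width, and the gene ID and sequence are all hypothetical choices for illustration.

```python
import os
import tempfile


def write_fasta(gene_id, sequence, outdir, width=10):
    """Write one sequence to <outdir>/<gene_id>.faa and return the path (hypothetical helper)."""
    path = os.path.join(outdir, gene_id + ".faa")
    with open(path, "w") as handle:
        handle.write(">" + gene_id + "\n")
        # Wrap the sequence into fixed-width lines
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
    return path


# Made-up gene ID and short made-up sequence
gene_to_seq_dict = {"geneX": "MKTAYIAKQRQISFVKSHFSRQ"}
outdir = tempfile.mkdtemp()
paths = {gene: write_fasta(gene, seq, outdir) for gene, seq in gene_to_seq_dict.items()}

with open(paths["geneX"]) as handle:
    lines = handle.read().splitlines()
print(lines)
```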
In [8]:
gene_to_seq_dict = {'Rv1295': 'MTVPPTATHQPWPGVIAAYRDRLPVGDDWTPVTLLEGGTPLIAATNLSKQTGCTIHLKVEGLNPTGSFKDRGMTMAVTDALAHGQRAVLCASTGNTSASAAAYAARAGITCAVLIPQGKIAMGKLAQAVMHGAKIIQIDGNFDDCLELARKMAADFPTISLVNSVNPVRIEGQKTAAFEIVDVLGTAPDVHALPVGNAGNITAYWKGYTEYHQLGLIDKLPRMLGTQAAGAAPLVLGEPVSHPETIATAIRIGSPASWTSAVEAQQQSKGRFLAASDEEILAAYHLVARVEGVFVEPASAASIAGLLKAIDDGWVARGSTVVCTVTGNGLKDPDTALKDMPSVSPVPVDPVAVVEKLGLA',
'Rv2233': 'VSSPRERRPASQAPRLSRRPPAHQTSRSSPDTTAPTGSGLSNRFVNDNGIVTDTTASGTNCPPPPRAAARRASSPGESPQLVIFDLDGTLTDSARGIVSSFRHALNHIGAPVPEGDLATHIVGPPMHETLRAMGLGESAEEAIVAYRADYSARGWAMNSLFDGIGPLLADLRTAGVRLAVATSKAEPTARRILRHFGIEQHFEVIAGASTDGSRGSKVDVLAHALAQLRPLPERLVMVGDRSHDVDGAAAHGIDTVVVGWGYGRADFIDKTSTTVVTHAATIDELREALGV'}
my_gempro.manual_seq_mapping(gene_to_seq_dict)
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences
GEMPRO.manual_uniprot_mapping(gene_to_uniprot_dict, outdir=None, set_as_representative=True)
Read a manual dictionary of model gene IDs –> UniProt IDs. By default sets them as representative.
This allows for mapping of the missing genes, or overriding of automatic mappings.
Input a dictionary of:
{ <gene_id1>: <uniprot_id1>, <gene_id2>: <uniprot_id2>, }
Parameters: - gene_to_uniprot_dict – Dictionary of mappings as shown above
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
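Before passing a dictionary to manual_uniprot_mapping, it can be worth checking that the values look like UniProt accessions. The sketch below uses the accession pattern published in the UniProt help pages. The `invalid_accessions` helper is hypothetical, and nothing here implies that ssbio performs this check itself.

```python
import re

# Accession pattern as published in the UniProt help pages
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def invalid_accessions(gene_to_uniprot_dict):
    """Return the entries whose value does not look like a UniProt accession."""
    return {
        gene: accession
        for gene, accession in gene_to_uniprot_dict.items()
        if not UNIPROT_ACCESSION.fullmatch(accession)
    }


manual_uniprot_dict = {'Rv1755c': 'P9WIA9', 'Rv2321c': 'P71891', 'Rv0619': 'Q79FY3'}
print(invalid_accessions(manual_uniprot_dict))  # {}

# A swapped key and value is caught because a gene ID is not a valid accession
print(invalid_accessions({'P9WIA9': 'Rv1755c'}))
```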
In [9]:
manual_uniprot_dict = {'Rv1755c': 'P9WIA9', 'Rv2321c': 'P71891', 'Rv0619': 'Q79FY3', 'Rv0618': 'Q79FY4', 'Rv2322c': 'P71890'}
my_gempro.manual_uniprot_mapping(manual_uniprot_dict)
my_gempro.df_uniprot_metadata.tail(4)
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Completed manual ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Out[9]:
uniprot | reviewed | gene_name | kegg | refseq | pfam | description | entry_date | entry_version | seq_date | seq_version | sequence_file | metadata_file | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene | |||||||||||||
Rv0619 | Q79FY3 | False | galTb | NaN | NaN | PF02744 | Probable galactose-1-phosphate uridylyltransfe... | 2017-07-05 | 78 | 2004-07-05 | 1 | Q79FY3.fasta | Q79FY3.xml |
Rv1755c | P9WIA9 | False | plcD | NaN | NaN | PF04185 | Phospholipase C 4 | 2017-07-05 | 18 | 2014-04-16 | 1 | P9WIA9.fasta | P9WIA9.xml |
Rv2321c | P71891 | False | rocD2 | mtv:RVBD_2321c | WP_003411956.1 | PF00202 | Probable ornithine aminotransferase (C-terminu... | 2017-07-05 | 116 | 1997-02-01 | 1 | P71891.fasta | P71891.xml |
Rv2322c | P71890 | False | rocD1 | mtv:RVBD_2322c | WP_003411957.1 | PF00202 | Probable ornithine aminotransferase (N-terminu... | 2017-06-07 | 117 | 1997-02-01 | 1 | P71890.fasta | P71890.xml |
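A typo in a manually entered UniProt ID is easy to miss. As a minimal sketch, the accessions can be checked against the accession pattern published in the UniProt help pages before calling manual_uniprot_mapping. The helper malformed_accessions is a hypothetical name, and the check only validates the format, not whether the entry exists.

```python
import re

# Accession format as described in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def malformed_accessions(gene_to_uniprot):
    """Return the subset of the mapping whose values do not look like accessions."""
    return {
        gene: acc
        for gene, acc in gene_to_uniprot.items()
        if not UNIPROT_ACCESSION.match(acc)
    }


# Same dictionary as in the notebook cell above.
manual_uniprot_dict = {'Rv1755c': 'P9WIA9', 'Rv2321c': 'P71891', 'Rv0619': 'Q79FY3',
                       'Rv0618': 'Q79FY4', 'Rv2322c': 'P71890'}
bad = malformed_accessions(manual_uniprot_dict)
print(bad)
```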
-
GEMPRO.
kegg_mapping_and_metadata
(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source] Map all genes in the model to KEGG IDs using the KEGG service.
- Steps:
- Download all metadata and sequence files to the sequences directory
- Create a KEGGProp object in the protein.sequences attribute
- Return a Pandas DataFrame of mapping results
Parameters: - kegg_organism_code (str) – The three letter KEGG code of your organism
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
In [10]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='mtu')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no metadata file available
[2018-02-05 18:14] [ssbio.core.protein] WARNING: Rv2233: representative sequence does not match mapped KEGG sequence.
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0618: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no metadata file available
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: 655/661: number of genes mapped to KEGG
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.
Missing KEGG mapping: ['Rv0618', 'Rv2233', 'Rv0619', 'Rv1755c', 'Rv2322c', 'Rv2321c']
Out[10]:
kegg | refseq | uniprot | pdbs | sequence_file | metadata_file | |
---|---|---|---|---|---|---|
gene | ||||||
Rv0013 | mtu:Rv0013 | YP_177615 | I6WX77 | NaN | mtu-Rv0013.faa | mtu-Rv0013.kegg |
Rv0032 | mtu:Rv0032 | NP_214546 | I6Y6Q7 | NaN | mtu-Rv0032.faa | mtu-Rv0032.kegg |
Rv0046c | mtu:Rv0046c | NP_214560 | I6X8D3 | 1GR0 | mtu-Rv0046c.faa | mtu-Rv0046c.kegg |
Rv0066c | mtu:Rv0066c | NP_214580 | L0T2B7 | NaN | mtu-Rv0066c.faa | mtu-Rv0066c.kegg |
Rv0069c | mtu:Rv0069c | NP_214583 | A0A089S0Y8 | NaN | mtu-Rv0069c.faa | mtu-Rv0069c.kegg |
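The custom_gene_mapping argument expects a dictionary whose keys are model gene IDs. As a minimal sketch, assuming the alternative locus tags follow the RVBD_ pattern visible in the kegg column of the UniProt metadata table (for example Rv2321c and mtv:RVBD_2321c), such a dictionary could be built as follows. The helper to_rvbd_locus is hypothetical, and whether the rewritten IDs resolve in KEGG is not verified here.

```python
def to_rvbd_locus(gene_id):
    """Rewrite 'Rv2321c' as 'RVBD_2321c' (illustrative rule inferred from the table)."""
    if not gene_id.startswith("Rv"):
        raise ValueError("unexpected gene ID format: {}".format(gene_id))
    return "RVBD_" + gene_id[2:]


# Genes reported as missing a KEGG mapping in the notebook output.
missing_kegg = ['Rv0618', 'Rv2233', 'Rv0619', 'Rv1755c', 'Rv2322c', 'Rv2321c']

# Keys must match the model gene IDs; values are the IDs to query instead.
custom_gene_mapping = {gene: to_rvbd_locus(gene) for gene in missing_kegg}
print(custom_gene_mapping)
```

The resulting dictionary could then be passed as custom_gene_mapping=custom_gene_mapping, together with a kegg_organism_code that matches the new IDs.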
-
GEMPRO.
uniprot_mapping_and_metadata
(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source] Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.
Parameters: - model_gene_source (str) – The database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:
  - Ensembl Genomes - ENSEMBLGENOME_ID (e.g. E. coli b-numbers)
  - Entrez Gene (GeneID) - P_ENTREZGENEID
  - RefSeq Protein - P_REFSEQ_AC
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
In [11]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='TUBERCULIST_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2018-02-05 18:14] [root] INFO: getUserAgent: Begin
[2018-02-05 18:14] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:14] [root] INFO: getUserAgent: End
[2018-02-05 18:15] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:15] [root] WARNING: Results seems empty...returning empty dictionary.
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 0/661: number of genes mapped to UniProt
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Missing UniProt mapping: ['Rv2178c', 'Rv2476c', 'Rv3607c', 'Rv1202', 'Rv1030', 'Rv1187', 'Rv2421c', 'Rv2247', 'Rv2701c', 'Rv1662', 'Rv1380', 'Rv2763c', 'Rv2435c', 'Rv0082', 'Rv0189c', 'Rv3757c', 'Rv1133c', 'Rv1257c', 'Rv1623c', 'Rv0958', 'Rv1162', 'Rv3581c', 'Rv1739c', 'Rv1086', 'Rv1077', 'Rv1625c', 'Rv3227', 'Rv2832c', 'Rv3308', 'Rv2858c', 'Rv2193', 'Rv2834c', 'Rv2995c', 'Rv1915', 'Rv3003c', 'Rv1538c', 'Rv2793c', 'Rv2152c', 'Rv3468c', 'Rv1485', 'Rv1079', 'Rv1552', 'Rv1161', 'Rv3307', 'Rv0768', 'Rv1843c', 'Rv1302', 'Rv0780', 'Rv2064', 'Rv2124c', 'Rv1908c', 'Rv1555', 'Rv2754c', 'Rv0346c', 'Rv2467', 'Rv2384', 'Rv3758c', 'Rv1338', 'Rv1311', 'Rv1599', 'Rv1589', 'Rv2930', 'Rv0564c', 'Rv2070c', 'Rv0143c', 'Rv0107c', 'Rv2157c', 'Rv2847c', 'Rv1416', 'Rv0147', 'Rv0432', 'Rv3152', 'Rv0098', 'Rv0548c', 'Rv1408', 'Rv2320c', 'Rv2859c', 'Rv3314c', 'Rv3331', 'Rv3393', 'Rv0509', 'Rv2062c', 'Rv1248c', 'Rv1895', 'Rv2870c', 'Rv1131', 'Rv1294', 'Rv0337c', 'Rv0650', 'Rv1001', 'Rv2211c', 'Rv2471', 'Rv2075c', 'Rv1622c', 'Rv3045', 'Rv0772', 'Rv1607', 'Rv2601', 'Rv2363', 'Rv2483c', 'Rv1553', 'Rv1559', 'Rv3464', 'Rv3010c', 'Rv2552c', 'Rv1031', 'Rv1905c', 'Rv3815c', 'Rv2935', 'Rv3469c', 'Rv1595', 'Rv1293', 'Rv1099c', 'Rv3398c', 'Rv2764c', 'Rv1604', 'Rv2399c', 'Rv1695', 'Rv0253', 'Rv0046c', 'Rv3634c', 'Rv2982c', 'Rv0414c', 'Rv3410c', 'Rv3042c', 'Rv1659', 'Rv0013', 'Rv3582c', 'Rv2051c', 'Rv1082', 'Rv1373', 'Rv1570', 'Rv2236c', 'Rv1093', 'Rv1445c', 'Rv3602c', 'Rv2502c', 'Rv1436', 'Rv1849', 'Rv2382c', 'Rv0952', 'Rv2139', 'Rv0956', 'Rv3283', 'Rv3310', 'Rv0373c', 'Rv2243', 'Rv1612', 'Rv1307', 'Rv3229c', 'Rv3265c', 'Rv2220', 'Rv2786c', 'Rv0382c', 'Rv2583c', 'Rv2678c', 'Rv0896', 'Rv1207', 'Rv0183', 'Rv3913', 'Rv2940c', 'Rv3153', 'Rv1731', 'Rv0973c', 'Rv3247c', 'Rv0728c', 'Rv2987c', 'Rv3846', 'Rv3150', 'Rv0234c', 'Rv3794', 'Rv3436c', 'Rv2427c', 'Rv3257c', 'Rv2539c', 'Rv2965c', 'Rv2156c', 'Rv3791', 'Rv1296', 'Rv2287', 'Rv2899c', 'Rv3275c', 'Rv1837c', 'Rv1285', 'Rv1602', 'Rv2245', 'Rv0423c', 'Rv2208', 
'Rv3801c', 'Rv0436c', 'Rv1492', 'Rv2780', 'Rv1658', 'Rv2674', 'Rv2329c', 'Rv3509c', 'Rv3290c', 'Rv0620', 'Rv1383', 'Rv1850', 'Rv0112', 'Rv2996c', 'Rv3396c', 'Rv2671', 'Rv1653', 'Rv2496c', 'Rv3330', 'Rv1663', 'Rv0820', 'Rv1448c', 'Rv2205c', 'Rv1447c', 'Rv2043c', 'Rv3806c', 'Rv2029c', 'Rv3792', 'Rv3313c', 'Rv1822', 'Rv3470c', 'Rv1609', 'Rv0103c', 'Rv3264c', 'Rv1484', 'Rv3534c', 'Rv1603', 'Rv0803', 'Rv2849c', 'Rv2458', 'Rv3341', 'Rv1617', 'Rv0570', 'Rv3609c', 'Rv1512', 'Rv1672c', 'Rv3759c', 'Rv2201', 'Rv0855', 'Rv3236c', 'Rv1940', 'Rv2163c', 'Rv0374c', 'Rv1563c', 'Rv0375c', 'Rv2455c', 'Rv1310', 'Rv1295', 'Rv2702', 'Rv3356c', 'Rv2445c', 'Rv3273', 'Rv1306', 'Rv3423c', 'Rv2589', 'Rv0317c', 'Rv3303c', 'Rv2439c', 'Rv1304', 'Rv3248c', 'Rv0858c', 'Rv0534c', 'Rv3158', 'Rv1605', 'Rv1127c', 'Rv1692', 'Rv1655', 'Rv2192c', 'Rv1391', 'Rv1122', 'Rv2454c', 'Rv2573', 'Rv2949c', 'Rv2379c', 'Rv1240', 'Rv1098c', 'Rv2196', 'Rv0555', 'Rv1350', 'Rv3302c', 'Rv1309', 'Rv0267', 'Rv0255c', 'Rv2194', 'Rv3051c', 'Rv0524', 'Rv2590', 'Rv2465c', 'Rv0137c', 'Rv1601', 'Rv0437c', 'Rv1554', 'Rv1318c', 'Rv3340', 'Rv1389', 'Rv2934', 'Rv0545c', 'Rv0558', 'Rv1005c', 'Rv1412', 'Rv2127', 'Rv2335', 'Rv2392', 'Rv2981c', 'Rv3709c', 'Rv3754', 'Rv2612c', 'Rv2931', 'Rv0557', 'Rv0252', 'Rv2928', 'Rv1347c', 'Rv1449c', 'Rv1600', 'Rv2207', 'Rv0536', 'Rv0295c', 'Rv3455c', 'Rv3411c', 'Rv0503c', 'Rv3708c', 'Rv0417', 'Rv1704c', 'Rv0489', 'Rv3795', 'Rv0553', 'Rv0573c', 'Rv2498c', 'Rv0470c', 'Rv0773c', 'Rv3048c', 'Rv2933', 'Rv3704c', 'Rv3818', 'Rv1737c', 'Rv0266c', 'Rv0644c', 'Rv2130c', 'Rv3280', 'Rv2400c', 'Rv2182c', 'Rv1018c', 'Rv2158c', 'Rv3149', 'Rv2289', 'Rv0126', 'Rv2967c', 'Rv3214', 'Rv2317', 'Rv3490', 'Rv2881c', 'Rv1916', 'Rv1438', 'Rv1348', 'Rv3808c', 'Rv0334', 'Rv1594', 'Rv3156', 'Rv2072c', 'Rv2344c', 'Rv2210c', 'Rv2281', 'Rv0993', 'Rv2316', 'Rv1406', 'Rv1613', 'Rv1201c', 'Rv0091', 'Rv3318', 'Rv3285', 'Rv0409', 'Rv0974c', 'Rv3316', 'Rv2121c', 'Rv2833c', 'Rv3309c', 'Rv1328', 'Rv3266c', 'Rv1213', 'Rv2697c', 
'Rv0946c', 'Rv1832', 'Rv3793', 'Rv1011', 'Rv3696c', 'Rv2246', 'Rv3858c', 'Rv1620c', 'Rv2386c', 'Rv1618', 'Rv0211', 'Rv1631', 'Rv0512', 'Rv1385', 'Rv1286', 'Rv2122c', 'Rv3319', 'Rv2249c', 'Rv2835c', 'Rv2941', 'Rv2605c', 'Rv2394', 'Rv3145', 'Rv0886', 'Rv3157', 'Rv3737', 'Rv1094', 'Rv1820', 'Rv0500', 'Rv1236', 'Rv1409', 'Rv2136c', 'Rv3800c', 'Rv2984', 'Rv0118c', 'Rv1121', 'Rv2932', 'Rv1826', 'Rv2233', 'Rv1511', 'Rv2540c', 'Rv3838c', 'Rv1305', 'Rv2200c', 'Rv3146', 'Rv0389', 'Rv0848', 'Rv3279c', 'Rv1529', 'Rv0533c', 'Rv0951', 'Rv0884c', 'Rv1308', 'Rv0482', 'Rv2259', 'Rv0727c', 'Rv0261c', 'Rv3281', 'Rv3756c', 'Rv1475c', 'Rv0462', 'Rv1551', 'Rv0805', 'Rv1656', 'Rv2398c', 'Rv1392', 'Rv1170', 'Rv2438c', 'Rv3379c', 'Rv1562c', 'Rv0066c', 'Rv2726c', 'Rv1844c', 'Rv1699', 'Rv2231c', 'Rv3002c', 'Rv3441c', 'Rv1029', 'Rv3001c', 'Rv1415', 'Rv3148', 'Rv2378c', 'Rv2958c', 'Rv2900c', 'Rv0505c', 'Rv3043c', 'Rv1200', 'Rv1185c', 'Rv2531c', 'Rv3772', 'Rv2947c', 'Rv1654', 'Rv0486', 'Rv2436', 'Rv2977c', 'Rv0511', 'Rv1848', 'Rv1237', 'Rv3601c', 'Rv0733', 'Rv3155', 'Rv0753c', 'Rv0694', 'Rv3215', 'Rv3710', 'Rv3606c', 'Rv0254c', 'Rv3315c', 'Rv3713', 'Rv0155', 'Rv0859', 'Rv3317', 'Rv2222c', 'Rv2495c', 'Rv1745c', 'Rv2497c', 'Rv0771', 'Rv1320c', 'Rv3842c', 'Rv1349', 'Rv2773c', 'Rv1164', 'Rv1381', 'Rv0363c', 'Rv2682c', 'Rv3113', 'Rv3154', 'Rv2860c', 'Rv3588c', 'Rv3608c', 'Rv3535c', 'Rv2318', 'Rv0777', 'Rv0729', 'Rv1714', 'Rv0645c', 'Rv0853c', 'Rv0642c', 'Rv0162c', 'Rv2992c', 'Rv3068c', 'Rv1621c', 'Rv0522', 'Rv1568', 'Rv3465', 'Rv2584c', 'Rv2746c', 'Rv0248c', 'Rv0542c', 'Rv2383c', 'Rv0391', 'Rv2332', 'Rv1264', 'Rv2065', 'Rv1652', 'Rv2291', 'Rv1902c', 'Rv1928c', 'Rv0478', 'Rv0467', 'Rv2713', 'Rv1017c', 'Rv0501', 'Rv0422c', 'Rv1023', 'Rv3624c', 'Rv2071c', 'Rv2964', 'Rv0357c', 'Rv2195', 'Rv1647', 'Rv3293', 'Rv2006', 'Rv3826', 'Rv0499', 'Rv0032', 'Rv2988c', 'Rv2155c', 'Rv3372', 'Rv1712', 'Rv1336', 'Rv0306', 'Rv2397c', 'Rv1611', 'Rv3859c', 'Rv0889c', 'Rv2677c', 'Rv0069c', 'Rv2945c', 'Rv1238', 'Rv0408', 
'Rv3276c', 'Rv2361c', 'Rv0510', 'Rv1569', 'Rv2610c', 'Rv0812', 'Rv2962c', 'Rv0649', 'Rv2381c', 'Rv2538c', 'Rv1872c', 'Rv3628', 'Rv0191', 'Rv2920c', 'Rv3777', 'Rv0808', 'Rv2447c', 'Rv2607', 'Rv2611c', 'Rv3790', 'Rv3147', 'Rv2380c', 'Rv2753c', 'Rv1323', 'Rv3255c', 'Rv2202c', 'Rv2443', 'Rv0247c', 'Rv1437', 'Rv3784', 'Rv0156', 'Rv1188', 'Rv0322', 'Rv2957', 'Rv0957', 'Rv3332', 'Rv0157', 'Rv3339c', 'Rv3809c', 'Rv2523c', 'Rv0468', 'Rv0936', 'Rv2501c', 'Rv3106', 'Rv0904c', 'Rv1878', 'Rv1451', 'Rv0794c', 'Rv1885c', 'Rv2524c', 'Rv2153c', 'Rv1606', 'Rv3432c', 'Rv0843', 'Rv0321', 'Rv2883c', 'Rv2334', 'Rv1315', 'Rv0819', 'Rv3667', 'Rv0809', 'Rv1596', 'Rv0084', 'Rv2066', 'Rv2391', 'Rv1319c', 'Rv1163', 'Rv2504c', 'Rv1239c', 'Rv3565', 'Rv2215', 'Rv2855', 'Rv0824c', 'Rv2537c', 'Rv1464', 'Rv2848c', 'Rv2225', 'Rv0070c', 'Rv2241', 'Rv1493', 'Rv1981c', 'Rv2503c', 'Rv2388c', 'Rv1092c', 'Rv1483', 'Rv0788', 'Rv0860']
Out[11]:
uniprot | reviewed | gene_name | kegg | refseq | pfam | description | entry_date | entry_version | seq_date | seq_version | sequence_file | metadata_file | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene | |||||||||||||
Rv0618 | Q79FY4 | False | galTa | mtv:RVBD_0618 | WP_003900189.1 | PF01087 | Probable galactose-1-phosphate uridylyltransfe... | 2017-07-05 | 87 | 2004-07-05 | 1 | Q79FY4.fasta | Q79FY4.xml |
Rv0619 | Q79FY3 | False | galTb | NaN | NaN | PF02744 | Probable galactose-1-phosphate uridylyltransfe... | 2017-07-05 | 78 | 2004-07-05 | 1 | Q79FY3.fasta | Q79FY3.xml |
Rv1755c | P9WIA9 | False | plcD | NaN | NaN | PF04185 | Phospholipase C 4 | 2017-07-05 | 18 | 2014-04-16 | 1 | P9WIA9.fasta | P9WIA9.xml |
Rv2321c | P71891 | False | rocD2 | mtv:RVBD_2321c | WP_003411956.1 | PF00202 | Probable ornithine aminotransferase (C-terminu... | 2017-07-05 | 116 | 1997-02-01 | 1 | P71891.fasta | P71891.xml |
Rv2322c | P71890 | False | rocD1 | mtv:RVBD_2322c | WP_003411957.1 | PF00202 | Probable ornithine aminotransferase (N-terminu... | 2017-06-07 | 117 | 1997-02-01 | 1 | P71890.fasta | P71890.xml |
-
GEMPRO.
set_representative_sequence
(force_rerun=False)[source] Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.
Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
Parameters: force_rerun (bool) – Set to True to recheck stored sequences
If you have mapped with both KEGG and UniProt mappers, then you can set a representative sequence for the gene using this function. If you used just one, this will just set that ID as representative.
- If any sequences or IDs were provided manually, these will be set as representative first.
- UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
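The precedence rules above can be sketched as a small function. This is an illustration of the documented rules only, not the code ssbio runs: pick_representative and the dictionary layout (an id plus a list of pdbs) are assumptions made for the example.

```python
def pick_representative(manual=None, uniprot=None, kegg=None):
    """Apply the documented precedence and return (source_name, record).

    Each argument is None or a dict such as {"id": ..., "pdbs": [...]}.
    """
    # Manually provided sequences or IDs always win.
    if manual is not None:
        return "manual", manual
    if uniprot is not None and kegg is not None:
        # KEGG wins only when it has PDBs attached and UniProt has none.
        if kegg.get("pdbs") and not uniprot.get("pdbs"):
            return "kegg", kegg
        return "uniprot", uniprot
    if uniprot is not None:
        return "uniprot", uniprot
    if kegg is not None:
        return "kegg", kegg
    return None, None


# Rv0046c-like case: the KEGG record lists a PDB, the UniProt record does not.
source, record = pick_representative(
    uniprot={"id": "I6X8D3", "pdbs": []},
    kegg={"id": "mtu:Rv0046c", "pdbs": ["1GR0"]},
)
print(source, record["id"])
```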
In [12]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 661/661: number of genes with a representative sequence
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.
Missing a representative sequence: []
Out[12]:
uniprot | kegg | pdbs | sequence_file | metadata_file | |
---|---|---|---|---|---|
gene | |||||
Rv0013 | I6WX77 | mtu:Rv0013 | NaN | mtu-Rv0013.faa | mtu-Rv0013.kegg |
Rv0032 | I6Y6Q7 | mtu:Rv0032 | NaN | mtu-Rv0032.faa | mtu-Rv0032.kegg |
Rv0046c | I6X8D3 | mtu:Rv0046c | 1GR0 | mtu-Rv0046c.faa | mtu-Rv0046c.kegg |
Rv0066c | L0T2B7 | mtu:Rv0066c | NaN | mtu-Rv0066c.faa | mtu-Rv0066c.kegg |
Rv0069c | A0A089S0Y8 | mtu:Rv0069c | NaN | mtu-Rv0069c.faa | mtu-Rv0069c.kegg |
Mapping representative sequence –> structure¶
These are the ways to map a sequence to a structure:
- Use the UniProt ID and its automatic mappings to the PDB
- BLAST the sequence to the PDB
- Make homology models
- Map to existing homology models
You can only use the first option to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both the UniProt mapping and BLAST for maximum coverage.
-
GEMPRO.
map_uniprot_to_pdb
(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source] Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.
The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html and returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the same, resolution.
Parameters: - seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
- outdir (str) – Output directory to cache JSON results of search
- force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns: A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type: list
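The ordering described above (coverage first, resolution as the tie-breaker) combined with a sequence identity cutoff can be sketched as follows. The record fields coverage, resolution and seq_identity, the placeholder PDB IDs, and the function rank_structures are assumptions for illustration; they do not mirror the actual JSON returned by the service or the PDBProp objects.

```python
def rank_structures(hits, seq_ident_cutoff=0.0):
    """Filter by identity, then sort by coverage (high first) and resolution (low first)."""
    kept = [h for h in hits if h["seq_identity"] >= seq_ident_cutoff]
    return sorted(kept, key=lambda h: (-h["coverage"], h["resolution"]))


# Made-up hits with placeholder IDs.
hits = [
    {"pdb_id": "aaaa", "coverage": 0.95, "resolution": 2.5, "seq_identity": 1.0},
    {"pdb_id": "bbbb", "coverage": 0.95, "resolution": 1.8, "seq_identity": 0.9},
    {"pdb_id": "cccc", "coverage": 1.00, "resolution": 3.0, "seq_identity": 0.2},
    {"pdb_id": "dddd", "coverage": 0.60, "resolution": 1.5, "seq_identity": 0.8},
]

ranked = [h["pdb_id"] for h in rank_structures(hits, seq_ident_cutoff=0.3)]
print(ranked)
```

With a cutoff of 0.3 the full-coverage but low-identity hit is dropped, and the two hits with equal coverage are ordered by resolution.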
In [13]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 18:15] [root] INFO: getUserAgent: Begin
[2018-02-05 18:15] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:15] [root] INFO: getUserAgent: End
[2018-02-05 18:15] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:15] [root] WARNING: Results seems empty...returning empty dictionary.
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 0/661: number of genes with at least one experimental structure
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
[2018-02-05 18:15] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[13]:
-
GEMPRO.
blast_seqs_to_pdb
(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source] BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).
Parameters: - seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
In [14]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 141: number of genes with additional structures added from BLAST
Out[14]:
pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar | |
---|---|---|---|---|---|---|---|---|
gene | ||||||||
Rv0046c | 1gr0 | A | 1861.0 | 0.0 | 1.000000 | 1.000000 | 367 | 367 |
Rv0066c | 5kvu | D | 3828.0 | 0.0 | 0.981208 | 0.981208 | 731 | 731 |
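The two cutoffs can be illustrated with the rows shown in the table. The sketch assumes that the reported identity is the number of identical residues divided by the query length, which is consistent with the table values (367/367 and, for an assumed 745-residue query, 731/745) but is an inference rather than a statement about how ssbio computes it. passes_blast_filters is a hypothetical helper.

```python
def passes_blast_filters(hit, query_length, seq_ident_cutoff=0.0, evalue=0.0001):
    """Keep a hit when its identity is high enough and its E-value is small enough."""
    identity = hit["hit_num_ident"] / query_length
    return identity >= seq_ident_cutoff and hit["hit_evalue"] <= evalue


# (hit, assumed query length) pairs based on the table rows.
hits = {
    "Rv0046c": ({"pdb_id": "1gr0", "hit_evalue": 0.0, "hit_num_ident": 367}, 367),
    "Rv0066c": ({"pdb_id": "5kvu", "hit_evalue": 0.0, "hit_num_ident": 731}, 745),
}

kept = sorted(
    gene
    for gene, (hit, length) in hits.items()
    if passes_blast_filters(hit, length, seq_ident_cutoff=0.9, evalue=0.00001)
)
print(kept)
```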
-
GEMPRO.
get_itasser_models
(homology_raw_dir, custom_itasser_name_mapping=None, outdir=None, force_rerun=False)[source] Copy generated I-TASSER models from a directory to the GEM-PRO directory.
Parameters: - homology_raw_dir (str) – Root directory of I-TASSER folders.
- custom_itasser_name_mapping (dict) – Use this if your I-TASSER folder names differ from your model gene names. Input a dict of {model_gene: ITASSER_folder}.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
In [15]:
tb_homology_dir = '/home/nathan/projects_archive/homology_models/MTUBERCULOSIS/'
##### EXAMPLE SPECIFIC CODE #####
# Needed to map to older IDs used in this example
import pandas as pd
import os.path as op
old_gene_to_homology = pd.read_csv(op.join(tb_homology_dir, 'data/161031-old_gene_to_uniprot_mapping.csv'))
gene_to_uniprot = old_gene_to_homology.set_index('m_gene').to_dict()['u_uniprot_acc']
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'), custom_itasser_name_mapping=gene_to_uniprot)
### END EXAMPLE SPECIFIC CODE ###
# Organizing I-TASSER homology models
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'))
my_gempro.df_homology_models.head()
[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Completed copying of 435 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.
[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Completed copying of 9 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.
Out[15]:
id | structure_file | model_date | difficulty | top_template_pdb | top_template_chain | c_score | tm_score | tm_score_err | rmsd | rmsd_err | |
---|---|---|---|---|---|---|---|---|---|---|---|
gene | |||||||||||
Rv0013 | P9WN35 | P9WN35_model1.pdb | 2018-02-06 | easy | 1i7s | B | -0.53 | 0.65 | 0.13 | 6.8 | 4.0 |
Rv0032 | P9WQ85 | P9WQ85_model1.pdb | 2018-02-06 | easy | 3a2b | A | -2.89 | 0.39 | 0.13 | 15.7 | 3.3 |
Rv0066c | O53611 | O53611_model1.pdb | 2018-02-06 | easy | 1itw | A | 1.91 | 0.99 | 0.04 | 4.1 | 2.8 |
Rv0069c | P9WGT5 | P9WGT5_model1.pdb | 2018-02-06 | easy | 4rqo | A | 1.18 | 0.88 | 0.07 | 4.6 | 3.0 |
Rv0070c | P9WGI7 | P9WGI7_model1.pdb | 2018-02-06 | easy | 3h7f | B | 1.80 | 0.97 | 0.05 | 3.3 | 2.3 |
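The custom_itasser_name_mapping dictionary in the cell above is built with pandas. As a minimal sketch of the same idea with only the standard library, assuming a CSV with the m_gene and u_uniprot_acc columns used in that cell, the mapping could be read like this. The inline CSV text and load_name_mapping are illustrative; the two rows reuse gene and model IDs shown in the output table.

```python
import csv
import io

# Stand-in for the mapping file; a real file would be opened with open(path, newline="").
csv_text = "m_gene,u_uniprot_acc\nRv0013,P9WN35\nRv0032,P9WQ85\n"


def load_name_mapping(handle, key_col="m_gene", value_col="u_uniprot_acc"):
    """Read {model_gene: I-TASSER folder name} from a CSV file handle."""
    return {row[key_col]: row[value_col] for row in csv.DictReader(handle)}


gene_to_folder = load_name_mapping(io.StringIO(csv_text))
print(gene_to_folder)
```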
-
GEMPRO.
get_manual_homology_models
(input_dict, outdir=None, clean=True, force_rerun=False)[source] Copy homology models to the GEM-PRO project.
Requires an input of a dictionary formatted like so:
{
    model_gene: {
        homology_model_id1: {
            'model_file': '/path/to/homology/model.pdb',
            'file_type': 'pdb',
            'additional_info': info_value
        },
        homology_model_id2: {
            'model_file': '/path/to/homology/model.pdb',
            'file_type': 'pdb'
        }
    }
}
Parameters: - input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- clean (bool) – If homology files should be cleaned and saved as a new PDB file
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
In [16]:
homology_model_dict = {}
my_gempro.get_manual_homology_models(homology_model_dict)
[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Updated homology model information for 0 genes.
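The cell above passes an empty dictionary. As a sketch of how a populated input_dict could be assembled in the nested format described for get_manual_homology_models, the following groups a flat list of models by gene. build_homology_input, the gene names and the file paths are all placeholders.

```python
def build_homology_input(entries, file_type="pdb"):
    """Group (model_gene, model_id, model_file) tuples into the nested input format."""
    input_dict = {}
    for gene, model_id, model_file in entries:
        input_dict.setdefault(gene, {})[model_id] = {
            "model_file": model_file,
            "file_type": file_type,
        }
    return input_dict


# Placeholder genes and paths; replace with real model locations.
entries = [
    ("geneA", "geneA_model1", "/path/to/homology/geneA_model1.pdb"),
    ("geneA", "geneA_model2", "/path/to/homology/geneA_model2.pdb"),
    ("geneB", "geneB_model1", "/path/to/homology/geneB_model1.pdb"),
]
homology_model_dict = build_homology_input(entries)
print(sorted(homology_model_dict))
```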
Downloading and ranking structures¶
-
GEMPRO.
pdb_downloader_and_metadata
(outdir=None, pdb_file_type=None, force_rerun=False)[source] Download ALL mapped experimental structures to each protein’s structures directory.
Parameters: - outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
In [ ]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
-
GEMPRO.
set_representative_structure
(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source] Set the representative structure for each protein, chosen from the structures in its structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.
Parameters: - seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Fraction (in decimal form) of the total length of the reference sequence, split evenly between the two termini, which will be ignored when checking for modifications. Example: if 0.1 and the reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
- clean (bool) – If structures should be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
Todo
- Remedy large structure representative setting
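The allow_missing_on_termini example (0.1 on a 100 AA sequence leaves residues 5 to 95) can be reproduced with a few lines of arithmetic. This is a sketch of the documented example only: checked_window is a hypothetical helper, and the truncation to whole residues is an assumption, not necessarily how ssbio rounds.

```python
def checked_window(seq_length, allow_missing_on_termini):
    """Return (first, last) residue positions that are still checked for modifications."""
    # Half of the ignored fraction is taken from each terminus.
    tail = int(seq_length * allow_missing_on_termini / 2)
    return tail, seq_length - tail


print(checked_window(100, 0.1))   # documented example
print(checked_window(367, 0.2))   # default fraction on a 367 AA sequence
```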
In [17]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2018-02-05 18:16] [ssbio.core.protein] WARNING: Rv0234c: no structures meet quality checks
[2018-02-05 18:16] [ssbio.core.protein] WARNING: Rv0505c: no structures meet quality checks
[2018-02-05 18:17] [ssbio.core.protein] WARNING: Rv2987c: no structures meet quality checks
[2018-02-05 18:18] [ssbio.core.protein] WARNING: Rv2498c: no structures meet quality checks
[2018-02-05 18:18] [ssbio.core.protein] WARNING: Rv3601c: no structures meet quality checks
[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: 553/661: number of genes with a representative structure
[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[17]:
id | is_experimental | file_type | structure_file | |
---|---|---|---|---|
gene | ||||
Rv0013 | REP-P9WN35 | False | pdb | P9WN35_model1-X_clean.pdb |
Rv0032 | REP-P9WQ85 | False | pdb | P9WQ85_model1-X_clean.pdb |
Rv0046c | REP-1gr0 | True | pdb | 1gr0-A_clean.pdb |
Rv0066c | REP-5kvu | True | pdb | 5kvu-A_clean.pdb |
Rv0069c | REP-P9WGT5 | False | pdb | P9WGT5_model1-X_clean.pdb |
In [18]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure.get_dict()
Out[18]:
<StructProp REP-2d1f at 0x7fdec937e4a8>
Out[18]:
{'_structure_dir': '/tmp/mtuberculosis_gp/genes/Rv1295/Rv1295_protein/structures',
'chains': [<ChainProp A at 0x7fdec8655630>],
'date': None,
'description': 'Threonine synthase (E.C.4.2.3.1)',
'file_type': 'pdb',
'id': 'REP-2d1f',
'is_experimental': True,
'mapped_chains': ['A'],
'notes': {},
'original_structure_id': '2d1f',
'resolution': 2.5,
'structure_file': '2d1f-A_clean.pdb',
'taxonomy_name': 'Mycobacterium tuberculosis'}
Creating homology models¶
For those proteins with no representative structure, we can create homology models. ssbio contains some built-in functions for easily running I-TASSER locally or on machines with SLURM (e.g. on NERSC) or Torque job scheduling.
You can load in I-TASSER models once they complete using get_itasser_models.
In [19]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_modeling('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2934: I-TASSER modeling will not run as sequence length (1827) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2932: I-TASSER modeling will not run as sequence length (1538) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2933: I-TASSER modeling will not run as sequence length (2188) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2931: I-TASSER modeling will not run as sequence length (1876) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2380c: I-TASSER modeling will not run as sequence length (1682) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3859c: I-TASSER modeling will not run as sequence length (1527) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2476c: I-TASSER modeling will not run as sequence length (1624) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3800c: I-TASSER modeling will not run as sequence length (1733) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv0107c: I-TASSER modeling will not run as sequence length (1632) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2940c: I-TASSER modeling will not run as sequence length (2111) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv1662: I-TASSER modeling will not run as sequence length (1602) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2524c: I-TASSER modeling will not run as sequence length (3069) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 108 genes in folder /tmp/mtuberculosis_gp/data/homology_models
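The warnings above come from a sequence-length check: only sequences whose length falls in the range [10, 1500] are prepared for modeling. A minimal sketch of such a check, useful for anticipating which genes will be skipped, is shown below. The helper name split_by_itasser_length and the example sequences are hypothetical and not part of ssbio; the bounds are taken from the log messages, and treating them as inclusive is an assumption.

```python
# Hypothetical helper (not part of ssbio): predict which sequences would be
# skipped by the length check reported in the log above.
MIN_LEN, MAX_LEN = 10, 1500  # bounds taken from the warning messages


def split_by_itasser_length(gene_to_seq, min_len=MIN_LEN, max_len=MAX_LEN):
    """Return (eligible, skipped) lists of gene IDs based on sequence length."""
    eligible, skipped = [], []
    for gene_id, seq in gene_to_seq.items():
        # Assumption: both bounds are inclusive
        if min_len <= len(seq) <= max_len:
            eligible.append(gene_id)
        else:
            skipped.append(gene_id)
    return eligible, skipped


# Made-up sequences for illustration only
example = {'geneA': 'M' * 300, 'geneB': 'M' * 1827, 'geneC': 'M' * 5}
eligible, skipped = split_by_itasser_length(example)
print(eligible)
print(skipped)
```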
Saving your GEM-PRO¶
Finally, you can save your GEM-PRO as a JSON or pickle file, so you don’t have to run the pipeline again.
Most functions check for existing results saved as files when they are rerun, so repeating them is fast. The only step that takes a long time is setting the representative structure, as each structure is rechecked and cleaned. This is where saving helps!
-
GEMPRO.
save_pickle
(outfile, protocol=2) Save the object as a pickle file
Parameters: - outfile (str) – Filename
- protocol (int) – Pickle protocol to use. Default is 2 to remain compatible with Python 2
Returns: Path to pickle file
Return type: str
In [20]:
import os.path as op
my_gempro.save_pickle(op.join(my_gempro.model_dir, '{}.pckl'.format(my_gempro.id)))
-
GEMPRO.
save_json
(outfile, compression=False) Save the object as a JSON file using json_tricks
In [21]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2018-02-05 18:18] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:18] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: mtuberculosis_gp) to /tmp/mtuberculosis_gp/model/mtuberculosis_gp.json
Loading a saved GEM-PRO¶
In [ ]:
# Loading a pickle file
import pickle
with open('/tmp/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.pckl', 'rb') as f:
    my_saved_gempro = pickle.load(f)
In [ ]:
# Loading a JSON file
import ssbio.core.io
my_saved_gempro = ssbio.core.io.load_json('/tmp/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.json', decompression=False)
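To show what the save and load steps amount to, here is a small round trip using only the standard library. The dictionary below is a hypothetical stand-in for a GEM-PRO object, not the real GEMPRO class, and the file name is made up; protocol=2 mirrors the documented default of save_pickle.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a GEM-PRO object (a plain dict, not the ssbio class)
project = {'id': 'my_project', 'genes': ['geneA', 'geneB']}

outdir = tempfile.mkdtemp()
outfile = os.path.join(outdir, '{}.pckl'.format(project['id']))

# Save with protocol 2, the documented default of save_pickle
with open(outfile, 'wb') as f:
    pickle.dump(project, f, protocol=2)

# Load it back later without rerunning any analysis
with open(outfile, 'rb') as f:
    restored = pickle.load(f)

print(restored == project)
```

Note that a pickle file should only be loaded if it comes from a source you trust, since unpickling can execute arbitrary code.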
Features¶
- Automated mapping of gene/protein sequence IDs
- Consolidating sequence IDs and setting a representative protein sequence
- Mapping of representative protein sequence –> 3D structures
- Preparation of sequences for homology modeling (currently for I-TASSER)
- Running QC/QA on structures and setting a representative protein structure
- Automation of protein sequence and structure property calculation
- Creation of Pandas DataFrame summaries directly from downloaded or calculated metadata
COBRApy model additions¶

Let’s take a look at a GEM loaded with ssbio and what additions exist compared to a GEM loaded with COBRApy. In the figure above, the text in grey indicates objects that exist in a COBRApy Model object, and in blue, the attributes added when loading with ssbio. Please note that the Complex object is still under development and currently non-functional.
COBRApy¶
Under construction…
ssbio¶
Under construction…
Use cases¶

When would you create or use a GEM-PRO? The added context of manually curated network interactions to protein structures enables different scales of analyses. For instance…
From the “top-down”:¶
- Global non-variant properties of protein structures such as the distribution of fold types can be compared within or between organisms [1], [2], [3], elucidating adaptations that are reflected in the structural proteome.
- Multi-strain modelling techniques ([10], [11], [12]) would allow strain-specific changes to be investigated at the molecular level, potentially explaining phenotypic differences or strain adaptations to certain environments.
File organization¶
Files such as sequences, structures, alignment files, and property calculation outputs can optionally be cached on a user’s disk to minimize calls to web services, limit recalculations, and provide direct inputs to common sequence and structure algorithms which often require local copies of the data. For a GEM-PRO project, files are organized in the following fashion once a root directory and project name are set:
<ROOT_DIR>
└── <PROJECT_NAME>
    ├── data                        # General directory for pipeline outputs
    ├── model                       # SBML and GEM-PRO models are stored in this directory
    └── genes                       # Per gene information
        └── <gene_id1>              # Specific gene directory
            └── <protein_id1>       # Protein directory
                ├── sequences       # Protein sequence files, alignments, etc.
                └── structures      # Protein structure files, calculations, etc.
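The layout described above can be reproduced by hand with a few lines of standard-library code, which may help when inspecting or pre-populating a project folder. This is a sketch only: make_project_dirs is a hypothetical helper, not an ssbio function, and the gene and protein IDs are placeholders.

```python
import os
import tempfile


def make_project_dirs(root_dir, project_name, gene_to_protein):
    """Create the folder layout described above and return the project path."""
    base_dir = os.path.join(root_dir, project_name)
    for sub in ('data', 'model', 'genes'):
        os.makedirs(os.path.join(base_dir, sub), exist_ok=True)
    # One folder per gene, one protein folder inside it
    for gene_id, protein_id in gene_to_protein.items():
        protein_dir = os.path.join(base_dir, 'genes', gene_id, protein_id)
        for sub in ('sequences', 'structures'):
            os.makedirs(os.path.join(protein_dir, sub), exist_ok=True)
    return base_dir


root = tempfile.mkdtemp()
base = make_project_dirs(root, 'my_project', {'geneA': 'proteinA'})
for dirpath, dirnames, filenames in sorted(os.walk(base)):
    print(os.path.relpath(dirpath, root))
```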
API¶
GEMPRO¶
-
class
ssbio.pipeline.gempro.
GEMPRO
(gem_name, root_dir=None, pdb_file_type='mmtf', gem=None, gem_file_path=None, gem_file_type=None, genes_list=None, genes_and_sequences=None, genome_path=None, write_protein_fasta_files=True, description=None, custom_spont_id=None)[source]¶ Generic class to represent all information for a GEM-PRO project.
Initialize the GEM-PRO project with a genome-scale model, a list of genes, or a dict of genes and sequences. Specify the name of your project, along with the root directory where a folder with that name will be created.
Main methods provided are:
Automated mapping of sequence IDs
- With KEGG mapper
- With UniProt mapper
- Allowing manual gene ID –> protein sequence entry
- Allowing manual gene ID –> UniProt ID
Consolidating sequence IDs and setting a representative sequence
- Currently these are set based on available PDB IDs
Mapping of representative sequence –> structures
- With UniProt –> ranking of PDB structures
- BLAST representative sequence –> PDB database
Preparation of files for homology modeling (currently for I-TASSER)
- Mapping to existing models
- Preparation for running I-TASSER
- Parsing I-TASSER runs
Running QC/QA on structures and setting a representative structure
- Various cutoffs (mutations, insertions, deletions) can be set to filter structures
Automation of protein sequence and structure property calculation
Creation of Pandas DataFrame summaries directly from downloaded metadata
Parameters: - gem_name (str) – The name of your GEM or just your project in general. This will be the name of the main folder that is created in root_dir.
- root_dir (str) – Path to where the folder named after gem_name will be created. If not provided, directories will not be created and output directories need to be specified for some steps.
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- gem (Model) – COBRApy Model object
- gem_file_path (str) – Path to GEM file
- gem_file_type (str) – GEM model type - sbml (or xml), mat, or json formats
- genes_list (list) – List of gene IDs that you want to map
- genes_and_sequences (dict) – Dictionary of gene IDs and their amino acid sequence strings
- genome_path (str) – FASTA file of all protein sequences
- write_protein_fasta_files (bool) – If individual protein FASTA files should be written out
- description (str) – Description string of your project
- custom_spont_id (str) – ID of spontaneous genes in a COBRA model which will be ignored for analysis
-
add_gene_ids
(genes_list)[source]¶ Add gene IDs manually into the GEM-PRO project.
Parameters: genes_list (list) – List of gene IDs as strings.
-
base_dir
¶ str – GEM-PRO project folder.
-
blast_seqs_to_pdb
(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]¶ BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).
Parameters: - seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
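As an illustration of how the evalue and seq_ident_cutoff arguments act as filters, the sketch below applies both cutoffs to a list of made-up hits. The function filter_hits and the dictionary keys are hypothetical and do not reflect how ssbio stores BLAST results; keeping hits at or below the E-value cutoff and at or above the identity cutoff is an assumption.

```python
def filter_hits(hits, evalue=0.0001, seq_ident_cutoff=0.0):
    """Keep hits with E-value <= evalue and identity >= seq_ident_cutoff."""
    return [h for h in hits
            if h['evalue'] <= evalue and h['identity'] >= seq_ident_cutoff]


# Made-up hits for illustration only (identity in decimal form)
hits = [
    {'hit_id': 'hit1', 'evalue': 1e-50, 'identity': 0.98},
    {'hit_id': 'hit2', 'evalue': 1e-5, 'identity': 0.35},
    {'hit_id': 'hit3', 'evalue': 0.0005, 'identity': 0.60},
]

stringent = filter_hits(hits, evalue=0.0001, seq_ident_cutoff=0.5)
liberal = filter_hits(hits, evalue=0.001)
print([h['hit_id'] for h in stringent])
print([h['hit_id'] for h in liberal])
```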
-
custom_spont_id
= None¶ str – ID of spontaneous genes in a COBRA model which will be ignored for analysis
-
data_dir
¶ str – Directory where all data are stored.
-
df_homology_models
¶ DataFrame – Get a dataframe of I-TASSER homology model results
-
df_kegg_metadata
¶ DataFrame – Pandas DataFrame of KEGG metadata per protein.
-
df_pdb_blast
¶ DataFrame – Get a dataframe of PDB BLAST results
-
df_pdb_metadata
¶ DataFrame – Get a dataframe of PDB metadata (PDBs have to be downloaded first).
-
df_pdb_ranking
¶ DataFrame – Get a dataframe of UniProt -> best structure in PDB results
-
df_proteins
¶ DataFrame – Get a summary dataframe of all proteins in the project.
-
df_representative_sequences
¶ DataFrame – Pandas DataFrame of representative sequence information per protein.
-
df_representative_structures
¶ DataFrame – Get a dataframe of representative protein structure information.
-
df_uniprot_metadata
¶ DataFrame – Pandas DataFrame of UniProt metadata per protein.
-
find_disulfide_bridges
(representatives_only=True)[source]¶ Run Biopython’s disulfide bridge finder and store found bridges.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.annotations['SSBOND-biopython']
Parameters: representatives_only (bool) – If analysis should only be run on the representative structure
-
find_disulfide_bridges_parallelize
(sc, representatives_only=True)[source]¶ Run Biopython’s disulfide bridge finder and store found bridges.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.annotations['SSBOND-biopython']
Parameters: representatives_only (bool) – If analysis should only be run on the representative structure
-
functional_genes
¶ DictList – All genes with a representative protein structure.
-
genes
= None¶ DictList – All protein-coding genes in this GEM-PRO project
-
genes_dir
¶ str – Directory where all gene specific information is stored.
-
genes_with_a_representative_sequence
¶ DictList – All genes with a representative sequence.
-
genes_with_a_representative_structure
¶ DictList – All genes with a representative protein structure.
-
genes_with_experimental_structures
¶ DictList – All genes that have at least one experimental structure.
-
genes_with_homology_models
¶ DictList – All genes that have at least one homology model.
-
genes_with_structures
¶ DictList – All genes with any mapped protein structures.
-
genome_path
= None¶ str – Simple link to the filepath of the FASTA file containing all protein sequences
-
get_dssp_annotations
(representatives_only=True, force_rerun=False)[source]¶ Run DSSP on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-dssp']
Parameters: - representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_dssp_annotations_parallelize
(sc, representatives_only=True, force_rerun=False)[source]¶ Run DSSP on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-dssp']
Parameters: - representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_freesasa_annotations
(include_hetatms=False, representatives_only=True, force_rerun=False)[source]¶ Run freesasa on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-freesasa']
Parameters: - include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
- representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_freesasa_annotations_parallelize
(sc, include_hetatms=False, representatives_only=True, force_rerun=False)[source]¶ Run freesasa on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-freesasa']
Parameters: - include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
- representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_itasser_models
(homology_raw_dir, custom_itasser_name_mapping=None, outdir=None, force_rerun=False)[source]¶ Copy generated I-TASSER models from a directory to the GEM-PRO directory.
Parameters: - homology_raw_dir (str) – Root directory of I-TASSER folders.
- custom_itasser_name_mapping (dict) – Use this if your I-TASSER folder names differ from your model gene names. Input a dict of {model_gene: ITASSER_folder}.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
-
get_manual_homology_models
(input_dict, outdir=None, clean=True, force_rerun=False)[source]¶ Copy homology models to the GEM-PRO project.
Requires an input of a dictionary formatted like so:
{
    model_gene: {
        homology_model_id1: {
            'model_file': '/path/to/homology/model.pdb',
            'file_type': 'pdb',
            'additional_info': info_value
        },
        homology_model_id2: {
            'model_file': '/path/to/homology/model.pdb',
            'file_type': 'pdb'
        }
    }
}
Parameters: - input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- clean (bool) – If homology files should be cleaned and saved as a new PDB file
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
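Because the input dictionary is nested, it can be worth checking it before passing it to get_manual_homology_models. The sketch below builds a dictionary in the documented shape and reports entries with missing keys. The checker check_homology_input is hypothetical and not part of ssbio, and treating model_file and file_type as required is an assumption based on the example above, where both keys appear in every entry.

```python
# Assumption: these two keys are expected for every homology model entry
REQUIRED_KEYS = ('model_file', 'file_type')


def check_homology_input(input_dict):
    """Return a list of (gene, model_id, missing_key) problems; empty if OK."""
    problems = []
    for gene, models in input_dict.items():
        for model_id, info in models.items():
            for key in REQUIRED_KEYS:
                if key not in info:
                    problems.append((gene, model_id, key))
    return problems


# Placeholder gene and model IDs for illustration only
input_dict = {
    'geneA': {
        'modelA1': {
            'model_file': '/path/to/homology/model.pdb',
            'file_type': 'pdb',
            'additional_info': 'any extra value',
        },
        'modelA2': {
            'model_file': '/path/to/homology/model.pdb',
        },
    }
}

problems = check_homology_input(input_dict)
print(problems)
```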
-
get_msms_annotations
(representatives_only=True, force_rerun=False)[source]¶ Run MSMS on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-msms']
Parameters: - representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_msms_annotations_parallelize
(sc, representatives_only=True, force_rerun=False)[source]¶ Run MSMS on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-msms']
Parameters: - representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_scratch_predictions
(path_to_scratch, results_dir, scratch_basename='scratch', num_cores=1, exposed_buried_cutoff=25, custom_gene_mapping=None)[source]¶ Run and parse SCRATCH results to predict secondary structure and solvent accessibility. Annotations are stored in the protein’s representative sequence at .annotations and .letter_annotations
Parameters: - path_to_scratch (str) – Path to SCRATCH executable
- results_dir (str) – Path to SCRATCH results folder, which will have the files (scratch.ss, scratch.ss8, scratch.acc, scratch.acc20)
- scratch_basename (str) – Basename of the SCRATCH results (‘scratch’ is default)
- num_cores (int) – Number of cores to use to parallelize SCRATCH run
- exposed_buried_cutoff (int) – Cutoff of exposed/buried for the acc20 predictions
- custom_gene_mapping (dict) – Default parsing of SCRATCH output files is to look for the model gene IDs. If your output files contain IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
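The exposed_buried_cutoff parameter turns per-residue accessibility predictions into a two-state label. A minimal sketch of that thresholding is given below; classify_exposure is a hypothetical helper, the input values are made up, and labelling a residue as exposed only when its value is strictly above the cutoff is an assumption rather than documented ssbio behaviour.

```python
def classify_exposure(acc20_values, exposed_buried_cutoff=25):
    """Label each residue 'exposed' or 'buried' from a relative accessibility value."""
    # Assumption: values strictly above the cutoff count as exposed
    return ['exposed' if value > exposed_buried_cutoff else 'buried'
            for value in acc20_values]


# Made-up per-residue relative solvent accessibility predictions (percent)
predictions = [0, 25, 30, 95]
labels = classify_exposure(predictions)
print(labels)
```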
-
get_sequence_properties
(representatives_only=True)[source]¶ Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of all protein sequences. Results are stored in the protein’s respective SeqProp objects at
.annotations
Parameters: representatives_only (bool) – If analysis should only be run on the representative sequences
-
get_tmhmm_predictions
(tmhmm_results, custom_gene_mapping=None)[source]¶ Parse TMHMM results and store in the representative sequences.
This is a basic function to parse pre-run TMHMM results. Run TMHMM from the web service (http://www.cbs.dtu.dk/services/TMHMM/) by doing the following:
- Write all representative sequences in the GEM-PRO using the function write_representative_sequences_file
- Upload the file to http://www.cbs.dtu.dk/services/TMHMM/ and choose “Extensive, no graphics” as the output
- Copy and paste the results (ignoring the top header and above “HELP with output formats”) into a file and save it
- Run this function on that file
Parameters: - tmhmm_results (str) – Path to TMHMM results (long format)
- custom_gene_mapping (dict) – Default parsing of TMHMM output is to look for the model gene IDs. If your output file contains IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
-
kegg_mapping_and_metadata
(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶ Map all genes in the model to KEGG IDs using the KEGG service.
- Steps:
- Download all metadata and sequence files in the sequences directory
- Creates a KEGGProp object in the protein.sequences attribute
- Returns a Pandas DataFrame of mapping results
Parameters: - kegg_organism_code (str) – The three letter KEGG code of your organism
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
-
kegg_mapping_and_metadata_parallelize
(sc, kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶ Map all genes in the model to KEGG IDs using the KEGG service.
- Steps:
- Download all metadata and sequence files in the sequences directory
- Creates a KEGGProp object in the protein.sequences attribute
- Returns a Pandas DataFrame of mapping results
Parameters: - sc (SparkContext) – Spark Context to parallelize this function
- kegg_organism_code (str) – The three letter KEGG code of your organism
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
-
load_cobra_model
(model)[source]¶ Load a COBRApy Model object into the GEM-PRO project.
Parameters: model (Model) – COBRApy Model object
-
manual_seq_mapping
(gene_to_seq_dict, outdir=None, write_fasta_files=True, set_as_representative=True)[source]¶ Read a manual input dictionary of model gene IDs –> protein sequences. By default sets them as representative.
Parameters: - gene_to_seq_dict (dict) – Mapping of gene IDs to their protein sequence strings
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- write_fasta_files (bool) – If individual protein FASTA files should be written out
- set_as_representative (bool) – If mapped sequences should be set as representative
-
manual_uniprot_mapping
(gene_to_uniprot_dict, outdir=None, set_as_representative=True)[source]¶ Read a manual dictionary of model gene IDs –> UniProt IDs. By default sets them as representative.
This allows for mapping of the missing genes, or overriding of automatic mappings.
Input a dictionary of:
{ <gene_id1>: <uniprot_id1>, <gene_id2>: <uniprot_id2>, }
Parameters: - gene_to_uniprot_dict – Dictionary of mappings as shown above
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
-
map_uniprot_to_pdb
(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]¶ Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s
sequences
folder. The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html. It returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the coverage is the same, by resolution.
Parameters: - seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
- outdir (str) – Output directory to cache JSON results of search
- force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns: A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type: list
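The ordering described above (coverage first, resolution as the tie-breaker) can be expressed as a two-key sort. The sketch below uses made-up entries with hypothetical keys, not the actual fields of the API response or of PDBProp objects; ranking higher coverage first and lower resolution values first is an assumption.

```python
def rank_structures(entries):
    """Sort by coverage (highest first), then by resolution (lowest value first)."""
    return sorted(entries, key=lambda e: (-e['coverage'], e['resolution']))


# Made-up entries for illustration only
entries = [
    {'structure_id': 'structA', 'coverage': 0.80, 'resolution': 1.5},
    {'structure_id': 'structB', 'coverage': 1.00, 'resolution': 2.8},
    {'structure_id': 'structC', 'coverage': 1.00, 'resolution': 1.9},
]

ranked = rank_structures(entries)
print([e['structure_id'] for e in ranked])
```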
-
missing_homology_models
¶ list – List of genes with no mapping to any homology models.
-
missing_kegg_mapping
¶ list – List of genes with no mapping to KEGG.
-
missing_pdb_structures
¶ list – List of genes with no mapping to any experimental PDB structure.
-
missing_representative_sequence
¶ list – List of genes with no mapping to a representative sequence.
-
missing_representative_structure
¶ list – List of genes with no mapping to a representative structure.
-
missing_uniprot_mapping
¶ list – List of genes with no mapping to UniProt.
-
model
= None¶ Model – COBRApy model object
-
model_dir
¶ str – Directory where original GEMs and GEM-related files are stored.
-
pdb_downloader_and_metadata
(outdir=None, pdb_file_type=None, force_rerun=False)[source]¶ Download ALL mapped experimental structures to each protein’s structures directory.
Parameters: - outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
-
pdb_file_type
= None¶ str – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
-
prep_itasser_modeling
(itasser_installation, itlib_folder, runtype, create_in_dir=None, execute_from_dir=None, all_genes=False, print_exec=False, **kwargs)[source]¶ Prepare to run I-TASSER homology modeling for genes without structures, or all genes.
Parameters: - itasser_installation (str) – Path to I-TASSER folder, e.g. ~/software/I-TASSER4.4
- itlib_folder (str) – Path to ITLIB folder, e.g. ~/software/ITLIB
- runtype – How you will be running I-TASSER - local, slurm, or torque
- create_in_dir (str) – Local directory where folders will be created
- execute_from_dir (str) – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
- all_genes (bool) – If all genes should be prepped, or only those without any mapped structures
- print_exec (bool) – If the execution statement should be printed to run modelling
Todo
- Document kwargs - extra options for I-TASSER, SLURM or Torque execution
- Allow modeling of any sequence in sequences attribute, select by ID or provide SeqProp?
-
root_dir
¶ str – Directory where the GEM-PRO project folder, named after the attribute base_dir, is located.
-
set_representative_sequence
(force_rerun=False)[source]¶ Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.
Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
Parameters: force_rerun (bool) – Set to True to recheck stored sequences
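The precedence rules stated above can be written out as a short decision function. This is a sketch of the described logic only, not the ssbio implementation: pick_representative is a hypothetical function, and each candidate is represented by a made-up dictionary with an 'id' and a list of associated 'pdbs'.

```python
def pick_representative(manual=None, uniprot=None, kegg=None):
    """Choose one candidate following the precedence described above."""
    # Manually set sequences override everything else
    if manual is not None:
        return manual
    if uniprot is not None and kegg is not None:
        # UniProt wins, except when only the KEGG mapping has PDBs
        if kegg['pdbs'] and not uniprot['pdbs']:
            return kegg
        return uniprot
    # Only one (or neither) of the automatic mappings is available
    return uniprot if uniprot is not None else kegg


# Made-up candidates for illustration only
uniprot_seq = {'id': 'uniprot_entry', 'pdbs': []}
kegg_seq = {'id': 'kegg_entry', 'pdbs': ['structA']}
manual_seq = {'id': 'manual_entry', 'pdbs': []}

print(pick_representative(uniprot=uniprot_seq, kegg=kegg_seq)['id'])
print(pick_representative(manual=manual_seq, uniprot=uniprot_seq, kegg=kegg_seq)['id'])
```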
-
set_representative_structure
(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]¶ Set the representative structure for each protein from a structure in its structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.
Parameters: - seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
- clean (bool) – If structures should be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
Todo
- Remedy large structure representative setting
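The allow_missing_on_termini example given for set_representative_structure (0.1 on a 100 AA reference leaves residues 5 to 95 to be checked) can be reproduced with a small calculation. The function checked_window is hypothetical; splitting the ignored fraction evenly between the two termini and truncating to whole residues are assumptions chosen to match that example.

```python
def checked_window(seq_length, allow_missing_on_termini):
    """Return (start, end) of the residue range that is checked for modifications."""
    ignored_total = seq_length * allow_missing_on_termini
    # Assumption: the ignored residues are split evenly between both termini
    per_terminus = int(ignored_total / 2)
    return per_terminus, seq_length - per_terminus


window = checked_window(100, 0.1)
print(window)  # (5, 95)
```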
-
structures_dir
¶ str – Directory where all structures are stored.
-
uniprot_mapping_and_metadata
(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶ Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.
Parameters: - model_gene_source (str) – the database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:
  - Ensembl Genomes - ENSEMBLGENOME_ID (i.e. E. coli b-numbers)
  - Entrez Gene (GeneID) - P_ENTREZGENEID
  - RefSeq Protein - P_REFSEQ_AC
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
-
write_representative_sequences_file
(outname, outdir=None, set_ids_from_model=True)[source]¶ Write all the model’s sequences as a single FASTA file. By default, sets IDs to model gene IDs.
Parameters: - outname (str) – Name of the output FASTA file without the extension
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_ids_from_model (bool) – If the gene ID source should be the model gene IDs, not the original sequence ID
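For readers unfamiliar with the output format, the sketch below writes a dictionary of gene IDs and sequences to a single FASTA file, using the gene IDs as record headers. It is a standard-library illustration, not the ssbio implementation: write_fasta is a hypothetical helper, the sequences are made up, and the .fasta extension and 60-character line width are assumptions.

```python
import os
import tempfile


def write_fasta(gene_to_seq, outname, outdir, width=60):
    """Write all sequences to <outdir>/<outname>.fasta and return the path."""
    outfile = os.path.join(outdir, outname + '.fasta')  # extension is an assumption
    with open(outfile, 'w') as f:
        for gene_id, seq in gene_to_seq.items():
            # Header line uses the model gene ID
            f.write('>{}\n'.format(gene_id))
            # Wrap sequence lines at a fixed width
            for i in range(0, len(seq), width):
                f.write(seq[i:i + width] + '\n')
    return outfile


sequences = {'geneA': 'MKT' * 30, 'geneB': 'MSEQ'}  # made-up sequences
path = write_fasta(sequences, 'representative_sequences', tempfile.mkdtemp())
with open(path) as f:
    content = f.read()
print(content)
```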
Further reading¶
For examples in which structures have been integrated into a GEM and utilized on a genome-scale, please see the following:
[1] | Zhang Y, Thiele I, Weekes D, Li Z, Jaroszewski L, Ginalski K, et al. Three-dimensional structural view of the central metabolic network of Thermotoga maritima. Science. 2009 Sep 18;325(5947):1544–9. Available from: http://dx.doi.org/10.1126/science.1174671 |
[2] | Brunk E, Mih N, Monk J, Zhang Z, O’Brien EJ, Bliven SE, et al. Systems biology of the structural proteome. BMC Syst Biol. 2016;10: 26. doi:10.1186/s12918-016-0271-6 |
[3] | Monk JM, Lloyd CJ, Brunk E, Mih N, Sastry A, King Z, et al. iML1515, a knowledgebase that computes Escherichia coli traits. Nat Biotechnol. 2017;35: 904–908. doi:10.1038/nbt.3956 |
[4] | Chang RL, Xie L, Xie L, Bourne PE, Palsson BØ. Drug off-target effects predicted using structural analysis in the context of a metabolic network model. PLoS Comput Biol. 2010 Sep 23;6(9):e1000938. Available from: http://dx.doi.org/10.1371/journal.pcbi.1000938 |
[5] | Chang RL, Andrews K, Kim D, Li Z, Godzik A, Palsson BO. Structural systems biology evaluation of metabolic thermotolerance in Escherichia coli. Science. 2013 Jun 7;340(6137):1220–3. Available from: http://dx.doi.org/10.1126/science.1234012 |
[6] | Chang RL, Xie L, Bourne PE, Palsson BO. Antibacterial mechanisms identified through structural systems pharmacology. BMC Syst Biol. 2013 Oct 10;7:102. Available from: http://dx.doi.org/10.1186/1752-0509-7-102 |
[7] | Mih N, Brunk E, Bordbar A, Palsson BO. A Multi-scale Computational Platform to Mechanistically Assess the Effect of Genetic Variation on Drug Responses in Human Erythrocyte Metabolism. PLoS Comput Biol. 2016;12: e1005039. doi:10.1371/journal.pcbi.1005039 |
[8] | Chen K, Gao Y, Mih N, O’Brien EJ, Yang L, Palsson BO. Thermosensitivity of growth is determined by chaperone-mediated proteome reallocation. Proceedings of the National Academy of Sciences. 2017;114: 11548–11553. doi:10.1073/pnas.1705524114 |
[9] | Yang L, Mih N, Yurkovich JT, Park JH, Seo S, Kim D, et al. Multi-scale model of the proteomic and metabolic consequences of reactive oxygen species. bioRxiv. 2017. p. 227892. doi:10.1101/227892 |
References¶
[10] | Bosi, E, Monk, JM, Aziz, RK, Fondi, M, Nizet, V, & Palsson, BO. (2016). Comparative genome-scale modelling of Staphylococcus aureus strains identifies strain-specific metabolic capabilities linked to pathogenicity. Proceedings of the National Academy of Sciences of the United States of America, 113/26: E3801–9. DOI: 10.1073/pnas.1523199113 |
[11] | Monk, JM, Koza, A, Campodonico, MA, Machado, D, Seoane, JM, Palsson, BO, Herrgård, MJ, et al. (2016). Multi-omics Quantification of Species Variation of Escherichia coli Links Molecular Features with Strain Phenotypes. Cell systems, 3/3: 238–51.e12. DOI: 10.1016/j.cels.2016.08.013 |
[12] | Ong, WK, Vu, TT, Lovendahl, KN, Llull, JM, Serres, MH, Romine, MF, & Reed, JL. (2014). Comparisons of Shewanella strains based on genome annotations, modeling, and experiments. BMC systems biology, 8: 31. DOI: 10.1186/1752-0509-8-31 |
The Protein Class¶

Introduction¶
This section will give an overview of the methods that can be executed for the Protein
class, which is a basic representation of a protein by a collection of amino acid sequences and 3D structures.
Tutorials¶
Protein - Structure Mapping, Alignments, and Visualization¶
This notebook gives an example of how to map a single protein sequence to its structure, along with conducting sequence alignments and visualizing the mutations.
Imports¶
In [ ]:
import sys
import logging
In [ ]:
# Import the Protein class
from ssbio.core.protein import Protein
In [ ]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.
- CRITICAL - Only really important messages shown
- ERROR - Major errors
- WARNING - Warnings that don’t affect running of the pipeline
- INFO (default) - Info such as the number of structures mapped per gene
- DEBUG - Really detailed information that will print out a lot of stuff
DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [ ]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [ ]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
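To see how the level threshold filters messages, here is a minimal standard-library sketch. The logger name level_demo, the ListHandler class, and the messages are made up for illustration and are not part of ssbio.

```python
import logging


class ListHandler(logging.Handler):
    """Collect log messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


def messages_shown(level):
    """Emit one message per level and return those that pass the threshold."""
    demo_logger = logging.getLogger("level_demo")  # hypothetical logger name
    demo_logger.propagate = False  # keep the demo out of the root logger
    handler = ListHandler()
    demo_logger.handlers = [handler]
    demo_logger.setLevel(level)
    demo_logger.debug("alignment details")
    demo_logger.info("3 structures mapped")
    demo_logger.warning("residue missing in structure")
    demo_logger.error("download failed")
    return handler.messages


shown_at_info = messages_shown(logging.INFO)
shown_at_error = messages_shown(logging.ERROR)
print(shown_at_info)
print(shown_at_error)
```

At INFO the debug message is dropped; at ERROR only the error message remains.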
Initialization of the project¶
Set these three things:
- ROOT_DIR - The directory where a folder named after your PROTEIN_ID will be created
- PROTEIN_ID - Your protein ID
- PROTEIN_SEQ - Your protein sequence
A directory will be created in ROOT_DIR with your PROTEIN_ID name. The folders are organized like so:
ROOT_DIR
└── PROTEIN_ID
    ├── sequences   # Protein sequence files, alignments, etc.
    └── structures  # Protein structure files, calculations, etc.
In [ ]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROTEIN_ID = 'SRR1753782_00918'
PROTEIN_SEQ = 'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
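The folder layout described above can be reproduced with the standard library alone. This is only a sketch of the resulting directory tree: ssbio creates these folders itself when a Protein is initialized with a root_dir, and make_protein_dirs is a hypothetical helper, not an ssbio function.

```python
import os
import tempfile


def make_protein_dirs(root_dir, protein_id):
    """Create ROOT_DIR/PROTEIN_ID/{sequences,structures} and return the paths."""
    protein_dir = os.path.join(root_dir, protein_id)
    subdirs = {}
    for name in ("sequences", "structures"):
        subdirs[name] = os.path.join(protein_dir, name)
        os.makedirs(subdirs[name], exist_ok=True)
    return protein_dir, subdirs


demo_root = tempfile.mkdtemp()  # throwaway root so nothing existing is touched
protein_dir, subdirs = make_protein_dirs(demo_root, "SRR1753782_00918")
print(sorted(os.listdir(protein_dir)))
```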
class ssbio.core.protein.Protein(ident, description=None, root_dir=None, pdb_file_type='mmtf')[source]
Store information about a protein, which represents the monomeric translated unit of a gene.
The main utilities of this class are to:
- Load, parse, and store the same (ie. from different database sources) or similar (ie. from different strains) protein sequences as SeqProp objects in the sequences attribute
- Load, parse, and store multiple experimental or predicted protein structures as StructProp objects in the structures attribute
- Set a single representative_sequence and representative_structure
- Calculate, store, and access pairwise sequence alignments to the representative sequence or structure
- Provide summaries of alignments and mutations seen
- Map between residue numbers of sequences and structures
Parameters:
- ident (str) – Unique identifier for this protein
- description (str) – Optional description for this protein
- root_dir (str) – Path to where the folder named by this protein’s ID will be created. Default is current working directory.
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
Todo
- Implement structural alignment objects with FATCAT
In [ ]:
# Create the Protein object
my_protein = Protein(ident=PROTEIN_ID, root_dir=ROOT_DIR, pdb_file_type='mmtf')
In [ ]:
# Load the protein sequence
# This sets the loaded sequence as the representative one
my_protein.load_manual_sequence(seq=PROTEIN_SEQ, ident='WT', write_fasta_file=True, set_as_representative=True)
Mapping sequence –> structure¶
Since the sequence has been provided, we just need to BLAST it to the PDB.
Protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]
BLAST the representative protein sequence to the PDB. Saves a raw BLAST result file (XML file).
Parameters:
- seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded XML files, must be set if protein directory was not initialized
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False.
Returns: List of new PDBProp objects added to the structures attribute
Return type: list
In [ ]:
# Mapping using BLAST
my_protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0.9, evalue=0.00001)
my_protein.df_pdb_blast.head()
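As a rough picture of what the two cutoffs in the call above do, the sketch below filters a list of hits by both thresholds. It assumes hits are kept when they meet the identity cutoff and do not exceed the E-value cutoff; the filter_hits helper, the dictionary keys, and the hit values are all made up for illustration and are not ssbio's internals or real BLAST output.

```python
def filter_hits(hits, seq_ident_cutoff=0.0, evalue=0.0001):
    """Keep hits at or above the identity cutoff and at or below the E-value cutoff."""
    return [
        hit for hit in hits
        if hit["identity"] >= seq_ident_cutoff and hit["evalue"] <= evalue
    ]


# Made-up hits for illustration only; IDs and values are not real BLAST results
toy_hits = [
    {"pdb_id": "hit_a", "identity": 0.98, "evalue": 1e-120},
    {"pdb_id": "hit_b", "identity": 0.91, "evalue": 1e-4},
    {"pdb_id": "hit_c", "identity": 0.42, "evalue": 1e-30},
]

# Same thresholds as the tutorial cell: identity >= 0.9 and E-value <= 0.00001
kept = [hit["pdb_id"] for hit in filter_hits(toy_hits, seq_ident_cutoff=0.9, evalue=0.00001)]
print(kept)
```

With the default thresholds all three toy hits pass; the stricter tutorial settings leave only the first.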
Downloading and ranking structures¶
Protein.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]
Download ALL mapped experimental structures to the protein structures directory.
Parameters:
- outdir (str) – Path to output directory, if protein structures directory not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
Returns: List of PDB IDs that were downloaded
Return type: list
Todo
- Parse mmtf or PDB file for header information, rather than always getting the cif file for header info
In [ ]:
# Download all mapped PDBs and gather the metadata
my_protein.pdb_downloader_and_metadata()
my_protein.df_pdb_metadata.head(2)
Protein.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, clean=True, keep_chemicals=None, skip_large_structures=False, force_rerun=False)[source]
Set a representative structure from a structure in the structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure:
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.
Parameters:
- seq_outdir (str) – Path to output directory of sequence alignment files, must be set if Protein directory was not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if Protein directory was not created initially
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- clean (bool) – If structure should be cleaned
- keep_chemicals (str, list) – Keep specified chemical names if structure is to be cleaned
- skip_large_structures (bool) – Default False. Currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to True.
- force_rerun (bool) – If sequence to structure alignment should be rerun
Returns: Representative structure from the list of structures. This is not a map to the original structure; it is copied and optionally cleaned from the original one.
Todo
- Remedy large structure representative setting
In [ ]:
# Set representative structures
my_protein.set_representative_structure()
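The allow_missing_on_termini example in the parameter list (0.1 on a 100-residue sequence leaves residues 5 to 95 to be checked) can be worked through in a few lines. This sketch only reproduces that documented arithmetic by splitting the ignored fraction evenly between the two termini; checked_window is a hypothetical helper and ssbio's exact boundary handling may differ.

```python
def checked_window(seq_length, allow_missing_on_termini):
    """Return (first, last) residue numbers of the region checked for modifications.

    Half of the ignored fraction is taken off each terminus.
    """
    ignored_per_terminus = int(round(seq_length * allow_missing_on_termini / 2))
    return ignored_per_terminus, seq_length - ignored_per_terminus


documented_example = checked_window(100, 0.1)
default_setting = checked_window(100, 0.2)  # 0.2 is the default in the signature above
print(documented_example)
print(default_setting)
```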
Loading and aligning new sequences¶
You can load additional sequences into this protein object and align them to the representative sequence.
Protein.load_manual_sequence(seq, ident=None, write_fasta_file=False, outdir=None, set_as_representative=False, force_rewrite=False)[source]
Load a manual sequence given as a string and optionally set it as the representative sequence. Also store it in the sequences attribute.
Parameters:
- seq (str, Seq, SeqRecord) – Sequence string, Biopython Seq or SeqRecord object
- ident (str) – Optional identifier for the sequence, required if seq is a string. Also will override existing IDs in Seq or SeqRecord objects if set.
- write_fasta_file (bool) – If this sequence should be written out to a FASTA file
- outdir (str) – Path to output directory
- set_as_representative (bool) – If this sequence should be set as the representative one
- force_rewrite (bool) – If the FASTA file should be overwritten if it already exists
Returns: Sequence that was loaded into the sequences attribute
In [ ]:
# Input your mutated sequence and load it
mutated_protein1_id = 'N17P_SNP'
mutated_protein1_seq = 'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
my_protein.load_manual_sequence(ident=mutated_protein1_id, seq=mutated_protein1_seq)
In [ ]:
# Input another mutated sequence and load it
mutated_protein2_id = 'Q4S_N17P_SNP'
mutated_protein2_seq = 'MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
my_protein.load_manual_sequence(ident=mutated_protein2_id, seq=mutated_protein2_seq)
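The IDs N17P_SNP and Q4S_N17P_SNP name the substitutions each mutated sequence carries relative to the wild type. A quick way to confirm this without any alignment software is a position-by-position comparison, sketched below on the first 20 residues of each sequence. point_mutations is a hypothetical helper that only handles substitutions between equal-length sequences; it is not how ssbio parses alignments.

```python
def point_mutations(reference, variant):
    """List substitutions as (reference residue, 1-based position, variant residue)."""
    if len(reference) != len(variant):
        raise ValueError("this sketch only handles substitutions, so lengths must match")
    return [
        (ref_aa, position, var_aa)
        for position, (ref_aa, var_aa) in enumerate(zip(reference, variant), start=1)
        if ref_aa != var_aa
    ]


wt_prefix = "MSKQQIGVVGMAVMGRNLAL"        # first 20 residues of PROTEIN_SEQ
n17p_prefix = "MSKQQIGVVGMAVMGRPLAL"      # first 20 residues of mutated_protein1_seq
q4s_n17p_prefix = "MSKSQIGVVGMAVMGRPLAL"  # first 20 residues of mutated_protein2_seq

mutations_1 = point_mutations(wt_prefix, n17p_prefix)
mutations_2 = point_mutations(wt_prefix, q4s_n17p_prefix)
print(mutations_1)
print(mutations_2)
```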
Protein.pairwise_align_sequences_to_representative(gapopen=10, gapextend=0.5, outdir=None, engine='needle', parse=True, force_rerun=False)[source]
Pairwise align all sequences in the sequences attribute to the representative sequence. Stores the alignments in the sequence_alignments DictList attribute.
Parameters:
- gapopen (int) – Only for engine='needle' - Gap open penalty is the score taken away when a gap is created
- gapextend (float) – Only for engine='needle' - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
- outdir (str) – Only for engine='needle' - Path to output directory. Default is the protein sequence directory.
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
- force_rerun (bool) – Only for engine='needle' - Default False, set to True if you want to rerun the alignment if outfile exists.
In [ ]:
# Conduct pairwise sequence alignments
my_protein.pairwise_align_sequences_to_representative()
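To make the gapopen and gapextend defaults concrete, the sketch below computes the penalty for a gap of a given length under one common affine-gap convention: the first gapped position costs gapopen and every additional position costs gapextend. That convention is an assumption here, so check the EMBOSS needle documentation before relying on exact scores; affine_gap_cost is a hypothetical helper.

```python
def affine_gap_cost(gap_length, gapopen=10, gapextend=0.5):
    """Penalty for one gap: gapopen for the first position, gapextend for each further one."""
    if gap_length <= 0:
        return 0.0
    return float(gapopen + (gap_length - 1) * gapextend)


single_position_gap = affine_gap_cost(1)
five_position_gap = affine_gap_cost(5)
print(single_position_gap, five_position_gap)
```

Under this convention one long gap is much cheaper than several short ones, which is why alignments tend to group missing residues together.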
In [ ]:
# View IDs of all sequence alignments
[x.id for x in my_protein.sequence_alignments]
# View the stored information for one of the alignments
my_alignment = my_protein.sequence_alignments.get_by_id('SRR1753782_00918_N17P_SNP')
my_alignment.annotations
str(my_alignment[0].seq)
str(my_alignment[1].seq)
Protein.sequence_mutation_summary(alignment_ids=None, alignment_type=None)[source]
Summarize all mutations found in the sequence_alignments attribute.
Returns 2 dictionaries, single_counter and fingerprint_counter.
- single_counter: Dictionary of {point mutation: list of genes/strains}
  Example: { ('A', 24, 'V'): ['Strain1', 'Strain2', 'Strain4'], ('R', 33, 'T'): ['Strain2'] }
  Here, we report which genes/strains have the single point mutation.
- fingerprint_counter: Dictionary of {mutation group: list of genes/strains}
  Example: { (('A', 24, 'V'), ('R', 33, 'T')): ['Strain2'], (('A', 24, 'V')): ['Strain1', 'Strain4'] }
  Here, we report which genes/strains have the specific combinations (or “fingerprints”) of point mutations
Parameters:
- alignment_ids (str, list) – Specified alignment ID or IDs to use
- alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object, seqalign or structalign are the current types.
Returns: single_counter, fingerprint_counter
Return type: dict, dict
In [ ]:
# Summarize all the mutations in all sequence alignments
s,f = my_protein.sequence_mutation_summary(alignment_type='seqalign')
print('Single mutations:')
s
print('---------------------')
print('Mutation fingerprints')
f
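The two dictionaries returned above can be pictured with a small standard-library sketch that rebuilds the documented example from per-strain mutation lists. summarize_mutations is a hypothetical helper, not ssbio's implementation, and the strain names are the ones used in the docstring example. A one-mutation fingerprint is written here as a proper one-element tuple, (('A', 24, 'V'),).

```python
from collections import defaultdict


def summarize_mutations(mutations_by_strain):
    """Group strains by single point mutation and by full mutation fingerprint."""
    single_counter = defaultdict(list)
    fingerprint_counter = defaultdict(list)
    for strain, mutations in mutations_by_strain.items():
        if not mutations:
            continue  # strains identical to the reference contribute nothing
        for mutation in mutations:
            single_counter[mutation].append(strain)
        fingerprint_counter[tuple(mutations)].append(strain)
    return dict(single_counter), dict(fingerprint_counter)


# Per-strain mutations matching the docstring example
toy_mutations = {
    "Strain1": [("A", 24, "V")],
    "Strain2": [("A", 24, "V"), ("R", 33, "T")],
    "Strain4": [("A", 24, "V")],
}
single_counter, fingerprint_counter = summarize_mutations(toy_mutations)
print(single_counter)
print(fingerprint_counter)
```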
Some additional methods¶
In [ ]:
import ssbio.databases.uniprot
In [ ]:
this_examples_uniprot = 'P14062'
sites = ssbio.databases.uniprot.uniprot_sites(this_examples_uniprot)
my_protein.representative_sequence.features = sites
my_protein.representative_sequence.features
Mapping sequence residue numbers to structure residue numbers¶
Protein.map_seqprop_resnums_to_structprop_resnums(resnums, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]
Map a residue number in any SeqProp to the structure’s residue number for a specified chain.
Parameters:
- resnums (int, list) – Residue numbers in the sequence
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – Chain ID to map to
- use_representatives (bool) – If the representative sequence and structure should be used. If True, seqprop, structprop, and chain_id do not need to be defined.
Returns: Mapping of sequence residue numbers to structure residue numbers
Return type: dict
In [ ]:
# Returns a dictionary mapping sequence residue numbers to structure residue identifiers
# Will warn you if residues are not present in the structure
structure_sites = my_protein.map_seqprop_resnums_to_structprop_resnums(resnums=[1,3,45],
use_representatives=True)
structure_sites
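Conceptually, this kind of mapping walks along a sequence-to-structure alignment and pairs up positions where both have a residue. The sketch below shows the idea on a toy alignment in which the structure lacks the first two residues and numbers its residues from 103. map_resnums, the alignment strings, and the numbering are all made up for illustration and are not ssbio's implementation.

```python
def map_resnums(seq_aln, struct_aln, struct_resnums, resnums):
    """Map 1-based sequence residue numbers to structure residue numbers.

    seq_aln and struct_aln are equal-length aligned strings with '-' for gaps;
    struct_resnums lists the structure's residue numbers in chain order.
    Residues that are absent from the structure are left out of the result.
    """
    mapping = {}
    seq_pos = 0     # sequence residues consumed so far
    struct_idx = 0  # structure residues consumed so far
    for seq_char, struct_char in zip(seq_aln, struct_aln):
        if seq_char != "-":
            seq_pos += 1
        if struct_char != "-":
            struct_idx += 1
        if seq_char != "-" and struct_char != "-" and seq_pos in resnums:
            mapping[seq_pos] = struct_resnums[struct_idx - 1]
    return mapping


seq_aln = "MSKQQIGVVG"
struct_aln = "--KQQIGVVG"               # first two residues unresolved in the toy structure
struct_resnums = list(range(103, 111))  # toy structure numbering: 103..110

toy_mapping = map_resnums(seq_aln, struct_aln, struct_resnums, resnums=[1, 3, 8])
print(toy_mapping)
```

Residue 1 has no counterpart in the toy structure, so it is simply missing from the returned dictionary.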
Viewing structures¶
The nglview package is used as a backend for viewing structures within a Jupyter notebook. ssbio view functions will either return an NGLWidget object, which is the same as using nglview like the example below, or act upon the widget object itself.
# This is how NGLview usually works - it will load a structure file and return a NGLWidget "view" object.
import nglview
view = nglview.show_structure_file(my_protein.representative_structure.structure_path)
view
StructProp.view_structure(only_chains=None, opacity=1.0, recolor=False, gui=False)[source]
Use NGLviewer to display a structure in a Jupyter notebook
Parameters:
- only_chains (str, list) – Chain ID or IDs to display
- opacity (float) – Opacity of the structure
- recolor (bool) – If structure should be cleaned and recolored to silver
- gui (bool) – If the NGLview GUI should show up
Returns: NGLviewer object
In [ ]:
# View just the structure
view = my_protein.representative_structure.view_structure()
view
Protein.add_mutations_to_nglview(view, alignment_type='seqalign', alignment_ids=None, seqprop=None, structprop=None, chain_id=None, use_representatives=False, grouped=False, color='red', unique_colors=True, opacity_range=(0.8, 1), scale_range=(1, 5))[source]
Add representations to an NGLWidget view object for residues that are mutated in the sequence_alignments attribute.
Parameters:
- view (NGLWidget) – NGLWidget view object
- alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object, seqalign or structalign are the current types.
- alignment_ids (str, list) – Specified alignment ID or IDs to use
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – ID of the structure’s chain to get annotation from
- use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
- grouped (bool) – If groups of mutations should be colored and sized together
- color (str) – Color of the mutations (overridden if unique_colors=True)
- unique_colors (bool) – If each mutation/mutation group should be colored uniquely
- opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
- scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)
In [ ]:
# Map the mutations on the visualization (scale increased) - will show up on the above view
my_protein.add_mutations_to_nglview(view=view, alignment_type='seqalign', scale_range=(4,7),
use_representatives=True)
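The scale_range and opacity_range arguments say that mutations seen more often are drawn bigger and more opaque. One simple way to picture this is a linear rescaling of mutation counts into the requested range, sketched below. Linear interpolation is an assumption made for illustration; the rescale helper and the counts are hypothetical and ssbio's own scaling may differ.

```python
def rescale(counts, out_range=(1, 5)):
    """Linearly map counts onto out_range so the most frequent item gets the maximum."""
    low, high = out_range
    min_count, max_count = min(counts.values()), max(counts.values())
    if max_count == min_count:
        # All items equally frequent: give every one the maximum value
        return {key: float(high) for key in counts}
    span = max_count - min_count
    return {
        key: low + (high - low) * (count - min_count) / span
        for key, count in counts.items()
    }


# Hypothetical number of sequences carrying each mutation
toy_counts = {("A", 24, "V"): 3, ("N", 17, "P"): 2, ("R", 33, "T"): 1}
scales = rescale(toy_counts, out_range=(4, 7))  # same range as the cell above
print(scales)
```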
Protein.add_features_to_nglview(view, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]
Add select features from the selected SeqProp object to an NGLWidget view object.
Currently parsing for:
- Single residue features (ie. metal binding sites)
- Disulfide bonds
Parameters:
- view (NGLWidget) – NGLWidget view object
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – ID of the structure’s chain to get annotation from
- use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
In [ ]:
# Add sites as shown above in the table to the view
my_protein.add_features_to_nglview(view=view, use_representatives=True)
Saving¶
Protein.save_json(outfile, compression=False)
Save the object as a JSON file using json_tricks
In [ ]:
import os.path as op
my_protein.save_json(op.join(my_protein.protein_dir, '{}.json'.format(my_protein.id)))
Features¶
- Load, parse, and store the same (ie. from different database sources) or similar (ie. from different strains) protein sequences as SeqProp objects in the sequences attribute
- Load, parse, and store multiple experimental or predicted protein structures as StructProp objects in the structures attribute
- Set a single representative_sequence and representative_structure
- Calculate, store, and access pairwise sequence alignments to the representative sequence or structure
- Provide summaries of alignments and mutations seen
- Map between residue numbers of sequences and structures
API¶
Protein¶
class ssbio.core.protein.Protein(ident, description=None, root_dir=None, pdb_file_type='mmtf')[source]¶
Store information about a protein, which represents the monomeric translated unit of a gene.
The main utilities of this class are to:
- Load, parse, and store the same (ie. from different database sources) or similar (ie. from different strains) protein sequences as SeqProp objects in the sequences attribute
- Load, parse, and store multiple experimental or predicted protein structures as StructProp objects in the structures attribute
- Set a single representative_sequence and representative_structure
- Calculate, store, and access pairwise sequence alignments to the representative sequence or structure
- Provide summaries of alignments and mutations seen
- Map between residue numbers of sequences and structures
Parameters:
- ident (str) – Unique identifier for this protein
- description (str) – Optional description for this protein
- root_dir (str) – Path to where the folder named by this protein’s ID will be created. Default is current working directory.
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
Todo
- Implement structural alignment objects with FATCAT
add_features_to_nglview(view, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]¶
Add select features from the selected SeqProp object to an NGLWidget view object.
Currently parsing for:
- Single residue features (ie. metal binding sites)
- Disulfide bonds
Parameters:
- view (NGLWidget) – NGLWidget view object
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – ID of the structure’s chain to get annotation from
- use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
add_fingerprint_to_nglview(view, fingerprint, seqprop=None, structprop=None, chain_id=None, use_representatives=False, color='red', opacity_range=(0.8, 1), scale_range=(1, 5))[source]¶
Add representations to an NGLWidget view object for residues that are mutated in the sequence_alignments attribute.
Parameters:
- view (NGLWidget) – NGLWidget view object
- fingerprint (dict) – Single mutation group from the sequence_mutation_summary function
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – ID of the structure’s chain to get annotation from
- use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
- color (str) – Color of the mutations (overridden if unique_colors=True)
- opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
- scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)
add_mutations_to_nglview(view, alignment_type='seqalign', alignment_ids=None, seqprop=None, structprop=None, chain_id=None, use_representatives=False, grouped=False, color='red', unique_colors=True, opacity_range=(0.8, 1), scale_range=(1, 5))[source]¶
Add representations to an NGLWidget view object for residues that are mutated in the sequence_alignments attribute.
Parameters:
- view (NGLWidget) – NGLWidget view object
- alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object, seqalign or structalign are the current types.
- alignment_ids (str, list) – Specified alignment ID or IDs to use
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – ID of the structure’s chain to get annotation from
- use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
- grouped (bool) – If groups of mutations should be colored and sized together
- color (str) – Color of the mutations (overridden if unique_colors=True)
- unique_colors (bool) – If each mutation/mutation group should be colored uniquely
- opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
- scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)
align_seqprop_to_structprop(seqprop, structprop, chains=None, outdir=None, engine='needle', structure_already_parsed=False, parse=True, force_rerun=False, **kwargs)[source]¶
Run and store alignments of a SeqProp to chains in the mapped_chains attribute of a StructProp.
Alignments are stored in the sequence_alignments attribute, with the IDs formatted as <SeqProp_ID>_<StructProp_ID>-<Chain_ID>. Although it is more intuitive to align to individual ChainProps, StructProps should be loaded as little as possible to reduce run times so the alignment is done to the entire structure.
Parameters:
- seqprop (SeqProp) – SeqProp object with a loaded sequence
- structprop (StructProp) – StructProp object with a loaded structure
- chains (str, list) – Chain ID or IDs to map to. If not specified, mapped_chains attribute is inspected for chains. If no chains there, all chains will be aligned to.
- outdir (str) – Directory to output sequence alignment files (only if running with needle)
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- structure_already_parsed (bool) – If the structure has already been parsed and the chain sequences are stored. Temporary option until Hadoop sequence file is implemented to reduce number of times a structure is parsed.
- parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
- force_rerun (bool) – If alignments should be rerun
- **kwargs – Other alignment options
Todo
- Document **kwargs for alignment options
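The alignment ID format <SeqProp_ID>_<StructProp_ID>-<Chain_ID> described for align_seqprop_to_structprop can be built and taken apart with plain string operations, as in the sketch below. The helper name and the structure ID 1abc are placeholders for illustration; only the format itself comes from the documentation above.

```python
def seq_struct_alignment_id(seqprop_id, structprop_id, chain_id):
    """Build an ID of the form <SeqProp_ID>_<StructProp_ID>-<Chain_ID>."""
    return "{}_{}-{}".format(seqprop_id, structprop_id, chain_id)


# '1abc' and chain 'A' are placeholder values, not a structure mapped in the tutorial
alignment_id = seq_struct_alignment_id("SRR1753782_00918", "1abc", "A")
print(alignment_id)

# The chain ID is whatever follows the last hyphen
prefix, chain = alignment_id.rsplit("-", 1)
print(prefix, chain)
```

Because sequence IDs can themselves contain underscores, splitting on the last hyphen is the only unambiguous step when parsing such an ID back.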
blast_representative_sequence_to_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]¶
BLAST the representative protein sequence to the PDB. Saves a raw BLAST result file (XML file).
Parameters:
- seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded XML files, must be set if protein directory was not initialized
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False.
Returns: List of new PDBProp objects added to the structures attribute
Return type: list
check_structure_chain_quality(seqprop, structprop, chain_id, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True)[source]¶
Report if a structure’s chain meets the defined cutoffs for sequence quality.
df_homology_models¶
DataFrame – Get a dataframe of I-TASSER homology model results
df_pdb_blast¶
DataFrame – Get a dataframe of PDB BLAST results
df_pdb_metadata¶
DataFrame – Get a dataframe of PDB metadata (PDBs have to be downloaded first)
df_pdb_ranking¶
DataFrame – Get a dataframe of UniProt -> best structure in PDB results
download_all_pdbs(outdir=None, pdb_file_type=None, load_metadata=False, force_rerun=False)[source]¶
Downloads all structures from the PDB. load_metadata flag sets if metadata should be parsed and stored in StructProp, otherwise filepaths are just linked
filter_sequences(seq_type)[source]¶
Return a DictList of only specified types in the sequences attribute.
Parameters: seq_type (SeqProp) – Object type
Returns: A filtered DictList of specified object type only
Return type: DictList
find_disulfide_bridges(representative_only=True)[source]¶
Run Biopython’s disulfide bridge finder and store found bridges.
Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']
Parameters: representative_only (bool) – If analysis should only be run on the representative structure
find_representative_chain(seqprop, structprop, chains_to_check=None, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True)[source]¶
Set and return the representative chain based on sequence quality checks to a reference sequence.
Parameters:
- seqprop (SeqProp) – SeqProp object to compare to chain sequences
- structprop (StructProp) – StructProp object with chains to compare to in the mapped_chains attribute. If there are none present, chains_to_check can be specified, otherwise all chains are checked.
- chains_to_check (str, list) – Chain ID or IDs to check for sequence coverage quality
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
Returns: the best chain ID, if any
Return type: str
get_dssp_annotations(representative_only=True, force_rerun=False)[source]¶
Run DSSP on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']
Parameters:
- representative_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
Todo
- Some errors arise from storing annotations for nonstandard amino acids, need to run DSSP separately for those
get_experimental_structures()[source]¶
DictList: Return a DictList of all experimental structures in self.structures
get_freesasa_annotations(include_hetatms=False, representative_only=True, force_rerun=False)[source]¶
Run freesasa on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-freesasa']
Parameters:
- include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
- representative_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
get_homology_models()[source]¶
DictList: Return a DictList of all homology models in self.structures
-
get_msms_annotations
(representative_only=True, force_rerun=False)[source]¶ Run MSMS on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-msms']
Parameters: - representative_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_residue_annotations
(seq_resnum, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]¶ Get all residue-level annotations stored in the SeqProp letter_annotations field for a given residue number.
Uses the representative sequence, structure, and chain ID stored by default. If other properties from other structures are desired, input the proper IDs. An alignment for the given sequence to the structure must be present in the sequence_alignments list.
Parameters: - seq_resnum (int) – Residue number in the sequence
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – ID of the structure’s chain to get annotation from
- use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
Returns: All available letter_annotations for this residue number
Return type: dict
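To illustrate what a per-residue lookup of this kind returns, here is a minimal sketch that does not use ssbio. It assumes the layout of Biopython's SeqRecord.letter_annotations, where each key maps to a list with one entry per residue. The helper name residue_annotations and the toy values are hypothetical.

```python
def residue_annotations(letter_annotations, resnum):
    """Collect every per-residue annotation for a 1-based residue number.

    letter_annotations: dict of {key: sequence-length list}, the layout used
    by Biopython SeqRecord.letter_annotations. Hypothetical helper, not the
    ssbio implementation.
    """
    index = resnum - 1  # residue numbers are 1-based, Python lists are 0-based
    return {key: values[index] for key, values in letter_annotations.items()}


# Toy annotations for a five-residue sequence (values are made up)
toy_annotations = {
    'SS-dssp': ['-', 'H', 'H', 'H', '-'],
    'RSA-dssp': [0.9, 0.4, 0.1, 0.3, 0.8],
}
result = residue_annotations(toy_annotations, 3)
print(result)
```

The real method additionally maps the sequence residue number onto the structure's chain through a stored alignment, which this sketch leaves out.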
-
get_seqprop_to_structprop_alignment_stats
(seqprop, structprop, chain_id)[source]¶ Get the sequence alignment information for a sequence to a structure’s chain.
-
get_sequence_properties
(representative_only=True)[source]¶ Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of the protein sequences. Results are stored in the .annotations attribute of the protein’s respective SeqProp objects.
Parameters: representative_only (bool) – If analysis should only be run on the representative sequence
-
load_itasser_folder
(ident, itasser_folder, organize=False, outdir=None, organize_name=None, set_as_representative=False, representative_chain='X', force_rerun=False)[source]¶ Load the results folder from an I-TASSER run (local, not from the website) and copy relevant files over to the protein structures directory.
Parameters: - ident (str) – I-TASSER ID
- itasser_folder (str) – Path to results folder
- organize (bool) – If select files from modeling should be copied to the Protein directory
- outdir (str) – Path to directory where files will be copied and organized to
- organize_name (str) – Basename of files to rename results to. If not provided, will use id attribute.
- set_as_representative – If this structure should be set as the representative structure
- representative_chain (str) – If set_as_representative is True, provide the representative chain ID
- force_rerun (bool) – If the PDB should be reloaded if it is already in the list of structures
Returns: The object that is now contained in the structures attribute
Return type:
-
load_kegg
(kegg_id, kegg_organism_code=None, kegg_seq_file=None, kegg_metadata_file=None, set_as_representative=False, download=False, outdir=None, force_rerun=False)[source]¶ Load a KEGG ID, sequence, and metadata files into the sequences attribute.
Parameters: - kegg_id (str) – KEGG ID
- kegg_organism_code (str) – KEGG organism code to prepend to the kegg_id if not part of it already. Example: in eco:b1244, eco is the organism code
- kegg_seq_file (str) – Path to KEGG FASTA file
- kegg_metadata_file (str) – Path to KEGG metadata file (raw KEGG format)
- set_as_representative (bool) – If this KEGG ID should be set as the representative sequence
- download (bool) – If the KEGG sequence and metadata files should be downloaded if not provided
- outdir (str) – Where the sequence and metadata files should be downloaded to
- force_rerun (bool) – If ID should be reloaded and files redownloaded
Returns: object contained in the sequences attribute
Return type:
-
load_manual_sequence
(seq, ident=None, write_fasta_file=False, outdir=None, set_as_representative=False, force_rewrite=False)[source]¶ Load a manual sequence given as a string and optionally set it as the representative sequence. Also store it in the sequences attribute.
Parameters: - seq (str, Seq, SeqRecord) – Sequence string, Biopython Seq or SeqRecord object
- ident (str) – Optional identifier for the sequence, required if seq is a string. Also will override existing IDs in Seq or SeqRecord objects if set.
- write_fasta_file (bool) – If this sequence should be written out to a FASTA file
- outdir (str) – Path to output directory
- set_as_representative (bool) – If this sequence should be set as the representative one
- force_rewrite (bool) – If the FASTA file should be overwritten if it already exists
Returns: Sequence that was loaded into the sequences attribute
Return type:
-
load_manual_sequence_file
(ident, seq_file, copy_file=False, outdir=None, set_as_representative=False)[source]¶ Load a manual sequence, given as a FASTA file and optionally set it as the representative sequence. Also store it in the sequences attribute.
Parameters: - ident (str) – Sequence ID
- seq_file (str) – Path to sequence FASTA file
- copy_file (bool) – If the FASTA file should be copied to the protein’s sequences folder, or to outdir if the protein folder has not been set
- outdir (str) – Path to output directory
- set_as_representative (bool) – If this sequence should be set as the representative one
Returns: Sequence that was loaded into the sequences attribute
Return type:
-
load_pdb
(pdb_id, mapped_chains=None, pdb_file=None, file_type=None, is_experimental=True, set_as_representative=False, representative_chain=None, force_rerun=False)[source]¶ Load a structure ID and optional structure file into the structures attribute.
Parameters: - pdb_id (str) – PDB ID
- mapped_chains (str, list) – Chain ID or list of IDs which you are interested in
- pdb_file (str) – Path to PDB file
- file_type (str) – Type of PDB file
- is_experimental (bool) – If this structure file is experimental
- set_as_representative (bool) – If this structure should be set as the representative structure
- representative_chain (str) – If set_as_representative is True, provide the representative chain ID
- force_rerun (bool) – If the PDB should be reloaded if it is already in the list of structures
Returns: The object that is now contained in the structures attribute
Return type:
-
load_uniprot
(uniprot_id, uniprot_seq_file=None, uniprot_xml_file=None, download=False, outdir=None, set_as_representative=False, force_rerun=False)[source]¶ Load a UniProt ID and associated sequence/metadata files into the sequences attribute.
Sequence and metadata files can be provided, or alternatively downloaded with the download flag set to True. Metadata files will be downloaded as XML files.
Parameters: - uniprot_id (str) – UniProt ID/ACC
- uniprot_seq_file (str) – Path to FASTA file
- uniprot_xml_file (str) – Path to UniProt XML file
- download (bool) – If sequence and metadata files should be downloaded
- outdir (str) – Output directory for sequence and metadata files
- set_as_representative (bool) – If this sequence should be set as the representative one
- force_rerun (bool) – If files should be redownloaded and metadata reloaded
Returns: Sequence that was loaded into the sequences attribute
Return type:
-
map_seqprop_resnums_to_structprop_resnums
(resnums, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]¶ Map a residue number in any SeqProp to the structure’s residue number for a specified chain.
Parameters: - resnums (int, list) – Residue numbers in the sequence
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – Chain ID to map to
- use_representatives (bool) – If the representative sequence and structure should be used. If True, seqprop, structprop, and chain_id do not need to be defined.
Returns: Mapping of sequence residue numbers to structure residue numbers
Return type: dict
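The idea behind this kind of residue-number mapping can be sketched from a gapped pairwise alignment. The example below is an illustration only and does not call ssbio. The helper name map_aligned_resnums, its arguments, and the toy sequences are assumptions.

```python
def map_aligned_resnums(aligned_seq, aligned_struct, struct_resnums):
    """Map 1-based sequence residue numbers to structure residue numbers.

    aligned_seq / aligned_struct: gapped strings of equal length ('-' is a gap).
    struct_resnums: residue numbers of the structure chain, in order.
    Hypothetical helper, not the ssbio implementation.
    """
    if len(aligned_seq) != len(aligned_struct):
        raise ValueError('Aligned strings must have the same length')
    mapping = {}
    seq_pos = 0     # 1-based counter over sequence residues
    struct_idx = 0  # index into struct_resnums
    for seq_char, struct_char in zip(aligned_seq, aligned_struct):
        if seq_char != '-':
            seq_pos += 1
        if struct_char != '-':
            if seq_char != '-':
                # Both columns hold a residue, so the positions correspond
                mapping[seq_pos] = struct_resnums[struct_idx]
            struct_idx += 1
    return mapping


# Toy case: the structure lacks the first two residues and is numbered from 10
mapping = map_aligned_resnums('MKTAYI', '--TAYI', [10, 11, 12, 13])
print(mapping)
```

Sequence residues that fall in a gap of the structure (here residues 1 and 2) simply have no entry in the returned dictionary.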
-
map_structprop_resnums_to_seqprop_resnums
(resnums, structprop=None, chain_id=None, seqprop=None, use_representatives=False)[source]¶ Map a residue number in any StructProp + chain ID to any SeqProp’s residue number.
Parameters: - resnums (int, list) – Residue numbers in the structure
- structprop (StructProp) – StructProp object
- chain_id (str) – Chain ID to map from
- seqprop (SeqProp) – SeqProp object
- use_representatives (bool) – If the representative sequence and structure should be used. If True, seqprop, structprop, and chain_id do not need to be defined.
Returns: Mapping of structure residue numbers to sequence residue numbers
Return type: dict
-
map_uniprot_to_pdb
(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]¶ Map the representative sequence’s UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to the protein sequences folder.
The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html. It returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the same, resolution.
Parameters: - seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
- outdir (str) – Output directory to cache JSON results of search
- force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns: A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type: list
-
num_sequences
¶ int – Return the total number of sequences
-
num_structures
¶ int – Return the total number of structures
-
num_structures_experimental
¶ int – Return the total number of experimental structures
-
num_structures_homology
¶ int – Return the total number of homology models
-
pairwise_align_sequences_to_representative
(gapopen=10, gapextend=0.5, outdir=None, engine='needle', parse=True, force_rerun=False)[source]¶ Pairwise align all sequences in the sequences attribute to the representative sequence. Stores the alignments in the sequence_alignments DictList attribute.
Parameters: - gapopen (int) – Only for engine='needle' - Gap open penalty is the score taken away when a gap is created
- gapextend (float) – Only for engine='needle' - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
- outdir (str) – Only for engine='needle' - Path to output directory. Default is the protein sequence directory.
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
- force_rerun (bool) – Only for engine='needle' - Default False, set to True if you want to rerun the alignment if outfile exists.
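The effect of the gapopen and gapextend defaults can be seen with a small affine gap penalty calculation. This is a sketch under one common convention, in which the first gap position costs gapopen and each further position costs gapextend. Some tools also charge the extension on the first position, so treat the exact numbers as an assumption. The function name affine_gap_penalty is hypothetical.

```python
def affine_gap_penalty(gap_length, gapopen=10, gapextend=0.5):
    """Penalty for one contiguous gap under an affine model.

    Assumes the first gap position costs gapopen and each further position
    costs gapextend; conventions differ between alignment tools.
    """
    if gap_length <= 0:
        return 0.0
    return gapopen + (gap_length - 1) * gapextend


one_long_gap = affine_gap_penalty(4)         # a single gap of four residues
four_short_gaps = 4 * affine_gap_penalty(1)  # four separate one-residue gaps
print(one_long_gap, four_short_gaps)
```

With a high opening cost and a low extension cost, one long gap is penalized far less than several short ones, which favors alignments that group insertions and deletions together.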
-
pairwise_align_sequences_to_representative_parallelize
(sc, gapopen=10, gapextend=0.5, outdir=None, engine='needle', parse=True, force_rerun=False)[source]¶ Pairwise align all sequences in the sequences attribute to the representative sequence. Stores the alignments in the sequence_alignments DictList attribute.
Parameters: - sc (SparkContext) – Configured spark context for parallelization
- gapopen (int) – Only for engine='needle' - Gap open penalty is the score taken away when a gap is created
- gapextend (float) – Only for engine='needle' - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
- outdir (str) – Only for engine='needle' - Path to output directory. Default is the protein sequence directory.
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
- force_rerun (bool) – Only for engine='needle' - Default False, set to True if you want to rerun the alignment if outfile exists.
-
parse_all_stored_structures
(outdir=None, pdb_file_type=None, force_rerun=False)[source]¶ Runs parse_structure for any stored structure with a file available
-
pdb_downloader_and_metadata
(outdir=None, pdb_file_type=None, force_rerun=False)[source]¶ Download ALL mapped experimental structures to the protein structures directory.
Parameters: - outdir (str) – Path to output directory, if protein structures directory not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
Returns: List of PDB IDs that were downloaded
Return type: list
Todo
- Parse mmtf or PDB file for header information, rather than always getting the cif file for header info
-
pdb_file_type
= None¶ str – pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz - choose a file type for files downloaded from the PDB
-
prep_itasser_modeling
(itasser_installation, itlib_folder, runtype, create_in_dir=None, execute_from_dir=None, print_exec=False, **kwargs)[source]¶ Prepare to run I-TASSER homology modeling for the representative sequence.
Parameters: - itasser_installation (str) – Path to I-TASSER folder, e.g. ~/software/I-TASSER4.4
- itlib_folder (str) – Path to ITLIB folder, e.g. ~/software/ITLIB
- runtype – How you will be running I-TASSER - local, slurm, or torque
- create_in_dir (str) – Local directory where folders will be created
- execute_from_dir (str) – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
- all_genes (bool) – If all genes should be prepped, or only those without any mapped structures
- print_exec (bool) – If the execution statement should be printed to run modelling
Todo
- Document kwargs - extra options for I-TASSER, SLURM or Torque execution
- Allow modeling of any sequence in sequences attribute, select by ID or provide SeqProp?
-
protein_dir
¶ str – Protein folder
-
protein_statistics
¶ Get a dictionary of basic statistics describing this protein
-
representative_chain
= None¶ str – Chain ID in the representative structure which best represents a sequence
-
representative_chain_seq_coverage
= None¶ float – Percent identity of sequence coverage for the representative chain
-
representative_sequence
= None¶ SeqProp – Sequence set to represent this protein
-
representative_structure
= None¶ StructProp – Structure set to represent this protein, usually in monomeric form
-
root_dir
¶ str – Path to where the folder named by this protein’s ID will be created. Default is current working directory.
-
sequence_alignments
= None¶ DictList – Pairwise or multiple sequence alignments stored as Bio.Align.MultipleSeqAlignment objects
-
sequence_dir
¶ str – Directory where sequence related files are stored
-
sequence_mutation_summary
(alignment_ids=None, alignment_type=None)[source]¶ Summarize all mutations found in the sequence_alignments attribute.
Returns 2 dictionaries, single_counter and fingerprint_counter.
- single_counter:
Dictionary of {point mutation: list of genes/strains}
Example: { ('A', 24, 'V'): ['Strain1', 'Strain2', 'Strain4'], ('R', 33, 'T'): ['Strain2'] }
Here, we report which genes/strains have the single point mutation.
- fingerprint_counter:
Dictionary of {mutation group: list of genes/strains}
Example: { (('A', 24, 'V'), ('R', 33, 'T')): ['Strain2'], (('A', 24, 'V'),): ['Strain1', 'Strain4'] }
Here, we report which genes/strains have the specific combinations (or “fingerprints”) of point mutations.
Parameters: - alignment_ids (str, list) – Specified alignment ID or IDs to use
- alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object; seqalign or structalign are the current types.
Returns: single_counter, fingerprint_counter
Return type: dict, dict
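The relationship between the two dictionaries may be easier to see in a small, self-contained sketch. It rebuilds both counters from a toy mapping of strain to point mutations. The helper name summarize_mutations and the input format are assumptions for illustration, not the ssbio implementation.

```python
from collections import defaultdict


def summarize_mutations(mutations_by_strain):
    """Build both counters from {strain: [(ref, resnum, alt), ...]}.

    Hypothetical helper, not the ssbio implementation.
    """
    single_counter = defaultdict(list)
    fingerprint_counter = defaultdict(list)
    for strain, mutations in mutations_by_strain.items():
        if not mutations:
            continue  # strains identical to the representative add nothing
        for mutation in mutations:
            single_counter[mutation].append(strain)
        # A fingerprint is the full combination of mutations in one strain,
        # ordered by residue number so equal sets give equal keys
        fingerprint = tuple(sorted(mutations, key=lambda m: m[1]))
        fingerprint_counter[fingerprint].append(strain)
    return dict(single_counter), dict(fingerprint_counter)


toy = {
    'Strain1': [('A', 24, 'V')],
    'Strain2': [('A', 24, 'V'), ('R', 33, 'T')],
    'Strain4': [('A', 24, 'V')],
}
single_counter, fingerprint_counter = summarize_mutations(toy)
print(single_counter)
print(fingerprint_counter)
```

Note that Strain2 appears under A24V in single_counter but not under the A24V-only fingerprint, because its fingerprint also contains R33T.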
-
sequences
= None¶ DictList – Stored protein sequences which are related to this protein
-
set_representative_sequence
(force_rerun=False)[source]¶ Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.
Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
Parameters: force_rerun (bool) – Set to True to recheck stored sequences
Returns: Which sequence was set as representative
Return type: SeqProp
-
set_representative_structure
(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, clean=True, keep_chemicals=None, skip_large_structures=False, force_rerun=False)[source]¶ Set a representative structure from a structure in the structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure:
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.
Parameters: - seq_outdir (str) – Path to output directory of sequence alignment files, must be set if Protein directory was not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if Protein directory was not created initially
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- clean (bool) – If structure should be cleaned
- keep_chemicals (str, list) – Keep specified chemical names if structure is to be cleaned
- skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
- force_rerun (bool) – If sequence to structure alignment should be rerun
Returns: Representative structure from the list of structures. This is not a map to the original structure; it is copied and optionally cleaned from the original one.
Return type:
Todo
- Remedy large structure representative setting
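The allow_missing_on_termini example given in the parameter list (0.1 on a 100 AA reference, so residues 5 to 95 are checked) can be reproduced with a few lines of arithmetic. The sketch assumes the ignored fraction is split evenly between the two termini, as that example implies. The helper name checked_residue_window is hypothetical.

```python
def checked_residue_window(seq_length, allow_missing_on_termini):
    """Return (first, last) residue numbers still checked for modifications.

    Assumes the ignored fraction of the sequence is split evenly between the
    N- and C-termini. Hypothetical helper, not the ssbio implementation.
    """
    ignored_total = allow_missing_on_termini * seq_length
    ignored_per_end = int(round(ignored_total / 2))
    return ignored_per_end, seq_length - ignored_per_end


window = checked_residue_window(100, 0.1)
print(window)
```

Under the same assumption, the default of 0.2 would ignore ten residues at each end of a 100 AA reference sequence.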
-
structure_alignments
= None¶ DictList – Pairwise or multiple structure alignments - currently a placeholder
-
structure_dir
¶ str – Directory where structure related files are stored
-
structures
= None¶ DictList – Stored protein structures which are related to this protein
-
write_all_sequences_file
(outname, outdir=None)[source]¶ Write all the stored sequences as a single FASTA file. By default, sets IDs to model gene IDs.
Parameters: - outname (str) – Name of the output FASTA file without the extension
- outdir (str) – Path to output directory for the file, default is the sequences directory
Further reading¶
For examples in which tools from the Protein class have been used for analysis, please see the following:
[1] | Broddrick JT, Rubin BE, Welkie DG, Du N, Mih N, Diamond S, et al. Unique attributes of cyanobacterial metabolism revealed by improved genome-scale metabolic modeling and essential gene analysis. Proc Natl Acad Sci U S A. 2016;113: E8344–E8353. doi:10.1073/pnas.1613446113 |
[2] | Mih N, Brunk E, Bordbar A, Palsson BO. A Multi-scale Computational Platform to Mechanistically Assess the Effect of Genetic Variation on Drug Responses in Human Erythrocyte Metabolism. PLoS Comput Biol. 2016;12: e1005039. doi:10.1371/journal.pcbi.1005039 |
The StructProp Class¶

Introduction¶
This section will give an overview of the methods that can be executed for a single protein structure.
Tutorials¶
PDBProp - Working With a Single PDB Structure¶
This notebook gives a tutorial of the PDBProp object, specifically how chains are handled and how to map a sequence to it.
Imports¶
In [ ]:
from ssbio.databases.pdb import PDBProp
from ssbio.databases.uniprot import UniProtProp
In [ ]:
import sys
import logging
In [ ]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.DEBUG) # SET YOUR LOGGING LEVEL HERE #
In [ ]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Basic methods¶
In [ ]:
my_structure = PDBProp(ident='5T4Q', description='E. coli ATP synthase')
Downloading will:
- Download the file type of choice to the specified output directory
- Parse the PDB header file to fill out the metadata fields
In [ ]:
import tempfile
my_structure.download_structure_file(outdir=tempfile.gettempdir(), file_type='mmtf')
In [ ]:
my_structure.get_dict()
The mapped_chains attribute allows us to limit sequence analyses to specified chains (see the later section where we align a sequence to this structure). In this example, the ATP synthase is a complex of several protein chains, so if we are interested in a specific gene transcript, we can set the chains that correspond to it.
In [ ]:
# Chains A, B, and C make up ATP synthase subunit alpha - from the gene b3734 (UniProt ID P0ABB0)
my_structure.add_mapped_chain_ids(['A', 'B', 'C'])
Parsing the structure will parse the sequences of each chain and store them in the chains attribute. It will also return a Biopython Structure object, which opens up all methods available for structures in Biopython.
In [ ]:
parsed_structure = my_structure.parse_structure()
print(type(parsed_structure.structure))
print(type(parsed_structure.first_model))
Cleaning a structure does the following:
- Add missing chain identifiers to a PDB file
- Select a single chain if noted
- Remove alternate atom locations
- Add atom occupancies
- Add B (temperature) factors (default Biopython behavior)
In the example below, we will clean the structure so it only includes our mapped chains.
In [ ]:
cleaned_structure = my_structure.clean_structure(outdir='/tmp', keep_chains=my_structure.mapped_chains, force_rerun=True)
cleaned_structure
In [ ]:
# The original structure
my_structure.view_structure(recolor=False)
In [ ]:
# The cleaned structure
import nglview
nglview.show_structure_file(cleaned_structure)
FATCAT - Structure Similarity¶
This notebook shows how to run and parse FATCAT, a structural similarity calculator.
In [ ]:
import ssbio.protein.structure.properties.fatcat as fatcat
In [ ]:
import os
import os.path as op
import tempfile
ROOT_DIR = tempfile.gettempdir()
OUT_DIR = op.join(ROOT_DIR, 'fatcat_testing')
if not op.exists(OUT_DIR):
    os.mkdir(OUT_DIR)
FATCAT_SH = 'fatcat'
Pairwise¶
In [ ]:
fatcat_outfile = fatcat.run_fatcat(structure_path_1='../../ssbio/test/test_files/structures/12as-A_clean.pdb',
                                   structure_path_2='../../ssbio/test/test_files/structures/1a9x-A_clean.pdb',
                                   outdir=OUT_DIR,
                                   fatcat_sh=FATCAT_SH, print_cmd=True, force_rerun=True)
print('Output file:', fatcat_outfile)
In [ ]:
fatcat.parse_fatcat(fatcat_outfile)
All-by-all¶
In [ ]:
structs = ['../../ssbio/test/test_files/structures/12as-A_clean.pdb',
           '../../ssbio/test/test_files/structures/1af6-A_clean.pdb',
           '../../ssbio/test/test_files/structures/1a9x-A_clean.pdb']
In [ ]:
tm_scores = fatcat.run_fatcat_all_by_all(structs, fatcat_sh=FATCAT_SH, outdir=OUT_DIR)
tm_scores
In [ ]:
%matplotlib inline
import seaborn as sns
sns.heatmap(tm_scores)
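An all-by-all comparison like the one above amounts to scoring every unordered pair once and mirroring the result into a square matrix. The sketch below shows that pattern in plain Python. It assumes a symmetric score, and the scorer toy_score is only a placeholder for a real pairwise structure comparison. Whether run_fatcat_all_by_all works this way internally is not confirmed here.

```python
from itertools import combinations


def all_by_all(items, score_func):
    """Fill a square score matrix (nested dict), scoring each pair once."""
    matrix = {a: {b: None for b in items} for a in items}
    for item in items:
        matrix[item][item] = score_func(item, item)
    for a, b in combinations(items, 2):
        score = score_func(a, b)
        matrix[a][b] = score
        matrix[b][a] = score  # assumes the score is symmetric
    return matrix


def toy_score(a, b):
    """Placeholder scorer: Jaccard similarity of the characters in two IDs."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


scores = all_by_all(['12as-A', '1af6-A', '1a9x-A'], toy_score)
print(scores['12as-A']['1a9x-A'])
```

For n structures this runs n(n-1)/2 pairwise comparisons instead of n squared, which matters when each comparison launches an external program.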
Available functions¶
Sequence & structure-based predictions¶
Function | Description | Internal Python class used and functions provided | External software to install | Web server | Alternate external software to install |
---|---|---|---|---|---|
Homology modeling | Preparation scripts and parsers for executing homology modeling algorithms | I-TASSER | | | |
Transmembrane orientation | Prediction of transmembrane domains and orientation in a membrane | opm module | OPM | | |
Kinetic folding rate | Prediction of protein folding rates from amino acid sequence | kinetic_folding_rate module | FOLD-RATE |
Structure-based calculations or functions¶
Function | Description | Internal Python class used and functions provided | External software to install | Web server | Alternate external software to install |
---|---|---|---|---|---|
Secondary structure | Calculations of secondary structure | DSSP | STRIDE | | |
Solvent accessibilities | Calculations of per-residue absolute and relative solvent accessibilities | DSSP | FreeSASA | | |
Residue depths | Calculations of residue depths | MSMS | | | |
Structural similarity | Pairwise calculations of 3D structural similarity | fatcat module | FATCAT | | |
Various structure properties | Basic properties of the structure, such as distance measurements between residues or number of disulfide bridges | | | | |
Quality | Custom functions to allow ranking of structures by percent identity to a defined sequence, structure resolution, and other structure quality metrics | set_representative_structure function | | | |
Structure cleaning, mutating | Custom functions to allow for the preparation of structure files for molecular modeling, with options to remove hydrogens/waters/heteroatoms, select specific chains, or mutate specific residues. | AmberTools |
API¶
StructProp¶
-
class
ssbio.protein.structure.structprop.
StructProp
(ident, description=None, chains=None, mapped_chains=None, is_experimental=False, structure_path=None, file_type=None)[source]¶ Generic class to represent information for a protein structure.
The main utilities of this class are to:
- Provide access to the 3D coordinates using a Biopython Structure object through the method parse_structure.
- Run predictions and computations on the structure
- Analyze specific chains using the mapped_chains attribute
- Provide wrapper methods to nglview to view the structure in a Jupyter notebook
Parameters: - ident (str) – Unique identifier for this structure
- description (str) – Optional human-readable description
- chains (str, list) – Chain ID or list of IDs
- mapped_chains (str, list) – A chain ID or IDs to indicate what chains should be analyzed
- is_experimental (bool) – Flag to indicate if structure is an experimental or computational model
- structure_path (str) – Path to structure file
- file_type (str) – Type of structure file - pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz
-
add_chain_ids
(chains)[source]¶ Add chains by ID into the chains attribute
Parameters: chains (str, list) – Chain ID or list of IDs
-
add_mapped_chain_ids
(mapped_chains)[source]¶ Add chains by ID into the mapped_chains attribute
Parameters: mapped_chains (str, list) – Chain ID or list of IDs
-
add_residues_highlight_to_nglview
(view, structure_resnums, chain=None, res_color='red')[source]¶ Add a residue number or numbers to an NGLWidget view object.
Parameters: - view (NGLWidget) – NGLWidget view object
- structure_resnums (int, list) – Residue number(s) to highlight, structure numbering
- chain (str, list) – Chain ID or IDs that the residues belong to. If not provided, all chains in the mapped_chains attribute will be used. If that is also empty, an exception is raised.
- res_color (str) – Color to highlight residues with
-
add_scaled_residues_highlight_to_nglview
(view, structure_resnums, chain=None, color='red', unique_colors=False, opacity_range=(0.5, 1), scale_range=(0.7, 10))[source]¶ Add a list of residue numbers (which may contain repeating residues) to a view, or add a dictionary of residue numbers to counts. Size and opacity of added residues are scaled by counts.
Parameters: - view (NGLWidget) – NGLWidget view object
- structure_resnums (int, list, dict) – Residue number(s) to highlight, or a dictionary of residue number to frequency count
- chain (str, list) – Chain ID or IDs that the residues belong to. If not provided, all chains in the mapped_chains attribute will be used. If that is also empty, an exception is raised.
- color (str) – Color to highlight residues with
- unique_colors (bool) – If each mutation should be colored uniquely (will override color argument)
- opacity_range (tuple) – Min/max opacity values (residues that have higher frequency counts will be opaque)
- scale_range (tuple) – Min/max size values (residues that have higher frequency counts will be bigger)
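How counts could translate into opacity or size can be sketched as a rescaling of per-residue frequencies into a target range. Linear scaling is an assumption made for this illustration, since the documentation only states that higher counts give more opaque and bigger residues. The helper name scale_by_count is hypothetical.

```python
from collections import Counter


def scale_by_count(resnums, out_range):
    """Linearly rescale per-residue counts into out_range = (low, high).

    resnums may be a list with repeats or a {resnum: count} dictionary.
    Linear scaling is an assumption made for this sketch.
    """
    counts = dict(resnums) if isinstance(resnums, dict) else Counter(resnums)
    low, high = out_range
    min_count, max_count = min(counts.values()), max(counts.values())
    if min_count == max_count:
        # All residues equally frequent: give them the top of the range
        return {resnum: high for resnum in counts}
    span = max_count - min_count
    return {resnum: low + (count - min_count) / span * (high - low)
            for resnum, count in counts.items()}


opacities = scale_by_count([24, 24, 24, 33, 50, 50], out_range=(0.5, 1))
print(opacities)
```

The same function applied with out_range=(0.7, 10) would give sizes in the documented default scale_range.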
-
chains
= None¶ DictList – A DictList of chains that have their sequence stored in them, along with residue-specific
-
clean_structure
(out_suffix='_clean', outdir=None, force_rerun=False, remove_atom_alt=True, keep_atom_alt_id='A', remove_atom_hydrogen=True, add_atom_occ=True, remove_res_hetero=True, keep_chemicals=None, keep_res_only=None, add_chain_id_if_empty='X', keep_chains=None)[source]¶ Clean the structure file associated with this structure, and save it as a new file. Returns the file path.
Parameters: - out_suffix (str) – Suffix to append to original filename
- outdir (str) – Path to output directory
- force_rerun (bool) – If structure should be re-cleaned if a clean file exists already
- remove_atom_alt (bool) – Remove alternate positions
- keep_atom_alt_id (str) – If removing alternate positions, which alternate ID to keep
- remove_atom_hydrogen (bool) – Remove hydrogen atoms
- add_atom_occ (bool) – Add atom occupancy fields if not present
- remove_res_hetero (bool) – Remove all HETATMs
- keep_chemicals (str, list) – If removing HETATMs, keep specified chemical names
- keep_res_only (str, list) – Keep ONLY specified resnames, deletes everything else!
- add_chain_id_if_empty (str) – Add a chain ID if not present
- keep_chains (str, list) – Keep only these chains
Returns: Path to cleaned PDB file
Return type: str
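Two of the cleaning options, chain selection and removal of alternate atom locations, can be illustrated directly on PDB text. The sketch below relies on the fixed PDB column layout (altLoc in column 17, chain ID in column 22) and is a simplified illustration, not the ssbio implementation. The helper names filter_atom_lines and atom_line are hypothetical, and the toy records omit coordinates.

```python
def filter_atom_lines(pdb_lines, keep_chains=None, keep_atom_alt_id='A'):
    """Keep ATOM records for selected chains and drop other alternate locations.

    Relies on the fixed PDB columns: altLoc is column 17, chain ID is column 22.
    Simplified sketch, not the ssbio implementation.
    """
    kept = []
    for line in pdb_lines:
        if not line.startswith('ATOM'):
            continue  # HETATM and all other records are dropped in this sketch
        alt_loc, chain_id = line[16], line[21]
        if keep_chains is not None and chain_id not in keep_chains:
            continue
        if alt_loc not in (' ', keep_atom_alt_id):
            continue
        kept.append(line)
    return kept


def atom_line(serial, name, alt_loc, resname, chain, resseq):
    """Format the first 26 columns of an ATOM record (coordinates omitted)."""
    return f"{'ATOM':<6}{serial:>5} {name:<4}{alt_loc}{resname:>3} {chain}{resseq:>4}"


toy_lines = [
    atom_line(1, ' CA', ' ', 'ALA', 'A', 1),
    atom_line(2, ' CB', 'A', 'SER', 'A', 2),
    atom_line(3, ' CB', 'B', 'SER', 'A', 2),  # alternate location B, dropped
    atom_line(4, ' CA', ' ', 'GLY', 'D', 1),  # chain D, dropped
]
cleaned = filter_atom_lines(toy_lines, keep_chains=['A'])
print(len(cleaned))
```

In practice a structure parser such as Biopython handles these columns, so a manual filter like this is mainly useful for understanding what the options do.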
-
file_type
= None¶ str – Type of structure file
-
find_disulfide_bridges
(threshold=3.0)[source]¶ Run Biopython’s search_ss_bonds to find potential disulfide bridges for each chain and store in ChainProp.
-
get_dict_with_chain
(chain, only_keys=None, chain_keys=None, exclude_attributes=None, df_format=False)[source]¶ get_dict method which incorporates attributes found in a specific chain. Does not overwrite any attributes in the original StructProp.
Parameters: - chain –
- only_keys –
- chain_keys –
- exclude_attributes –
- df_format –
Returns: attributes of StructProp + the chain specified
Return type: dict
-
get_dssp_annotations
(outdir, force_rerun=False)[source]¶ Run DSSP on this structure and store the DSSP annotations in the corresponding ChainProp SeqRecords
Calculations are stored in the ChainProp’s letter_annotations at the following keys:
- SS-dssp
- RSA-dssp
- ASA-dssp
- PHI-dssp
- PSI-dssp
Parameters: - outdir (str) – Path to where DSSP dataframe will be stored.
- force_rerun (bool) – If DSSP results should be recalculated
Todo
- Also parse global properties, like total accessible surface area. Don’t think Biopython parses those?
-
get_freesasa_annotations
(outdir, include_hetatms=False, force_rerun=False)[source]¶ Run freesasa on this structure and store the calculated properties in the corresponding ChainProps
-
get_msms_annotations
(outdir, force_rerun=False)[source]¶ Run MSMS on this structure and store the residue depths/ca depths in the corresponding ChainProp SeqRecords
-
get_structure_seqs
(model)[source]¶ Gather chain sequences and store in their corresponding ChainProp objects in the chains attribute.
Parameters: model (Model) – Biopython Model object of the structure you would like to parse
-
is_experimental
= None¶ bool – Flag to note if this structure is an experimental model or a homology model
-
load_structure_path
(structure_path, file_type)[source]¶ Load a structure file and provide pointers to its location
Parameters: - structure_path (str) – Path to structure file
- file_type (str) – Type of structure file
-
mapped_chains
= None¶ list – A simple list of chain IDs (strings) that will be used to subset analyses
-
parse_structure
(store_in_memory=False)[source]¶ Read the 3D coordinates of a structure file and return it as a Biopython Structure object. Also create ChainProp objects in the chains attribute for each chain in the first model.
Parameters: store_in_memory (bool) – If the Biopython Structure object should be stored in the attribute structure.
Returns: Biopython Structure object
Return type: Structure
-
parsed
= None¶ bool – Simple flag to track if this structure has had its structure + chain sequences parsed
-
structure
= None¶ Structure – Biopython Structure object, only used if the store_in_memory option of parse_structure is set to True
-
structure_file
= None¶ str – Name of the structure file
-
view_structure
(only_chains=None, opacity=1.0, recolor=False, gui=False)[source]¶ Use NGLviewer to display a structure in a Jupyter notebook
Parameters: - only_chains (str, list) – Chain ID or IDs to display
- opacity (float) – Opacity of the structure
- recolor (bool) – If structure should be cleaned and recolored to silver
- gui (bool) – If the NGLview GUI should show up
Returns: NGLviewer object
- Provide access to the 3D coordinates using a Biopython Structure object through the method
The SeqProp Class¶

Introduction¶
This section will give an overview of the methods that can be executed for a single protein sequence.
Tutorials¶
SeqProp - Protein Sequence Properties¶
This notebook gives an overview of the available calculations for properties of a single protein sequence.
Note
See ssbio.protein.sequence.seqprop.SeqProp for a description of all the available attributes and functions.
Imports¶
In [ ]:
import sys
import logging
import os.path as op
In [ ]:
# Import the SeqProp class
from ssbio.protein.sequence.seqprop import SeqProp
In [ ]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. DEBUG is the most verbose.
- CRITICAL - Only really important messages shown
- ERROR - Major errors
- WARNING - Warnings that don’t affect running of the pipeline
- INFO (default) - Info such as the number of structures mapped per gene
- DEBUG - Really detailed information that will print out a lot of stuff
In [ ]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [ ]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
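To see how the chosen level filters messages, the sketch below emits one message per level and records which ones get through. It uses a separate logger (the name `ssbio_demo` is a hypothetical choice) writing to an in-memory stream, so it does not disturb the root logger configured above.

```python
import io
import logging

def messages_shown_at(level):
    """Return the level names that are emitted when the logger is set to `level`."""
    stream = io.StringIO()
    demo_logger = logging.getLogger('ssbio_demo')  # hypothetical logger name
    demo_logger.handlers = [logging.StreamHandler(stream)]
    demo_logger.propagate = False  # keep the demo out of the root logger's output
    demo_logger.setLevel(level)
    for lvl in (logging.DEBUG, logging.INFO, logging.WARNING,
                logging.ERROR, logging.CRITICAL):
        # Use the level name itself as the message text
        demo_logger.log(lvl, logging.getLevelName(lvl))
    return stream.getvalue().split()

shown_at_warning = messages_shown_at(logging.WARNING)
shown_at_debug = messages_shown_at(logging.DEBUG)
```

With the level set to WARNING only the WARNING, ERROR, and CRITICAL messages appear, while DEBUG lets all five through.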
Initialization of the project¶
Set these two things:
- PROTEIN_ID - Your protein ID
- PROTEIN_SEQ - Your protein sequence
In [ ]:
# SET IDS HERE
PROTEIN_ID = 'YIAJ_ECOLI'
PROTEIN_SEQ = 'MGKEVMGKKENEMAQEKERPAGSQSLFRGLMLIEILSNYPNGCPLAHLSELAGLNKSTVHRLLQGLQSCGYVTTAPAAGSYRLTTKFIAVGQKALSSLNIIHIAAPHLEALNIATGETINFSSREDDHAILIYKLEPTTGMLRTRAYIGQHMPLYCSAMGKIYMAFGHPDYVKSYWESHQHEIQPLTRNTITELPAMFDELAHIRESGAAMDREENELGVSCIAVPVFDIHGRVPYAVSISLSTSRLKQVGEKNLLKPLRETAQAISNELGFTVRDDLGAIT'
In [ ]:
# Create the SeqProp object
my_seq = SeqProp(id=PROTEIN_ID, seq=PROTEIN_SEQ)
-
SeqProp.
write_fasta_file
(outfile, force_rerun=False)[source] Write a FASTA file for the protein sequence; seq will now load directly from this file.
Parameters: - outfile (str) – Path to new FASTA file to be written to
- force_rerun (bool) – If an existing file should be overwritten
In [ ]:
# Write temporary FASTA file for property calculations that require FASTA file as input
import tempfile
ROOT_DIR = tempfile.gettempdir()
my_seq.write_fasta_file(outfile=op.join(ROOT_DIR, 'tmp.fasta'), force_rerun=True)
my_seq.sequence_path
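For readers without ssbio installed, the write-unless-it-exists behavior implied by `force_rerun` can be sketched with the standard library alone. The function `write_fasta`, its `width` parameter, and the ID `DEMO_ID` are hypothetical stand-ins, not the ssbio implementation; the sequence is the first 74 residues of the example protein above.

```python
import os
import tempfile

def write_fasta(seq_id, seq_str, outfile, force_rerun=False, width=60):
    """Write one sequence in FASTA format; skip if the file exists unless force_rerun.

    Hypothetical stand-in for illustration; not the ssbio implementation.
    """
    if os.path.exists(outfile) and not force_rerun:
        return outfile
    with open(outfile, 'w') as handle:
        handle.write('>{}\n'.format(seq_id))
        # Wrap the sequence at a fixed line width, a common FASTA convention
        for start in range(0, len(seq_str), width):
            handle.write(seq_str[start:start + width] + '\n')
    return outfile

demo_seq = 'MGKEVMGKKENEMAQEKERPAGSQSLFRGLMLIEILSNYPNGCPLAHLSELAGLNKSTVHRLLQGLQSCGYVTT'
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.fasta')
write_fasta('DEMO_ID', demo_seq, demo_path)
with open(demo_path) as handle:
    fasta_lines = handle.read().splitlines()
```

Calling `write_fasta` a second time with different content leaves the file untouched unless `force_rerun=True` is passed.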
Computing and storing protein properties¶
A SeqProp object is simply an extension of the Biopython SeqRecord object. Global properties which describe or summarize the entire protein sequence are stored in the annotations attribute, while local residue-specific properties are stored in the letter_annotations attribute.
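The difference between the two stores can be illustrated without Biopython. In this sketch plain dictionaries stand in for the two attributes; the key names ending in `-demo` and the hydrophobic residue grouping are assumptions made only for the example.

```python
# Plain-Python stand-in for the two annotation stores; no Biopython required.
sequence = 'MGKEVMGKKE'  # first ten residues of the example protein above

HYDROPHOBIC = set('AVILMFWC')  # assumed residue grouping, for illustration only

# Local property: one value per residue, same length as the sequence
letter_annotations = {
    'hydrophobic-demo': [aa in HYDROPHOBIC for aa in sequence],
}

# Global property: a single value summarizing the whole sequence
annotations = {
    'percent_hydrophobic-demo':
        100.0 * sum(letter_annotations['hydrophobic-demo']) / len(sequence),
}
```

Each per-residue list must have exactly one entry per residue, which is the constraint Biopython's restricted dictionary for `letter_annotations` enforces.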
-
SeqProp.
get_biopython_pepstats
()[source] Run Biopython’s built in ProteinAnalysis module and store statistics in the annotations attribute.
In [ ]:
# Global properties using the Biopython ProteinAnalysis module
my_seq.get_biopython_pepstats()
{k:v for k,v in my_seq.annotations.items() if k.endswith('-biop')}
-
SeqProp.
get_emboss_pepstats
()[source] Run the EMBOSS pepstats program on the protein sequence.
Stores statistics in the annotations attribute. Saves a .pepstats file of the results where the sequence file is located.
In [ ]:
# Global properties from the EMBOSS pepstats program
my_seq.get_emboss_pepstats()
{k:v for k,v in my_seq.annotations.items() if k.endswith('-pepstats')}
-
SeqProp.
get_aggregation_propensity
(email, password, cutoff_v=5, cutoff_n=5, run_amylmuts=False, outdir=None)[source] Run the AMYLPRED2 web server to calculate the aggregation propensity of this protein sequence, which is the number of aggregation-prone segments on the unfolded protein sequence.
Stores statistics in the annotations attribute, under the key aggprop-amylpred.
See ssbio.protein.sequence.properties.aggregation_propensity for instructions and details.
In [ ]:
# Aggregation propensity - the predicted number of aggregation-prone segments on an unfolded protein sequence
my_seq.get_aggregation_propensity(outdir=ROOT_DIR, email='nmih@ucsd.edu', password='ssbiotest', cutoff_v=5, cutoff_n=5, run_amylmuts=False)
{k:v for k,v in my_seq.annotations.items() if k.endswith('-amylpred')}
-
SeqProp.
get_kinetic_folding_rate
(secstruct, at_temp=None)[source] Run the FOLD-RATE web server to calculate the kinetic folding rate given an amino acid sequence and its structural classification (alpha/beta/mixed).
Stores statistics in the annotations attribute, under the key kinetic_folding_rate_<TEMP>-foldrate.
See ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate() for instructions and details.
In [ ]:
# Kinetic folding rate - the predicted rate of folding for this protein sequence
secstruct_class = 'mixed'
my_seq.get_kinetic_folding_rate(secstruct=secstruct_class)
{k:v for k,v in my_seq.annotations.items() if k.endswith('-foldrate')}
-
SeqProp.
get_thermostability
(at_temp)[source] Run the thermostability calculator using either the Dill or Oobatake methods.
Stores calculated (dG, Keq) tuple in the annotations attribute, under the key thermostability_<TEMP>-<METHOD_USED>.
See ssbio.protein.sequence.properties.thermostability.get_dG_at_T() for instructions and details.
In [ ]:
# Thermostability - prediction of free energy of unfolding dG from protein sequence
# Stores (dG, Keq)
my_seq.get_thermostability(at_temp=32.0)
my_seq.get_thermostability(at_temp=37.0)
my_seq.get_thermostability(at_temp=42.0)
{k:v for k,v in my_seq.annotations.items() if k.startswith('thermostability_')}
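The stored tuple pairs a free energy with an equilibrium constant. As a rough guide to how the two quantities relate, the sketch below uses the standard thermodynamic relation K = exp(-ΔG / RT). It assumes ΔG is given in kcal/mol and the temperature in degrees Celsius; the function name `keq_from_dg` is hypothetical, and the units and sign convention ssbio actually uses are not stated here.

```python
import math

R_KCAL = 1.9872e-3  # gas constant in kcal/(mol*K)

def keq_from_dg(dg_kcal_per_mol, temp_celsius):
    """Equilibrium constant from a free energy via K = exp(-dG / (R * T))."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(-dg_kcal_per_mol / (R_KCAL * temp_kelvin))

keq_at_zero = keq_from_dg(0.0, 37.0)  # dG = 0 gives K = 1
keq_stable = keq_from_dg(5.0, 37.0)   # synthetic dG value; gives K < 1
```

Under this convention a positive free energy of unfolding gives K below 1, meaning the folded state is favored.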
Available functions¶
Sequence-based predictions¶
Function | Description | Internal Python class used and functions provided | External software to install | Web server | Alternate external software to install |
---|---|---|---|---|---|
Secondary structure and solvent accessibilities | Predictions of secondary structure and relative solvent accessibilities per residue | scratch module | SCRATCH | | |
Thermostability | Free energy of unfolding (ΔG), adapted from Oobatake (Oobatake & Ooi 1993) and Dill (Dill et al. 2011) | thermostability module | | | |
Transmembrane domains | Prediction of transmembrane domains from sequence | tmhmm module | TMHMM | | |
Aggregation propensity | Consensus method to predict the aggregation propensity of proteins, specifically the number of aggregation-prone segments on an unfolded protein sequence | aggregation_propensity module | | AMYLPRED2 | |
Sequence-based calculations¶
Function | Description | Internal Python class used and functions provided | External software to install | Web server | Alternate external software to install |
---|---|---|---|---|---|
Various sequence properties | Basic properties of the sequence, such as percent of polar, non-polar, hydrophobic or hydrophilic residues. | | EMBOSS pepstats | | |
Sequence alignment | Basic functions to run pairwise or multiple sequence alignments | | EMBOSS needle | | |
API¶
SeqProp¶
-
class
ssbio.protein.sequence.seqprop.
SeqProp
(seq, id, name='<unknown name>', description='<unknown description>', sequence_path=None, metadata_path=None, feature_path=None)[source]¶ Generic class to represent information for a protein sequence.
Extends the Biopython SeqRecord class. The main functionality added is the ability to set and load directly from sequence, metadata, and feature files. Additionally, methods are provided to calculate and store sequence properties in the annotations and letter_annotations fields of a SeqProp. These can then be accessed for a range of residue numbers.
-
id
¶ str – Unique identifier for this protein sequence
-
seq
¶ Seq – Protein sequence as a Biopython Seq object
-
name
¶ str – Optional name for this sequence
-
description
¶ str – Optional description for this sequence
-
bigg
¶ str, list – BiGG IDs mapped to this sequence
-
kegg
¶ str, list – KEGG IDs mapped to this sequence
-
refseq
¶ str, list – RefSeq IDs mapped to this sequence
-
uniprot
¶ str, list – UniProt IDs mapped to this sequence
-
gene_name
¶ str, list – Gene names mapped to this sequence
-
pdbs
¶ list – PDB IDs mapped to this sequence
-
go
¶ str, list – GO terms mapped to this sequence
-
pfam
¶ str, list – PFAMs mapped to this sequence
-
ec_number
¶ str, list – EC numbers mapped to this sequence
-
sequence_file
¶ str – FASTA file for this sequence
-
metadata_file
¶ str – Metadata file (any format) for this sequence
-
feature_file
¶ str – GFF file for this sequence
-
features
¶ list – List of protein sequence features, which define regions of the protein
-
annotations
¶ dict – Annotations of this protein sequence, which summarize global properties
-
letter_annotations
¶ RestrictedDict – Residue-level annotations, which describe single residue properties
Todo
- Properly inherit methods from the Object class…
-
add_point_feature
(resnum, feat_type=None, feat_id=None)[source]¶ Add a feature to the features list describing a single residue.
Parameters: - resnum (int) – Protein sequence residue number
- feat_type (str, optional) – Optional description of the feature type (e.g. ‘catalytic residue’)
- feat_id (str, optional) – Optional ID of the feature type (e.g. ‘TM1’)
-
add_region_feature
(start_resnum, end_resnum, feat_type=None, feat_id=None)[source]¶ Add a feature to the features list describing a region of the protein sequence.
Parameters: - start_resnum (int) – Start residue number of the protein sequence feature
- end_resnum (int) – End residue number of the protein sequence feature
- feat_type (str, optional) – Optional description of the feature type (e.g. ‘binding domain’)
- feat_id (str, optional) – Optional ID of the feature type (e.g. ‘TM1’)
-
blast_pdb
(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]¶ BLAST this sequence to the PDB
-
equal_to
(seq_prop)[source]¶ Test if the sequence is equal to another SeqProp’s sequence
Parameters: seq_prop – SeqProp object
Returns: If the sequences are the same
Return type: bool
-
feature_path_unset
()[source]¶ Copy features to memory and remove the association of the feature file.
-
features
list – Get the features stored in memory or in the GFF file
-
get_aggregation_propensity
(email, password, cutoff_v=5, cutoff_n=5, run_amylmuts=False, outdir=None)[source]¶ Run the AMYLPRED2 web server to calculate the aggregation propensity of this protein sequence, which is the number of aggregation-prone segments on the unfolded protein sequence.
Stores statistics in the annotations attribute, under the key aggprop-amylpred.
See ssbio.protein.sequence.properties.aggregation_propensity for instructions and details.
-
get_biopython_pepstats
()[source]¶ Run Biopython’s built in ProteinAnalysis module and store statistics in the annotations attribute.
-
get_dict
(only_attributes=None, exclude_attributes=None, df_format=False)[source]¶ Get a dictionary of this object’s attributes. Optional format for storage in a Pandas DataFrame.
Parameters: - only_attributes (str, list) – Attributes that should be returned. If not provided, all are returned.
- exclude_attributes (str, list) – Attributes that should be excluded.
- df_format (bool) – If dictionary values should be formatted for a dataframe (everything possible is transformed into strings, int, or float - if something can’t be transformed it is excluded)
Returns: Dictionary of attributes
Return type: dict
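How the three options might interact can be sketched with a small stand-in. The function `filter_attributes`, the class `DemoProp`, and the choice to join lists with `;` are assumptions for illustration; they show one plausible reading of the parameter descriptions above rather than ssbio's code.

```python
def filter_attributes(obj, only_attributes=None, exclude_attributes=None, df_format=False):
    """Hypothetical stand-in showing how the three options could interact."""
    def as_list(value):
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    only = as_list(only_attributes)
    exclude = as_list(exclude_attributes)
    result = {}
    for key, value in vars(obj).items():
        if only and key not in only:
            continue
        if key in exclude:
            continue
        if df_format and not isinstance(value, (str, int, float)):
            # Flatten simple containers to strings; drop anything else
            if isinstance(value, (list, tuple)):
                value = ';'.join(str(item) for item in value)
            else:
                continue
        result[key] = value
    return result

class DemoProp:
    """Minimal object with a few attributes; all values are placeholders."""
    def __init__(self):
        self.id = 'DEMO'
        self.seq_len = 10
        self.pdbs = ['1abc', '2xyz']  # placeholder IDs, not real mappings
        self.handle = object()        # something that cannot go in a dataframe

summary = filter_attributes(DemoProp(), exclude_attributes='seq_len', df_format=True)
```

In this sketch `summary` keeps `id`, flattens `pdbs` to a single string, and drops both the excluded `seq_len` and the untransformable `handle`.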
-
get_emboss_pepstats
()[source]¶ Run the EMBOSS pepstats program on the protein sequence.
Stores statistics in the annotations attribute. Saves a .pepstats file of the results where the sequence file is located.
-
get_kinetic_folding_rate
(secstruct, at_temp=None)[source]¶ Run the FOLD-RATE web server to calculate the kinetic folding rate given an amino acid sequence and its structural classification (alpha/beta/mixed).
Stores statistics in the annotations attribute, under the key kinetic_folding_rate_<TEMP>-foldrate.
See ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate() for instructions and details.
-
get_residue_annotations
(start_resnum, end_resnum=None)[source]¶ Retrieve letter annotations for a residue or a range of residues
Parameters: - start_resnum (int) – Residue number
- end_resnum (int) – Optional residue number, specify if a range is desired
Returns: Letter annotations for this residue or residues
Return type: dict
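Retrieving annotations by residue number comes down to converting residue numbers into Python slice indices. The sketch below assumes residue numbering starts at 1 and that the end residue is included; the function name and the annotation values are illustrative, while the key names follow the DSSP keys used elsewhere in this documentation.

```python
def residue_annotations(letter_annotations, start_resnum, end_resnum=None):
    """Slice per-residue annotations by 1-based, inclusive residue numbers."""
    if end_resnum is None:
        end_resnum = start_resnum
    if not 1 <= start_resnum <= end_resnum:
        raise ValueError('residue numbers must satisfy 1 <= start <= end')
    # Residue 1 lives at Python index 0, and the end residue is included
    return {key: values[start_resnum - 1:end_resnum]
            for key, values in letter_annotations.items()}

# Synthetic annotation values for a ten-residue sequence
demo_annotations = {
    'SS-dssp': 'HHHHTTEEEE',
    'RSA-dssp': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}
single = residue_annotations(demo_annotations, 5)
region = residue_annotations(demo_annotations, 2, 4)
```

Asking for residue 5 alone returns one value per key, and asking for residues 2 through 4 returns three.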
-
get_thermostability
(at_temp)[source]¶ Run the thermostability calculator using either the Dill or Oobatake methods.
Stores calculated (dG, Keq) tuple in the annotations attribute, under the key thermostability_<TEMP>-<METHOD_USED>.
See ssbio.protein.sequence.properties.thermostability.get_dG_at_T() for instructions and details.
-
num_pdbs
¶ int – Report the number of PDB IDs stored in the pdbs attribute
-
seq
Seq – Dynamically loaded Seq object from the sequence file
-
seq_len
¶ int – Get the sequence length
-
seq_str
¶ str – Get the sequence formatted as a string
-
Index: Software¶
This section provides a simple list of external software that may be required to carry out specific computations on a protein sequence or structure. This list only contains software that is wrapped with ssbio – there may be other programs that carry out these same functions, and do it better (or worse)!
Tables describing the functionality of these software packages in relation to their input, as well as links to internal wrappers and parsers, are found on The SeqProp Class and The StructProp Class pages.
Protein structure predictions¶
Homology modeling¶
I-TASSER¶

- Home page: I-TASSER
- Download link: I-TASSER Suite
I-TASSER (Iterative Threading ASSEmbly Refinement) is a program for protein homology modeling and functional prediction from a protein sequence. The I-TASSER suite provides numerous other tools, such as those for ligand-binding site prediction, model refinement, secondary structure prediction, B-factor estimation, and more. ssbio mainly provides tools to run and parse I-TASSER homology modeling results, as well as COACH consensus binding site predictions (optionally with EC number and GO term predictions). Scripts are also provided to automate homology modeling on a large scale using TORQUE or Slurm job schedulers in a cluster computing environment.
Note
These instructions were created on an Ubuntu 17.04 system.
Note
Read the README on the I-TASSER Suite page for the most up-to-date instructions
Make sure you have Java installed and it can be run from the command line with java
Head to the I-TASSER download page and register for a license (academic only) to get a password emailed to you
Log in to the I-TASSER download page and download the archive
Unpack the software archive into a convenient directory - a library should also be downloaded to this directory
Run download_lib.pl to then download the library files - this will take some time:
/path/to/<I-TASSER_directory>/download_lib.pl -libdir ITLIB
Now, I-TASSER can be run according to the README under section 4
To enable GO term predictions…
- under construction…
Tip: to update template libraries, create a new command in your crontab (first run crontab -e), and make sure to replace <USERNAME> with your username:
0 4 * * 1,5 <USERNAME> /path/to/I-TASSER4.4/download_lib.pl -libdir /path/to/ITLIB
That will run the library update at 4 am every Monday and Friday.
To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()
What is a homology model?
- A predicted 3D structure model of a protein sequence. Models can be template-based, when they are based on an existing experimental structure; or ab initio, generated without a template. Generally, ab initio models are much less reliable.
Can I just run I-TASSER using their web server and parse those results with ssbio?
- Not yet, but you can manually input the model1.pdb file as a new structure for now.
How do I cite I-TASSER?
- Roy A, Kucukural A & Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5: 725–738 Available at: http://dx.doi.org/10.1038/nprot.2010.5
How do I run I-TASSER with TORQUE or Slurm job schedulers?
- under construction…
I’m having issues running I-TASSER…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
class
ssbio.protein.structure.homology.itasser.itasserprep.
ITASSERPrep
(ident, seq_str, root_dir, itasser_path, itlib_path, execute_dir=None, light=True, runtype='local', print_exec=False, java_home=None, binding_site_pred=False, ec_pred=False, go_pred=False, additional_options=None, job_scheduler_header=None)[source]¶ Prepare a protein sequence for an I-TASSER homology modeling run.
The main utilities of this class are to:
- Allow for the input of a protein sequence string and paths to I-TASSER to create execution scripts
- Automate large-scale homology modeling efforts by creating Slurm or TORQUE job scheduling scripts
Parameters: - ident – Identifier for your sequence. Will be used as the global ID (folder name, sequence name)
- seq_str – Sequence in string format
- root_dir – Local directory where I-TASSER folder will be created
- itasser_path – Path to I-TASSER folder, e.g. ‘~/software/I-TASSER4.4’
- itlib_path – Path to ITLIB folder, e.g. ‘~/software/ITLIB’
- execute_dir – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
- light – If simulations should be limited to 5 runs
- runtype – How you will be running I-TASSER - local, slurm, or torque
- print_exec – If the execution script should be printed out
- java_home – Path to Java executable
- binding_site_pred – If binding site predictions should be run
- ec_pred – If EC number predictions should be run
- go_pred – If GO term predictions should be run
- additional_options – Any other additional I-TASSER options, appended to the command
- job_scheduler_header – Any job scheduling options, prepended as a header to the file
-
class
ssbio.protein.structure.homology.itasser.itasserprop.
ITASSERProp
(ident, original_results_path, coach_results_folder='model1/coach', model_to_use='model1')[source]¶ Parse all available information for a local I-TASSER modeling run.
Initializes a class to collect I-TASSER modeling information and optionally copy results to a new directory. See https://zhanglab.ccmb.med.umich.edu/papers/2015_1.pdf for detailed information.
Parameters: - ident (str) – ID of I-TASSER modeling run
- original_results_path (str) – Path to I-TASSER modeling folder
- coach_results_folder (str) – Path to original COACH results
- model_to_use (str) – Which I-TASSER model to use. Default is “model1”
-
copy_results
(copy_to_dir, rename_model_to=None, force_rerun=False)[source]¶ Copy the raw information from I-TASSER modeling to a new folder.
Copies all files in the list _attrs_to_copy.
Parameters: - copy_to_dir (str) – Directory to copy the minimal set of results per sequence.
- rename_model_to (str) – New file name (without extension)
- force_rerun (bool) – If existing models and results should be overwritten.
-
get_dict
(only_attributes=None, exclude_attributes=None, df_format=False)[source]¶ Summarize the I-TASSER run in a dictionary containing modeling results and top predictions from COACH
Parameters: - only_attributes (str, list) – Attributes that should be returned. If not provided, all are returned.
- exclude_attributes (str, list) – Attributes that should be excluded.
- df_format (bool) – If dictionary values should be formatted for a dataframe (everything possible is transformed into strings, int, or float - if something can’t be transformed it is excluded)
Returns: Dictionary of attributes
Return type: dict
-
ssbio.protein.structure.homology.itasser.itasserprop.
parse_bfp_dat
(infile)[source]¶ Parse the B-factor predictions in BFP.dat
Parameters: infile (str) – Path to BFP.dat
Returns: List of B-factor predictions for all residues
Return type: list
-
ssbio.protein.structure.homology.itasser.itasserprop.
parse_coach_bsites_inf
(infile)[source]¶ Parse the Bsites.inf output file of COACH and return a list of rank-ordered binding site predictions
Bsites.inf contains the summary of COACH clustering results after all other prediction algorithms have finished. For each site (cluster), there are three lines:
- Line 1: site number, c-score of coach prediction, cluster size
- Line 2: algorithm, PDB ID, ligand ID, center of binding site (cartesian coordinates), c-score of the algorithm’s prediction, binding residues from single template
- Line 3: Statistics of ligands in the cluster
C-score information:
- “In our training data, a prediction with C-score>0.35 has average false positive and false negative rates below 0.16 and 0.13, respectively.” (https://zhanglab.ccmb.med.umich.edu/COACH/COACH.pdf)
Parameters: infile (str) – Path to Bsites.inf
Returns: Ranked list of dictionaries, keys defined below
- site_num: cluster which is the consensus binding site
- c_score: confidence score of the cluster prediction
- cluster_size: number of predictions within this cluster
- algorithm: main? algorithm used to make the prediction
- pdb_template_id: PDB ID of the template used to make the prediction
- pdb_template_chain: chain of the PDB which has the ligand
- pdb_ligand: predicted ligand to bind
- binding_location_coords: centroid of the predicted ligand position in the homology model
- c_score_method: confidence score for the main algorithm
- binding_residues: predicted residues to bind the ligand
- ligand_cluster_counts: number of predictions per ligand
Return type: list
-
ssbio.protein.structure.homology.itasser.itasserprop.
parse_coach_ec
(infile)[source]¶ Parse the EC.dat output file of COACH and return a list of rank-ordered EC number predictions
EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues
Parameters: infile (str) – Path to EC.dat
Returns: Ranked list of dictionaries, keys defined below
- pdb_template_id: PDB ID of the template used to make the prediction
- pdb_template_chain: chain of the PDB which has the ligand
- tm_score: TM-score of the template to the model (similarity score)
- rmsd: RMSD of the template to the model (also a measure of similarity)
- seq_ident: percent sequence identity
- seq_coverage: percent sequence coverage
- c_score: confidence score of the EC prediction
- ec_number: predicted EC number
- binding_residues: predicted residues to bind the ligand
Return type: list
-
ssbio.protein.structure.homology.itasser.itasserprop.
parse_coach_ec_df
(infile)[source]¶ Parse the EC.dat output file of COACH and return a dataframe of results
EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues
Parameters: infile (str) – Path to EC.dat
Returns: Pandas DataFrame summarizing EC number predictions
Return type: DataFrame
-
ssbio.protein.structure.homology.itasser.itasserprop.
parse_coach_go
(infile)[source]¶ Parse a GO output file from COACH and return a rank-ordered list of GO term predictions
The columns in all files are: GO terms, Confidence score, Name of GO terms. The files are:
- GO_MF.dat - GO terms in ‘molecular function’
- GO_BP.dat - GO terms in ‘biological process’
- GO_CC.dat - GO terms in ‘cellular component’
Parameters: infile (str) – Path to any COACH GO prediction file
Returns: Organized dataframe of results, columns defined below
- go_id: GO term ID
- go_term: GO term text
- c_score: confidence score of the GO prediction
Return type: Pandas DataFrame
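A minimal sketch of reading the three columns follows, using plain dictionaries instead of a DataFrame. It assumes the columns are whitespace-separated in the order listed above, with the term name (which may contain spaces) last. The helper names are hypothetical and the confidence scores in the sample lines are made up.

```python
def parse_go_line(line):
    """Split one line into GO ID, confidence score, and GO term name.

    Assumes whitespace-separated columns with the term name last.
    Hypothetical helper; not the ssbio parser.
    """
    # maxsplit=2 keeps spaces inside the term name intact
    go_id, c_score, go_term = line.strip().split(None, 2)
    return {'go_id': go_id, 'go_term': go_term, 'c_score': float(c_score)}

def parse_go_lines(lines):
    """Parse all non-empty lines and rank them by descending confidence."""
    rows = [parse_go_line(line) for line in lines if line.strip()]
    return sorted(rows, key=lambda row: row['c_score'], reverse=True)

# Synthetic lines for illustration; the scores are made up
demo_lines = [
    'GO:0003677 0.42 DNA binding',
    'GO:0003700 0.71 DNA-binding transcription factor activity',
]
ranked = parse_go_lines(demo_lines)
```

Limiting the split to two separators is the design choice that matters here, since a plain `split()` would break multi-word term names apart.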
-
ssbio.protein.structure.homology.itasser.itasserprop.
parse_cscore
(infile)[source]¶ Parse the cscore file to return a dictionary of scores.
Parameters: infile (str) – Path to cscore
Returns: Dictionary of scores
Return type: dict
-
ssbio.protein.structure.homology.itasser.itasserprop.
parse_exp_dat
(infile)[source]¶ Parse the solvent accessibility predictions in exp.dat
Parameters: infile (str) – Path to exp.dat
Returns: List of solvent accessibility predictions for all residues
Return type: list
-
ssbio.protein.structure.homology.itasser.itasserprop.
parse_init_dat
(infile)[source]¶ Parse the main init.dat file which contains the modeling results
The first line of the file init.dat contains stuff like:
"120 easy 40 8"
The other lines look like this:
" 161 11.051 1 1guqA MUSTER"
and getting the first 10 gives you the top 10 templates used in modeling
Parameters: infile (str) – Path to init.dat
Returns: Dictionary of parsed information
Return type: dict
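Using only the two example lines quoted above, the tokenizing step can be sketched as follows. The meaning given to each field is a guess made for this sketch: the fourth field is assumed to be a four-character PDB ID followed by a chain ID, and the fifth the name of the threading program. The function name and the dictionary keys are hypothetical.

```python
def split_init_dat_lines(lines, max_templates=10):
    """Tokenize the header line and template lines of init.dat-style text.

    The meaning assigned to each field is an assumption made for this sketch.
    """
    header = lines[0].split()
    templates = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip anything that does not look like a template line
        pdb_chain = fields[3]
        templates.append({
            'pdb_id': pdb_chain[:4],  # assumed: first four characters are the PDB ID
            'chain': pdb_chain[4:],   # assumed: the remainder is the chain ID
            'program': fields[4],     # assumed: name of the threading program
        })
        if len(templates) == max_templates:
            break
    return header, templates

# The two example lines quoted in the description above
demo = ['120 easy 40 8', ' 161 11.051 1 1guqA MUSTER']
header_fields, top_templates = split_init_dat_lines(demo)
```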
Transmembrane orientations¶
OPM¶
OPM is a program to predict the location of transmembrane planes in protein structures, utilizing the atomic coordinates. ssbio provides a wrapper to submit PDB files to the web server, cache, and parse the results.
- Use the function ssbio.protein.structure.properties.opm.run_ppm_server() to upload a PDB file to the PPM server.
How can I install OPM?
- OPM is only available as a web server. ssbio provides a wrapper for the web server and allows you to submit protein structures to it along with caching the output files.
How do I cite OPM?
- Lomize MA, Pogozheva ID, Joo H, Mosberg HI & Lomize AL (2012) OPM database and PPM web server: resources for positioning of proteins in membranes. Nucleic Acids Res. 40: D370–6 Available at: http://dx.doi.org/10.1093/nar/gkr703
I’m having issues running OPM…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
ssbio.protein.structure.properties.opm.
run_ppm_server
(pdb_file, outfile, force_rerun=False)[source]¶ Run the PPM server from OPM to predict transmembrane residues.
Parameters: - pdb_file (str) – Path to PDB file
- outfile (str) – Path to output HTML results file
- force_rerun (bool) – Flag to rerun PPM if HTML results file already exists
Returns: Dictionary of information from the PPM run, including a link to download the membrane protein file
Return type: dict
Kinetic folding rate¶
FOLD-RATE¶
This module provides a function to predict the kinetic folding rate (kf) given an amino acid sequence and its structural classification (alpha/beta/mixed).
- Obtain your protein’s sequence
- Determine the main secondary structure composition of the protein (all-alpha, all-beta, mixed, or unknown)
- Input the sequence and secondary structure composition into the function ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate()
What is the main secondary structure composition of my protein?
- all-alpha = dominated by α-helices; α > 40% and β < 5%
- all-beta = dominated by β-strands; β > 40% and α < 5%
- mixed = contain both α-helices and β-strands; α > 15% and β > 10%
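The thresholds above translate directly into a small classifier. In this sketch the function name `classify_secstruct` is hypothetical, inputs are percentages of residues in α-helices and β-strands, and returning `unknown` when no rule matches is an assumption.

```python
def classify_secstruct(alpha_pct, beta_pct):
    """Map percent helix and percent strand content to a structural class."""
    if alpha_pct > 40 and beta_pct < 5:
        return 'all-alpha'
    if beta_pct > 40 and alpha_pct < 5:
        return 'all-beta'
    if alpha_pct > 15 and beta_pct > 10:
        return 'mixed'
    return 'unknown'  # assumed fallback when no rule applies

# Synthetic compositions, one per class
classes = [
    classify_secstruct(55, 2),
    classify_secstruct(3, 50),
    classify_secstruct(30, 20),
    classify_secstruct(10, 8),
]
```

The returned string can then be passed as the `secstruct` argument described in this section.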
What is the kinetic folding rate?
- Protein folding rate is a measure of slow/fast folding of proteins from the unfolded state to native three-dimensional structure.
What units is it in?
- Number of proteins folded per second
How can I install FOLD-RATE?
- FOLD-RATE is only available as a web server. ssbio provides a wrapper for the web server and allows you to submit protein sequences to it along with caching the output files.
How do I cite FOLD-RATE?
- Gromiha MM, Thangakani AM & Selvaraj S (2006) FOLD-RATE: prediction of protein folding rates from amino acid sequence. Nucleic Acids Res. 34: W70–4 Available at: http://dx.doi.org/10.1093/nar/gkl043
How can this parameter be used on a genome-scale?
- See: Chen K, Gao Y, Mih N, O’Brien EJ, Yang L & Palsson BO (2017) Thermosensitivity of growth is determined by chaperone-mediated proteome reallocation. Proceedings of the National Academy of Sciences 114: 11548–11553 Available at: http://www.pnas.org/content/114/43/11548.abstract
I’m having issues running FOLD-RATE…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
ssbio.protein.sequence.properties.kinetic_folding_rate.
get_foldrate
(seq, secstruct)[source]¶ Submit sequence and structural class to FOLD-RATE calculator (http://www.iitm.ac.in/bioinfo/fold-rate/) to calculate kinetic folding rate.
Parameters: - seq (str, Seq, SeqRecord) – Amino acid sequence
- secstruct (str) – Structural class: all-alpha, all-beta, mixed, or unknown
Returns: Kinetic folding rate k_f
Return type: float
-
ssbio.protein.sequence.properties.kinetic_folding_rate.
get_foldrate_at_temp
(ref_rate, new_temp, ref_temp=37.0)[source]¶ Scale the predicted kinetic folding rate of a protein to temperature T, based on the relationship ln(k_f)∝1/T
Parameters: - ref_rate (float) – Kinetic folding rate calculated from the function get_foldrate()
- new_temp (float) – Temperature in degrees C
- ref_temp (float) – Reference temperature, default to 37 C
Returns: Kinetic folding rate k_f at temperature T
Return type: float
Protein structure calculations¶
Secondary structure¶
DSSP¶

DSSP (Define Secondary Structure of Proteins) is the standard method used to assign secondary structure annotations to a protein structure. DSSP utilizes the atomic coordinates of a structure to assign the secondary codes, which are:
Code | Description |
---|---|
H | Alpha helix |
B | Beta bridge |
E | Strand |
G | Helix-3 |
I | Helix-5 |
T | Turn |
S | Bend |
Furthermore, DSSP calculates geometric properties such as the phi and psi angles between residues and solvent accessibilities. ssbio provides wrappers around the Biopython DSSP module to execute and parse DSSP results, as well as converting the information into a Pandas DataFrame format with calculated relative solvent accessibilities (see ssbio.protein.structure.properties.dssp for details).
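Once a per-residue string of DSSP codes is in hand, summarizing it is a simple counting exercise. The sketch below uses the code table above; the function name is hypothetical, the assignment string is synthetic, and grouping unlisted symbols under `other` is an assumption.

```python
from collections import Counter

# Codes and descriptions copied from the table above
DSSP_CODES = {
    'H': 'Alpha helix', 'B': 'Beta bridge', 'E': 'Strand', 'G': 'Helix-3',
    'I': 'Helix-5', 'T': 'Turn', 'S': 'Bend',
}

def secstruct_composition(ss_string):
    """Percent of residues per DSSP code; unlisted symbols are grouped as 'other'."""
    if not ss_string:
        return {}
    counts = Counter(code if code in DSSP_CODES else 'other' for code in ss_string)
    total = len(ss_string)
    return {code: 100.0 * n / total for code, n in counts.items()}

demo_ss = 'HHHHHHTT-EEEE-SS-HHH'  # synthetic 20-residue assignment string
composition = secstruct_composition(demo_ss)
```

For this string the helix content is 45% and the strand content 20%, the kind of percentages a structural-class rule would take as input.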
Note
These instructions were created on an Ubuntu 17.04 system.
Install the DSSP package
sudo apt-get install dssp
The program installs itself as mkdssp, not dssp, and Biopython looks to execute dssp, so we need to symlink the name dssp to mkdssp
sudo ln -s /usr/bin/mkdssp /usr/bin/dssp
Then you should be able to run dssp in your terminal.
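To confirm from Python that the symlink worked, a small check like the one below can help. It is only a sketch: the helper name is made up for this example, and it relies solely on the standard library to look up the two executable names mentioned above.

```python
import shutil


def find_dssp_executable(candidates=("dssp", "mkdssp")):
    """Return the path of the first DSSP executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    print(find_dssp_executable() or "DSSP not found on PATH")
```

If only mkdssp is reported, the symlink step has probably not been applied yet.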
To run the program on its own in the shell…
dssp -i <path_to_pdb_file> -o <new_path_to_output_file>
To run the program using the ssbio Python wrapper, see: ssbio.protein.structure.properties.dssp.get_dssp_df_on_file()
How do I cite DSSP?
- Kabsch W & Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637
I’m having issues running DSSP…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
ssbio.protein.structure.properties.dssp.
all_dssp_props
(filename, file_type)[source]¶ Returns a large dictionary of SASA, secondary structure composition, and surface/buried composition. Values are computed using DSSP.
Input: PDB or MMCIF filename
Output: Dictionary of values obtained from dssp
-
ssbio.protein.structure.properties.dssp.
calc_sasa
(dssp_df)[source]¶ Calculation of SASA utilizing the DSSP program.
DSSP must be installed for biopython to properly call it. Install using apt-get on Ubuntu or from: http://swift.cmbi.ru.nl/gv/dssp/
Input: PDB or CIF structure file Output: SASA (integer) of structure
-
ssbio.protein.structure.properties.dssp.
calc_surface_buried
(dssp_df)[source]¶ Calculates the percent of residues that are in the surface or buried, as well as if they are polar or nonpolar. Returns a dictionary of this.
-
ssbio.protein.structure.properties.dssp.
get_dssp_df_on_file
(pdb_file, outfile=None, outdir=None, outext='_dssp.df', force_rerun=False)[source]¶ Run DSSP directly on a structure file with the Biopython method Bio.PDB.DSSP.dssp_dict_from_pdb_file
Avoids errors like: PDBException: Structure/DSSP mismatch at <Residue MSE het= resseq=19 icode= > by not matching information to the structure file (DSSP fills in the ID “X” for unknown residues)
Parameters: - pdb_file – Path to PDB file
- outfile – Name of output file
- outdir – Path to output directory
- outext – Extension of output file
- force_rerun – If DSSP should be rerun if the outfile exists
Returns: DSSP results summarized
Return type: DataFrame
STRIDE¶

STRIDE (Structural identification) is a program used to assign secondary structure annotations to a protein structure. STRIDE has slightly more complex criteria to assign codes compared to DSSP. STRIDE utilizes the atomic coordinates of a structure to assign the structure codes, which are:
Code | Description |
---|---|
H | Alpha helix |
G | 3-10 helix |
I | PI-helix |
E | Extended conformation |
B or b | Isolated bridge |
T | Turn |
C | Coil (none of the above) |
Note
These instructions were created on an Ubuntu 17.04 system.
Download the source from the STRIDE download page
Create a new folder named “stride” in a place where you store software and extract the source into it
mkdir /path/to/software/stride
cp /path/to/downloaded/stride.tar.gz /path/to/software/stride
cd /path/to/software/stride
tar -zxf stride.tar.gz
Build the program from source and copy its binary:
cd /path/to/software/stride
make
cp stride /usr/local/bin
To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()
How do I cite STRIDE?
- Frishman D & Argos P (1995) Knowledge-based protein secondary structure assignment. Proteins 23: 566–579 Available at: http://dx.doi.org/10.1002/prot.340230412
I’m having issues running STRIDE…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
Solvent accessibilities¶
FreeSASA¶

FreeSASA is an open source library written in C for calculating solvent accessible surface areas of a protein. FreeSASA also contains Python bindings, and the plan is to include these bindings with ssbio in the future.
Note
These instructions were created on an Ubuntu 17.04 system with a Python installation through Anaconda3.
Note
FreeSASA Python bindings are slightly difficult to install with Python 3 - ssbio provides wrappers for the command line executable instead
Download the latest tarball (see FreeSASA home page), expand it and run
./configure --disable-json --disable-xml
make
Install with
sudo make install
To run the program using the ssbio Python wrapper, see: ssbio.protein.structure.properties.freesasa.run_freesasa()
How do I cite FreeSASA?
- Mitternacht S (2016) FreeSASA: An open source C library for solvent accessible surface area calculations. F1000Res. 5: 189 Available at: http://dx.doi.org/10.12688/f1000research.7931.1
I’m having issues running FreeSASA…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
ssbio.protein.structure.properties.freesasa.
parse_rsa_data
(rsa_outfile, ignore_hets=True)[source]¶ Process a NACCESS or freesasa RSA output file. Adapted from the Biopython NACCESS module.
Parameters: - rsa_outfile (str) – Path to RSA output file
- ignore_hets (bool) – If HETATMs should be excluded from the final dictionary. This is extremely important when loading this information into a ChainProp’s SeqRecord, since this will throw off the sequence matching.
Returns: Per-residue dictionary of RSA values
Return type: dict
-
ssbio.protein.structure.properties.freesasa.
run_freesasa
(infile, outfile, include_hetatms=True, outdir=None, force_rerun=False)[source]¶ Run freesasa on a PDB file, output using the NACCESS RSA format.
Parameters: - infile (str) – Path to PDB file (only PDB file format is accepted)
- outfile (str) – Path or filename of output file
- include_hetatms (bool) – If heteroatoms should be included in the SASA calculations
- outdir (str) – Path to output file if not specified in outfile
- force_rerun (bool) – If freesasa should be rerun even if outfile exists
Returns: Path to output SASA file
Return type: str
Residue depths¶
MSMS¶

MSMS computes solvent excluded surfaces on a protein structure. Generally, MSMS is used to calculate residue depths (in Angstroms) from the surface of a protein, using a PDB file as an input. ssbio provides wrappers through Biopython to run MSMS as well as store the depths in an associated StructProp object.
Note
These instructions were created on an Ubuntu 17.04 system.
Head to the Download page, and under the header “MSMS 2.6.X - Current Release” download the “Unix/Linux i86_64” version - if this doesn’t work though you’ll want to try the “Unix/Linux i86” version later.
Download it, unarchive it to your library path:
sudo mkdir /usr/local/lib/msms
cd /usr/local/lib/msms
tar zxvf /path/to/your/downloaded/file/msms_i86_64Linux2_2.6.1.tar.gz
Symlink the binaries (or alternatively, add the two locations to your PATH):
sudo ln -s /usr/local/lib/msms/msms.x86_64Linux2.2.6.1 /usr/local/bin/msms
sudo ln -s /usr/local/lib/msms/pdb_to_xyzr* /usr/local/bin
Fix a bug in the pdb_to_xyzr file (see: http://mailman.open-bio.org/pipermail/biopython/2015-November/015787.html):
sudo vi /usr/local/lib/msms/pdb_to_xyzr
at line 34, change:
numfile = "./atmtypenumbers"
to:
numfile = "/usr/local/lib/msms/atmtypenumbers"
Repeat the same edit for the file /usr/local/lib/msms/pdb_to_xyzrn
Now try running msms in the terminal; it should say:
$ msms
MSMS 2.6.1 started on structure
Copyright M.F. Sanner (1994)
Compilation flags -O2 -DVERBOSE -DTIMING
MSMS: No input stream specified
To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()
How do I cite MSMS?
- Sanner MF, Olson AJ & Spehner J-C (1996) Reduced surface: an efficient way to compute molecular surfaces. Biopolymers 38: 305–320. Available at: http://mgl.scripps.edu/people/sanner/html/papers/msmsTextAndFigs.pdf
How long does it take to run?
- Depending on the size of the protein structure, the program can take up to a couple minutes to execute.
I’m having issues running MSMS…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
ssbio.protein.structure.properties.msms.
get_msms_df
(model, pdb_id, outfile=None, outdir=None, outext='_msms.df', force_rerun=False)[source]¶ Run MSMS (using Biopython) on a Biopython Structure Model.
Depths are in units of Angstroms (1 Å = 10^-10 m = 0.1 nm). Returns a dictionary of:
{ chain_id:{ resnum1_id: (res_depth, ca_depth), resnum2_id: (res_depth, ca_depth) } }
Parameters: model – Biopython Structure Model
Returns: ResidueDepth property_dict, reformatted
Return type: Pandas DataFrame
-
ssbio.protein.structure.properties.msms.
get_msms_df_on_file
(pdb_file, outfile=None, outdir=None, outext='_msms.df', force_rerun=False)[source]¶ Run MSMS (using Biopython) on a PDB file.
Saves a CSV file of:
- chain: chain ID
- resnum: residue number (PDB numbering)
- icode: residue insertion code
- res_depth: average depth of all atoms in a residue
- ca_depth: depth of the alpha carbon atom
Depths are in units of Angstroms (1 Å = 10^-10 m = 0.1 nm).
Parameters: - pdb_file – Path to PDB file
- outfile – Optional name of output file (without extension)
- outdir – Optional output directory
- outext – Optional extension (suffix) for the output file
- force_rerun – Rerun MSMS even if results exist already
Returns: ResidueDepth property_dict, reformatted
Return type: Pandas DataFrame
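The saved table can be regrouped into a nested, per-chain dictionary similar to the one described for get_msms_df. The sketch below is not ssbio code: it assumes the CSV has a header row with exactly the column names listed above, and the function name and example rows are made up for illustration.

```python
import csv
import io


def depths_by_chain(csv_text):
    """Group residue-depth rows into {chain: {(resnum, icode): (res_depth, ca_depth)}}."""
    nested = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A blank insertion code is stored as an empty string.
        key = (int(row["resnum"]), row["icode"].strip())
        depths = (float(row["res_depth"]), float(row["ca_depth"]))  # Angstroms
        nested.setdefault(row["chain"], {})[key] = depths
    return nested


# Hypothetical rows standing in for a real MSMS result file.
example = """chain,resnum,icode,res_depth,ca_depth
A,1, ,2.5,3.0
A,2, ,4.1,4.4
B,1, ,1.9,2.2
"""
print(depths_by_chain(example))
```

Keying on (resnum, icode) rather than resnum alone keeps residues with insertion codes from overwriting each other.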
Structural similarity¶
FATCAT¶
FATCAT is a structural alignment tool that allows you to determine the similarity of a pair of protein structures.
Warning
Parsing FATCAT results is currently incomplete and will only return TM-scores as of now - but TM-scores only show up in development versions of jFATCAT
Note
These instructions were created on an Ubuntu 17.04 system.
- Make sure Java is installed on your system and can be run with the command java
- Download the Java port of FATCAT from the jFATCAT download page, under the section “Older file downloads” with the filename protein-comparison-tool_<DATE>.tar.gz, with the most recent date.
- Extract it to a place where you store software
To run the program using the ssbio Python wrapper, see: ssbio.protein.structure.properties.fatcat.run_fatcat(). Run it on two structures, pointing to the path of the runFATCAT.sh script.
How do I cite FATCAT?
- Ye Y & Godzik A (2003) Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19 Suppl 2: ii246–55 Available at: https://www.ncbi.nlm.nih.gov/pubmed/14534198
I’m having issues running FATCAT…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
ssbio.protein.structure.properties.fatcat.
parse_fatcat
(fatcat_xml)[source]¶ Parse a FATCAT XML result file.
Parameters: fatcat_xml (str) – Path to FATCAT XML result file
Returns: Parsed information from the output
Return type: dict
Todo
- Only returning TM-score at the moment
-
ssbio.protein.structure.properties.fatcat.
run_fatcat
(structure_path_1, structure_path_2, fatcat_sh, outdir='', silent=False, print_cmd=False, force_rerun=False)[source]¶ Run FATCAT on two PDB files, and return the path of the XML result file.
Parameters: - structure_path_1 (str) – Path to PDB file
- structure_path_2 (str) – Path to PDB file
- fatcat_sh (str) – Path to “runFATCAT.sh” executable script
- outdir (str) – Path to where FATCAT XML output files will be saved
- silent (bool) – If stdout should be silenced from showing up in Python console output
- print_cmd (bool) – If command to run FATCAT should be printed to stdout
- force_rerun (bool) – If FATCAT should be run even if XML output files already exist
Returns: Path to XML output file
Return type: str
-
ssbio.protein.structure.properties.fatcat.
run_fatcat_all_by_all
(list_of_structure_paths, fatcat_sh, outdir='', silent=True, force_rerun=False)[source]¶ Run FATCAT on all pairs of structures given a list of structures.
Parameters: - list_of_structure_paths (list) – List of PDB file paths
- fatcat_sh (str) – Path to “runFATCAT.sh” executable script
- outdir (str) – Path to where FATCAT XML output files will be saved
- silent (bool) – If stdout should be silenced from showing up in Python console output
- force_rerun (bool) – If FATCAT should be run even if XML output files already exist
Returns: TM-scores (similarity) between all structures
Return type: Pandas DataFrame
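The all-by-all pattern can be sketched with the standard library alone. This is an illustration of the pairing logic, not the ssbio implementation: whether ssbio skips self-comparisons or runs both orderings is not stated here, and score_pair is a hypothetical stand-in for a real FATCAT run on two files.

```python
from itertools import combinations


def all_by_all(items, score_pair):
    """Score every unordered pair once and mirror it into a symmetric nested dict."""
    scores = {a: {b: None for b in items} for a in items}
    for a, b in combinations(items, 2):
        value = score_pair(a, b)
        scores[a][b] = value
        scores[b][a] = value
    return scores


def toy_score(path_1, path_2):
    # Hypothetical placeholder score: fraction of characters shared by the two names.
    shared = set(path_1) & set(path_2)
    return len(shared) / len(set(path_1) | set(path_2))


print(all_by_all(["1abc.pdb", "2xyz.pdb", "3abc.pdb"], toy_score))
```

Scoring each unordered pair once means n structures need n(n-1)/2 runs instead of n squared, which matters when each run launches an external program.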
Various structure properties¶
Structure cleaning, mutating¶
Protein sequence predictions¶
Secondary structure¶
SCRATCH¶

SCRATCH is a suite of tools to predict many types of structural properties directly from sequence. ssbio contains wrappers to execute and parse results from SSpro/SSpro8 - predictors of secondary structure, and ACCpro/ACCpro20 - predictors of solvent accessibility.
Note
These instructions were created on an Ubuntu 17.04 system.
Download the source and install it using the perl script:
mkdir /path/to/my/software/scratch
cd /path/to/my/software/scratch
wget http://download.igb.uci.edu/SCRATCH-1D_1.1.tar.gz
tar -zxf SCRATCH-1D_1.1.tar.gz
cd SCRATCH-1D_1.1
perl install.pl
ssbio also provides command line wrappers to run it and parse the results; see the SCRATCH class documentation for details.
To run the program on its own in the shell…
/path/to/my/software/scratch/SCRATCH-1D_1.1/bin/run_SCRATCH-1D_predictors.sh input_fasta output_prefix [num_threads]
To run the program using the ssbio Python wrapper, see: ssbio.protein.sequence.properties.scratch.SCRATCH.run_scratch()
How do I cite SCRATCH?
- Cheng J, Randall AZ, Sweredoski MJ & Baldi P (2005) SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res. 33: W72–6 Available at: http://dx.doi.org/10.1093/nar/gki396
I’m having issues running SCRATCH…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
class
ssbio.protein.sequence.properties.scratch.
SCRATCH
(project_name, seq_file=None, seq_str=None)[source]¶ Provide wrappers for running and parsing SCRATCH on a sequence file or sequence string.
To run from the command line:
./run_SCRATCH-1D_predictors.sh input_fasta output_prefix [num_threads]
SCRATCH predicts:
Secondary structure
- 3 classes (helix, strand, other) using SSpro
- 8 classes (standard DSSP definitions) using SSpro8
Relative solvent accessibility (RSA, also known as relative accessible surface area)
- @ 25% exposed RSA cutoff (<25% RSA means it is buried)
- @ all cutoffs in 5% increments from 0 to 100
-
accpro20_results
()[source]¶ Parse the ACCpro20 output file and return a dict of predicted relative solvent accessibilities
-
accpro20_summary
(cutoff)[source]¶ Parse the ACCpro output file and return a summary of percent exposed/buried residues based on a cutoff.
Residues below the cutoff are buried; residues equal to or greater than the cutoff are exposed. The default cutoff used in accpro is 25%.
The output file is just a FASTA formatted file, so you can get residue level information by parsing it like a normal sequence file.
Parameters: cutoff (float) – Cutoff for defining a buried or exposed residue.
Returns: Percentage of buried and exposed residues
Return type: dict
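The cutoff rule can be sketched as follows. This is a minimal illustration rather than the ssbio method: the function name, the dictionary keys, and the example values are assumptions, and the input is taken to be a list of per-residue relative solvent accessibilities in percent.

```python
def summarize_exposure(rsa_percent, cutoff=25.0):
    """Return the percentage of buried (< cutoff) and exposed (>= cutoff) residues."""
    if not rsa_percent:
        raise ValueError("need at least one residue")
    exposed = sum(1 for value in rsa_percent if value >= cutoff)
    total = len(rsa_percent)
    return {
        "buried": 100.0 * (total - exposed) / total,
        "exposed": 100.0 * exposed / total,
    }


# A residue exactly at the cutoff (25) counts as exposed.
print(summarize_exposure([0, 5, 25, 60, 95]))
```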
-
accpro_results
()[source]¶ Parse the ACCpro output file and return a dict of predicted solvent accessibilities.
-
accpro_summary
()[source]¶ Parse the ACCpro output file and return a summary of percent exposed/buried residues.
The output file is just a FASTA formatted file, so you can get residue level information by parsing it like a normal sequence file.
Returns: Percentage of buried and exposed residues
Return type: dict
-
run_scratch
(path_to_scratch, num_cores=1, outname=None, outdir=None, force_rerun=False)[source]¶ Run SCRATCH on the sequence_file that was loaded into the class.
Parameters: - path_to_scratch – Path to the SCRATCH executable, run_SCRATCH-1D_predictors.sh
- outname – Prefix to name the output files
- outdir – Directory to store the output files
- force_rerun – Flag to force rerunning of SCRATCH even if the output files exist
Returns:
-
sspro8_results
()[source]¶ Parse the SSpro8 output file and return a dict of secondary structure compositions.
-
sspro8_summary
()[source]¶ Parse the SSpro8 output file and return a summary of secondary structure composition.
The output file is just a FASTA formatted file, so you can get residue level information by parsing it like a normal sequence file.
Returns: Percentage of:
- H: alpha-helix
- G: 3-10 helix
- I: pi-helix (extremely rare)
- E: extended strand
- B: beta-bridge
- T: turn
- S: bend
- C: the rest
Return type: dict
-
sspro_results
()[source]¶ Parse the SSpro output file and return a dict of secondary structure compositions.
Returns: Keys are sequence IDs, values are the lists of secondary structure predictions:
- H: helix
- E: strand
- C: the rest
Return type: dict
-
sspro_summary
()[source]¶ Parse the SSpro output file and return a summary of secondary structure composition.
The output file is just a FASTA formatted file, so you can get residue level information by parsing it like a normal sequence file.
Returns: Percentage of:
- H: helix
- E: strand
- C: the rest
Return type: dict
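Since the prediction is a plain string of H/E/C codes, a composition summary takes only a few lines. The sketch below is an illustration and not the ssbio method; the function name and the example string are hypothetical.

```python
from collections import Counter


def secondary_structure_percentages(prediction, classes="HEC"):
    """Return the percentage of each secondary structure code in a prediction string."""
    total = len(prediction)
    if total == 0:
        raise ValueError("empty prediction string")
    counts = Counter(prediction)
    # Counter returns 0 for codes that never occur, so every class gets an entry.
    return {code: 100.0 * counts[code] / total for code in classes}


print(secondary_structure_percentages("CCHHHHEECC"))
```

Passing classes="HGIEBTSC" would give the same kind of summary for an 8-class SSpro8 string.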
-
ssbio.protein.sequence.properties.scratch.
read_accpro20
(infile)[source]¶ Read the accpro20 output (.acc20) and return the parsed FASTA records.
Keeps the spaces between the accessibility numbers.
Parameters: infile – Path to .acc20 file
Returns: Dictionary of accessibilities with keys as the ID
Return type: dict
Solvent accessibilities¶
Thermostability¶
Transmembrane domains¶
TMHMM¶
TMHMM is a program to predict the location of transmembrane helices in proteins, directly from sequence. ssbio provides a wrapper to execute and parse the “long” output format of TMHMM.
Note
These instructions were created on an Ubuntu 17.04 system.
Register for the software (academic license only) at the TMHMM download page
Receive instructions to download the software at your email address
Download the file tmhmm-2.0c.Linux.tar.gz
Extract it to a place where you store software
Install it according to the TMHMM installation instructions, repeated and annotated below…
- Insert the correct path for perl 5.x in the first line of the scripts bin/tmhmm and bin/tmhmmformat.pl (if not /usr/local/bin/perl). Use which perl and perl -v in the terminal to help find the correct path.
- Make sure you have an executable version of decodeanhmm in the bin directory.
- Include the directory containing tmhmm in your path (how do I add something to my Path?)
- Read the TMHMM2.0.guide.html
To run the program on its own, execute the following command with your protein sequences contained in a FASTA file:
tmhmm my_sequences.fasta
To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()
How do I cite TMHMM?
- Krogh A, Larsson B, von Heijne G & Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305: 567–580 Available at: http://dx.doi.org/10.1006/jmbi.2000.4315
I’m having issues running TMHMM…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
ssbio.protein.sequence.properties.tmhmm.
label_TM_tmhmm_residue_numbers_and_leaflets
(tmhmm_seq)[source]¶ Determine the residue numbers of the TM-helix residues that cross the membrane and label them by leaflet.
Parameters: tmhmm_seq – g.protein.representative_sequence.seq_record.letter_annotations['TM-tmhmm']
Returns: a dictionary with leaflet_variable : [residue list], where the variable is inside or outside
TM_boundary dict: a dictionary with TM helix number : [TM helix residue start, TM helix residue end]
Return type: leaflet_dict
Todo
untested method!
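The boundary dictionary described above can be sketched by collapsing the per-residue labels into runs. This is not the ssbio method itself: the label alphabet is an assumption (here "T" marks transmembrane residues, with "I" and "O" for inside and outside), and the function name and example string are made up.

```python
from itertools import groupby


def tm_helix_boundaries(labels, tm_label="T"):
    """Return {helix_number: [start, end]} using 1-based, inclusive residue numbers."""
    boundaries = {}
    position = 1
    helix_number = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == tm_label:
            helix_number += 1
            boundaries[helix_number] = [position, position + length - 1]
        # Advance past this run whether or not it was a helix.
        position += length
    return boundaries


print(tm_helix_boundaries("IIITTTTTOOOTTTTII"))
```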
Aggregation propensity¶
AMYLPRED2¶
This module provides a function to predict the aggregation propensity of proteins, specifically the number of aggregation-prone segments on an unfolded protein sequence. AMYLPRED2 is a consensus method that combines the predictions of several different methods. To obtain the best balance between sensitivity and specificity, we follow the authors’ guidelines: every stretch of 5 consecutive residues on which at least 5 methods agree contributes 1 to the aggregation propensity.
- Create an account on the webserver at the AMYLPRED2 registration link.
- Create a new AMYLPRED object with your email and password initialized along with it.
- Run ssbio.protein.sequence.properties.aggregation_propensity.AMYLPRED.get_aggregation_propensity() on a protein sequence.
What is aggregation propensity?
- The number of aggregation-prone segments on an unfolded protein sequence.
How can I install AMYLPRED2?
- AMYLPRED2 is only available as a web server. ssbio provides a wrapper for the web server and allows you to submit protein sequences to it along with caching the output files.
How do I cite AMYLPRED2?
- Tsolis AC, Papandreou NC, Iconomidou VA & Hamodrakas SJ (2013) A consensus method for the prediction of ‘aggregation-prone’ peptides in globular proteins. PLoS One 8: e54175 Available at: http://dx.doi.org/10.1371/journal.pone.0054175
How can this parameter be used on a genome-scale?
- See: Chen K, Gao Y, Mih N, O’Brien EJ, Yang L & Palsson BO (2017) Thermosensitivity of growth is determined by chaperone-mediated proteome reallocation. Proceedings of the National Academy of Sciences 114: 11548–11553 Available at: http://www.pnas.org/content/114/43/11548.abstract
I’m having issues running AMYLPRED2…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
class
ssbio.protein.sequence.properties.aggregation_propensity.
AMYLPRED
(email, password)[source]¶ Class to submit sequences to AMYLPRED2.
Instructions:
- Create an account on the webserver at the AMYLPRED2 registration link.
- Create a new AMYLPRED object with your email and password initialized along with it.
- Run get_aggregation_propensity on a protein sequence.
-
email
¶ str – Account email
-
password
¶ str – Account password
Todo
- Properly implement force_rerun and caching functions
-
get_aggregation_propensity
(seq, outdir, cutoff_v=5, cutoff_n=5, run_amylmuts=False)[source]¶ Run the AMYLPRED2 web server for a protein sequence and get the consensus result for aggregation propensity.
Parameters: - seq (str, Seq, SeqRecord) – Amino acid sequence
- outdir (str) – Directory to where output files should be saved
- cutoff_v (int) – The minimal number of methods that agree on a residue being an aggregation-prone residue
- cutoff_n (int) – The minimal number of consecutive residues to be considered as a ‘stretch’ of aggregation-prone region
- run_amylmuts (bool) – If AMYLMUTS method should be run, default False. AMYLMUTS is optional as it is the most time consuming and generates a slightly different result every submission.
Returns: Aggregation propensity - the number of aggregation-prone segments on an unfolded protein sequence
Return type: int
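One way to read the two cutoffs is sketched below: count the maximal runs of at least cutoff_n consecutive residues that at least cutoff_v methods flag. This is an interpretation for illustration, not the ssbio implementation; the function name and the per-residue hit counts are hypothetical.

```python
def count_aggregation_segments(hits_per_residue, cutoff_v=5, cutoff_n=5):
    """Count maximal runs of >= cutoff_n consecutive residues with >= cutoff_v agreeing methods."""
    segments = 0
    run_length = 0
    for hits in hits_per_residue:
        if hits >= cutoff_v:
            run_length += 1
        else:
            if run_length >= cutoff_n:
                segments += 1
            run_length = 0
    # Close a run that extends to the end of the sequence.
    if run_length >= cutoff_n:
        segments += 1
    return segments


# Hypothetical number of methods flagging each residue of a 16-residue sequence.
hits = [0, 5, 6, 5, 7, 5, 1, 5, 5, 0, 6, 6, 6, 6, 6, 6]
print(count_aggregation_segments(hits))
```

In this example the two-residue run in the middle is too short to count, so only the first and last stretches contribute.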
-
run_amylpred2
(seq, outdir, run_amylmuts=False)[source]¶ Run all methods on the AMYLPRED2 web server for an amino acid sequence and gather results.
Result files are cached in /path/to/outdir/AMYLPRED2_results.
Parameters: - seq (str) – Amino acid sequence as a string
- outdir (str) – Directory to where output files should be saved
- run_amylmuts (bool) – If AMYLMUTS method should be run, default False
Returns: Result for each method run
Return type: dict
Protein sequence calculations¶
Various sequence properties¶
EMBOSS¶
EMBOSS is the European Molecular Biology Open Software Suite. EMBOSS contains a wide array of general purpose bioinformatics programs. For the GEM-PRO pipeline, we mainly need the needle pairwise alignment tool (although this can be replaced with Biopython’s built-in pairwise alignment function), and the pepstats protein sequence statistics tool.
Note
These instructions were created on an Ubuntu 17.04 system.
Install the EMBOSS package which contains many programs
sudo apt-get install emboss
And then once that installs, try running the needle program:
needle
Alternatively, to install from source, download the EMBOSS source code and run:
./configure
make
sudo make install
How do I cite EMBOSS?
- Rice P, Longden I & Bleasby A (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16: 276–277 Available at: http://www.ncbi.nlm.nih.gov/pubmed/10827456
I’m having issues running EMBOSS programs…
- See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
-
ssbio.protein.sequence.properties.residues.
biopython_protein_analysis
(inseq)[source]¶ Utilize Biopython’s ProteinAnalysis module to return general sequence properties of an amino acid string.
For full definitions see: http://biopython.org/DIST/docs/api/Bio.SeqUtils.ProtParam.ProteinAnalysis-class.html
Parameters: inseq – Amino acid sequence
Returns: Dictionary of sequence properties. Some definitions include:
- instability_index: Any value above 40 means the protein is unstable (has a short half life).
- secondary_structure_fraction: Percentage of protein in helix, turn or sheet
Return type: dict
Todo
Finish definitions of dictionary
-
ssbio.protein.sequence.properties.residues.
emboss_pepstats_on_fasta
(infile, outfile='', outdir='', outext='.pepstats', force_rerun=False)[source]¶ Run EMBOSS pepstats on a FASTA file.
Parameters: - infile – Path to FASTA file
- outfile – Name of output file without extension
- outdir – Path to output directory
- outext – Extension of results file, default is “.pepstats”
- force_rerun – Flag to rerun pepstats
Returns: Path to output file.
Return type: str
-
ssbio.protein.sequence.properties.residues.
emboss_pepstats_parser
(infile)[source]¶ Get dictionary of pepstats results.
Parameters: infile – Path to pepstats outfile
Returns: Parsed information from pepstats
Return type: dict
Todo
Only currently parsing the bottom of the file for percentages of properties.
-
ssbio.protein.sequence.properties.residues.
flexibility_index
(aa_one)[source]¶ From Smith DK, Radivojac P, Obradovic Z, et al. Improved amino acid flexibility parameters. Protein Sci. 2003, 12:1060
Author: Ke Chen
Parameters: aa_one – Returns:
-
ssbio.protein.sequence.properties.residues.
grantham_score
(ref_aa, mut_aa)[source]¶ https://github.com/ashutoshkpandey/Annotation/blob/master/Grantham_score_calculator.py
Sequence alignment¶
Index: Tutorials¶
Welcome to the ssbio Binder! Here you can interactively launch notebook tutorials or even alter them to run on your own data.
Testing¶
These notebooks make sure the Binder environment has been installed correctly.
- Software Installation Tester - This notebook simply tests if external programs have been installed correctly and can run in a Binder environment.
- I-TASSER and TMHMM Install Guide - This notebook provides a guide to installing I-TASSER and TMHMM in Binder, since these require you to register with your email and cannot be installed beforehand.
The GEM-PRO Pipeline¶
The GEM-PRO pipeline is focused on annotating genome-scale models with protein structure information, and subsequently making it easier to work with proteins at this scale.
- GEM-PRO - SBML Model (iNJ661) - This notebook gives an example of how to run the GEM-PRO pipeline with a SBML model, in this case iNJ661, the metabolic model of M. tuberculosis.
- GEM-PRO - Genes & Sequences - This notebook gives an example of how to run the GEM-PRO pipeline with a dictionary of gene IDs and their protein sequences.
- GEM-PRO - List of Gene IDs - This notebook gives an example of how to run the GEM-PRO pipeline with a list of gene IDs.
- GEM-PRO - Calculating Protein Properties - This notebook gives an example of how to calculate protein properties for a list of proteins.
The Protein Class¶
- Protein - Structure Mapping, Alignments, and Visualization - This notebook gives an example of how to map a single protein sequence to its structure, along with conducting sequence alignments and visualizing the mutations.
The StructProp Class¶
- PDBProp - Working With a Single PDB Structure - This notebook gives a tutorial of the PDBProp object, specifically how chains are handled and how to map a sequence to it.
The SeqProp Class¶
- SeqProp - Protein Sequence Properties - This notebook gives an overview of the available calculations for properties of a single protein sequence.
Other tutorials¶
- SWISS-MODEL - Downloading homology models - This notebook gives a tutorial of the SWISSMODEL object, which is a simple class to parse and download models from the SWISS-MODEL homology model repository.
- FATCAT - Structure Similarity - This notebook shows how to run and parse FATCAT, a structural similarity calculator.
Python API¶
Information on select functions, classes, or methods.
ssbio.pipeline.gempro
¶
GEMPRO¶
-
class
ssbio.pipeline.gempro.
GEMPRO
(gem_name, root_dir=None, pdb_file_type='mmtf', gem=None, gem_file_path=None, gem_file_type=None, genes_list=None, genes_and_sequences=None, genome_path=None, write_protein_fasta_files=True, description=None, custom_spont_id=None)[source]¶ Generic class to represent all information for a GEM-PRO project.
Initialize the GEM-PRO project with a genome-scale model, a list of genes, or a dict of genes and sequences. Specify the name of your project, along with the root directory where a folder with that name will be created.
Main methods provided are:
Automated mapping of sequence IDs
- With KEGG mapper
- With UniProt mapper
- Allowing manual gene ID –> protein sequence entry
- Allowing manual gene ID –> UniProt ID
Consolidating sequence IDs and setting a representative sequence
- Currently these are set based on available PDB IDs
Mapping of representative sequence –> structures
- With UniProt –> ranking of PDB structures
- BLAST representative sequence –> PDB database
Preparation of files for homology modeling (currently for I-TASSER)
- Mapping to existing models
- Preparation for running I-TASSER
- Parsing I-TASSER runs
Running QC/QA on structures and setting a representative structure
- Various cutoffs (mutations, insertions, deletions) can be set to filter structures
Automation of protein sequence and structure property calculation
Creation of Pandas DataFrame summaries directly from downloaded metadata
Parameters: - gem_name (str) – The name of your GEM or just your project in general. This will be the name of the main folder that is created in root_dir.
- root_dir (str) – Path to where the folder named after gem_name will be created. If not provided, directories will not be created and output directories need to be specified for some steps.
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- gem (Model) – COBRApy Model object
- gem_file_path (str) – Path to GEM file
- gem_file_type (str) – GEM model type - sbml (or xml), mat, or json formats
- genes_list (list) – List of gene IDs that you want to map
- genes_and_sequences (dict) – Dictionary of gene IDs and their amino acid sequence strings
- genome_path (str) – FASTA file of all protein sequences
- write_protein_fasta_files (bool) – If individual protein FASTA files should be written out
- description (str) – Description string of your project
- custom_spont_id (str) – ID of spontaneous genes in a COBRA model which will be ignored for analysis
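A project can be started from a dictionary of genes and sequences using only the constructor arguments listed above. The sketch below is a hedged usage example: the wrapper function, project name, gene IDs, and sequences are all hypothetical, and it assumes ssbio is installed and that root_dir is a writable directory.

```python
def build_example_project(root_dir):
    """Create a small GEM-PRO project from a dict of gene IDs and sequences."""
    # Imported here so this module can be loaded even when ssbio is not installed.
    from ssbio.pipeline.gempro import GEMPRO

    # Hypothetical gene IDs and amino acid sequences, for illustration only.
    genes_and_sequences = {
        "gene_1": "MKTAYIAKQRQISFVKSHFSRQ",
        "gene_2": "MSDNELTPEQLAERL",
    }
    return GEMPRO(
        gem_name="example_project",
        root_dir=root_dir,
        genes_and_sequences=genes_and_sequences,
        pdb_file_type="mmtf",
    )
```

Passing genes_list or a model via gem or gem_file_path instead of genes_and_sequences follows the same pattern.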
-
add_gene_ids
(genes_list)[source]¶ Add gene IDs manually into the GEM-PRO project.
Parameters: genes_list (list) – List of gene IDs as strings.
-
base_dir
¶ str – GEM-PRO project folder.
-
blast_seqs_to_pdb
(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]¶ BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).
Parameters: - seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
-
custom_spont_id
= None¶ str – ID of spontaneous genes in a COBRA model which will be ignored for analysis
-
data_dir
¶ str – Directory where all data are stored.
-
df_homology_models
¶ DataFrame – Get a dataframe of I-TASSER homology model results
-
df_kegg_metadata
¶ DataFrame – Pandas DataFrame of KEGG metadata per protein.
-
df_pdb_blast
¶ DataFrame – Get a dataframe of PDB BLAST results
-
df_pdb_metadata
¶ DataFrame – Get a dataframe of PDB metadata (PDBs have to be downloaded first).
-
df_pdb_ranking
¶ DataFrame – Get a dataframe of UniProt -> best structure in PDB results
-
df_proteins
¶ DataFrame – Get a summary dataframe of all proteins in the project.
-
df_representative_sequences
¶ DataFrame – Pandas DataFrame of representative sequence information per protein.
-
df_representative_structures
¶ DataFrame – Get a dataframe of representative protein structure information.
-
df_uniprot_metadata
¶ DataFrame – Pandas DataFrame of UniProt metadata per protein.
-
find_disulfide_bridges
(representatives_only=True)[source]¶ Run Biopython’s disulfide bridge finder and store found bridges.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.annotations['SSBOND-biopython']
Parameters: representatives_only (bool) – If analysis should only be run on the representative structure
-
find_disulfide_bridges_parallelize
(sc, representatives_only=True)[source]¶ Run Biopython’s disulfide bridge finder and store found bridges.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.annotations['SSBOND-biopython']
Parameters: representatives_only (bool) – If analysis should only be run on the representative structure
-
functional_genes
¶ DictList – All genes with a representative protein structure.
-
genes
= None¶ DictList – All protein-coding genes in this GEM-PRO project
-
genes_dir
¶ str – Directory where all gene specific information is stored.
-
genes_with_a_representative_sequence
¶ DictList – All genes with a representative sequence.
-
genes_with_a_representative_structure
¶ DictList – All genes with a representative protein structure.
-
genes_with_experimental_structures
¶ DictList – All genes that have at least one experimental structure.
-
genes_with_homology_models
¶ DictList – All genes that have at least one homology model.
-
genes_with_structures
¶ DictList – All genes with any mapped protein structures.
-
genome_path
= None¶ str – Simple link to the filepath of the FASTA file containing all protein sequences
-
get_dssp_annotations
(representatives_only=True, force_rerun=False)[source]¶ Run DSSP on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-dssp']
Parameters: - representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_dssp_annotations_parallelize
(sc, representatives_only=True, force_rerun=False)[source]¶ Run DSSP on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-dssp']
Parameters: - representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_freesasa_annotations
(include_hetatms=False, representatives_only=True, force_rerun=False)[source]¶ Run freesasa on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-freesasa']
Parameters: - include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
- representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_freesasa_annotations_parallelize
(sc, include_hetatms=False, representatives_only=True, force_rerun=False)[source]¶ Run freesasa on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-freesasa']
Parameters: - include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
- representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_itasser_models
(homology_raw_dir, custom_itasser_name_mapping=None, outdir=None, force_rerun=False)[source]¶ Copy generated I-TASSER models from a directory to the GEM-PRO directory.
Parameters: - homology_raw_dir (str) – Root directory of I-TASSER folders.
- custom_itasser_name_mapping (dict) – Use this if your I-TASSER folder names differ from your model gene names. Input a dict of {model_gene: ITASSER_folder}.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
-
get_manual_homology_models
(input_dict, outdir=None, clean=True, force_rerun=False)[source]¶ Copy homology models to the GEM-PRO project.
Requires an input of a dictionary formatted like so:
{
    model_gene: {
        homology_model_id1: {
            'model_file': '/path/to/homology/model.pdb',
            'file_type': 'pdb',
            'additional_info': info_value
        },
        homology_model_id2: {
            'model_file': '/path/to/homology/model.pdb',
            'file_type': 'pdb'
        }
    }
}
Parameters: - input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- clean (bool) – If homology files should be cleaned and saved as a new PDB file
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
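As a minimal sketch of the input format described above, the nested dictionary can be assembled programmatically. The helper name `build_homology_input`, the gene ID, the model IDs, and the paths are hypothetical placeholders and not part of ssbio; only the `model_file` and `file_type` keys come from the documented format. Inferring the file type from the extension is an assumption made for this example.

```python
def build_homology_input(gene_to_models):
    """Build the nested {gene: {model_id: {...}}} dictionary.

    gene_to_models maps a gene ID to a list of (model_id, path) tuples.
    The file type is inferred from the path's extension (an assumption).
    """
    input_dict = {}
    for gene_id, models in gene_to_models.items():
        input_dict[gene_id] = {}
        for model_id, model_path in models:
            input_dict[gene_id][model_id] = {
                'model_file': model_path,
                'file_type': model_path.rsplit('.', 1)[-1],
            }
    return input_dict


# Hypothetical gene and model names, for illustration only
example = build_homology_input({
    'gene1': [('model_a', '/path/to/homology/model_a.pdb'),
              ('model_b', '/path/to/homology/model_b.pdb')],
})
```

The resulting `example` dictionary has the shape expected for `input_dict`.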
-
get_msms_annotations
(representatives_only=True, force_rerun=False)[source]¶ Run MSMS on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-msms']
Parameters: - representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_msms_annotations_parallelize
(sc, representatives_only=True, force_rerun=False)[source]¶ Run MSMS on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-msms']
Parameters: - representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
-
get_scratch_predictions
(path_to_scratch, results_dir, scratch_basename='scratch', num_cores=1, exposed_buried_cutoff=25, custom_gene_mapping=None)[source]¶ Run and parse SCRATCH results to predict secondary structure and solvent accessibility. Annotations are stored in the protein’s representative sequence at .annotations and .letter_annotations
Parameters: - path_to_scratch (str) – Path to SCRATCH executable
- results_dir (str) – Path to SCRATCH results folder, which will have the files (scratch.ss, scratch.ss8, scratch.acc, scratch.acc20)
- scratch_basename (str) – Basename of the SCRATCH results (‘scratch’ is default)
- num_cores (int) – Number of cores to use to parallelize SCRATCH run
- exposed_buried_cutoff (int) – Cutoff of exposed/buried for the acc20 predictions
- custom_gene_mapping (dict) – Default parsing of SCRATCH output files is to look for the model gene IDs. If your output files contain IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
-
get_sequence_properties
(representatives_only=True)[source]¶ Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of all protein sequences. Results are stored in the protein’s respective SeqProp objects at .annotations
Parameters: representatives_only (bool) – If analysis should only be run on the representative sequences
-
get_tmhmm_predictions
(tmhmm_results, custom_gene_mapping=None)[source]¶ Parse TMHMM results and store in the representative sequences.
This is a basic function to parse pre-run TMHMM results. Run TMHMM from the web service (http://www.cbs.dtu.dk/services/TMHMM/) by doing the following:
- Write all representative sequences in the GEM-PRO using the function write_representative_sequences_file
- Upload the file to http://www.cbs.dtu.dk/services/TMHMM/ and choose “Extensive, no graphics” as the output
- Copy and paste the results (ignoring the top header and above “HELP with output formats”) into a file and save it
- Run this function on that file
Parameters: - tmhmm_results (str) – Path to TMHMM results (long format)
- custom_gene_mapping (dict) – Default parsing of TMHMM output is to look for the model gene IDs. If your output file contains IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
-
kegg_mapping_and_metadata
(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶ Map all genes in the model to KEGG IDs using the KEGG service.
- Steps:
- Downloads all metadata and sequence files to the sequences directory
- Creates a KEGGProp object in the protein.sequences attribute
- Returns a Pandas DataFrame of mapping results
Parameters: - kegg_organism_code (str) – The three letter KEGG code of your organism
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
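The role of `custom_gene_mapping` can be illustrated with a short sketch. The function name `apply_custom_gene_mapping` and the gene IDs are hypothetical, and raising an error for keys that are not model genes is an assumption made here to mirror the "keys must match model gene IDs" requirement; it is not a statement about how ssbio handles that case.

```python
def apply_custom_gene_mapping(model_gene_ids, custom_gene_mapping=None):
    """Return {model_gene_id: id_to_query}, falling back to the model ID itself."""
    custom_gene_mapping = custom_gene_mapping or {}
    # Keys must be model gene IDs; reject anything else (assumed behavior)
    unknown = set(custom_gene_mapping) - set(model_gene_ids)
    if unknown:
        raise KeyError('Mapping keys are not model genes: {}'.format(sorted(unknown)))
    return {g: custom_gene_mapping.get(g, g) for g in model_gene_ids}


# Hypothetical IDs: only geneA needs a different ID for the database query
query_ids = apply_custom_gene_mapping(['geneA', 'geneB'], {'geneA': 'locus_0001'})
```

Genes without an entry in the mapping keep their model ID as the query ID.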
-
kegg_mapping_and_metadata_parallelize
(sc, kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶ Map all genes in the model to KEGG IDs using the KEGG service.
- Steps:
- Downloads all metadata and sequence files to the sequences directory
- Creates a KEGGProp object in the protein.sequences attribute
- Returns a Pandas DataFrame of mapping results
Parameters: - sc (SparkContext) – Spark Context to parallelize this function
- kegg_organism_code (str) – The three letter KEGG code of your organism
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
-
load_cobra_model
(model)[source]¶ Load a COBRApy Model object into the GEM-PRO project.
Parameters: model (Model) – COBRApy Model object
-
manual_seq_mapping
(gene_to_seq_dict, outdir=None, write_fasta_files=True, set_as_representative=True)[source]¶ Read a manual input dictionary of model gene IDs –> protein sequences. By default sets them as representative.
Parameters: - gene_to_seq_dict (dict) – Mapping of gene IDs to their protein sequence strings
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- write_fasta_files (bool) – If individual protein FASTA files should be written out
- set_as_representative (bool) – If mapped sequences should be set as representative
-
manual_uniprot_mapping
(gene_to_uniprot_dict, outdir=None, set_as_representative=True)[source]¶ Read a manual dictionary of model gene IDs –> UniProt IDs. By default sets them as representative.
This allows for mapping of the missing genes, or overriding of automatic mappings.
Input a dictionary of:
{ <gene_id1>: <uniprot_id1>, <gene_id2>: <uniprot_id2>, }
Parameters: - gene_to_uniprot_dict (dict) – Dictionary of mappings as shown above
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
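Before passing a manual mapping, it can help to check that every value looks like a UniProt accession. The sketch below uses the accession pattern published in the UniProt help pages; the function name `check_uniprot_mapping` and the gene IDs are hypothetical, and the accessions are only format examples.

```python
import re

# Accession pattern as published in the UniProt help pages
UNIPROT_ACCESSION = re.compile(
    r'^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$'
)


def check_uniprot_mapping(gene_to_uniprot_dict):
    """Return the gene IDs whose mapped value does not look like a UniProt accession."""
    return [gene for gene, acc in gene_to_uniprot_dict.items()
            if not UNIPROT_ACCESSION.match(acc)]


# Hypothetical gene IDs; the last value is deliberately malformed
gene_to_uniprot_dict = {'gene1': 'P00561', 'gene2': 'P00547', 'gene3': 'not_an_id'}
bad = check_uniprot_mapping(gene_to_uniprot_dict)
```

This only checks the format of each accession, not whether it exists in UniProt or belongs to the gene.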
-
map_uniprot_to_pdb
(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]¶ Map each representative sequence’s UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.
The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html It returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the same, by resolution.
Parameters: - seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
- outdir (str) – Output directory to cache JSON results of search
- force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns: A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type: list
-
missing_homology_models
¶ list – List of genes with no mapping to any homology models.
-
missing_kegg_mapping
¶ list – List of genes with no mapping to KEGG.
-
missing_pdb_structures
¶ list – List of genes with no mapping to any experimental PDB structure.
-
missing_representative_sequence
¶ list – List of genes with no mapping to a representative sequence.
-
missing_representative_structure
¶ list – List of genes with no mapping to a representative structure.
-
missing_uniprot_mapping
¶ list – List of genes with no mapping to UniProt.
-
model
= None¶ Model – COBRApy model object
-
model_dir
¶ str – Directory where original GEMs and GEM-related files are stored.
-
pdb_downloader_and_metadata
(outdir=None, pdb_file_type=None, force_rerun=False)[source]¶ Download ALL mapped experimental structures to each protein’s structures directory.
Parameters: - outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
-
pdb_file_type
= None¶ str – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
-
prep_itasser_modeling
(itasser_installation, itlib_folder, runtype, create_in_dir=None, execute_from_dir=None, all_genes=False, print_exec=False, **kwargs)[source]¶ Prepare to run I-TASSER homology modeling for genes without structures, or all genes.
Parameters: - itasser_installation (str) – Path to I-TASSER folder, e.g. ~/software/I-TASSER4.4
- itlib_folder (str) – Path to ITLIB folder, e.g. ~/software/ITLIB
- runtype – How you will be running I-TASSER - local, slurm, or torque
- create_in_dir (str) – Local directory where folders will be created
- execute_from_dir (str) – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
- all_genes (bool) – If all genes should be prepped, or only those without any mapped structures
- print_exec (bool) – If the execution statement should be printed to run modelling
Todo
- Document kwargs - extra options for I-TASSER, SLURM or Torque execution
- Allow modeling of any sequence in sequences attribute, select by ID or provide SeqProp?
-
root_dir
¶ str – Directory where the GEM-PRO project folder named after the attribute base_dir is located.
-
set_representative_sequence
(force_rerun=False)[source]¶ Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.
Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
Parameters: force_rerun (bool) – Set to True to recheck stored sequences
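The precedence rules above can be written out as a small decision function. This is only a sketch of the documented rules, not ssbio's implementation; the function name `choose_representative` and the record layout (a dict with `id` and `pdbs` keys) are assumptions made for illustration.

```python
def choose_representative(manual=None, uniprot=None, kegg=None):
    """Pick one sequence record following the documented precedence.

    Each argument is None or a dict with an 'id' and a list of 'pdbs'.
    """
    if manual is not None:
        return manual  # manually set sequences override everything
    if uniprot is not None and kegg is not None:
        # KEGG wins only when it has PDBs and UniProt has none
        if kegg['pdbs'] and not uniprot['pdbs']:
            return kegg
        return uniprot
    # At most one automatic mapping is available
    return uniprot if uniprot is not None else kegg


# Hypothetical records
unp = {'id': 'uniprot_seq', 'pdbs': []}
keg = {'id': 'kegg_seq', 'pdbs': ['hyp1']}
picked = choose_representative(uniprot=unp, kegg=keg)
```

In this example the KEGG record is picked, because it has an associated PDB and the UniProt record does not.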
-
set_representative_structure
(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]¶ Set the representative structure for each protein, chosen from the structures in its structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.
Parameters: - seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Fraction (in decimal form) of the total length of the reference sequence which will be ignored when checking for modifications, split evenly between the two termini. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
- clean (bool) – If structures should be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
Todo
- Remedy large structure representative setting
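The arithmetic behind allow_missing_on_termini can be sketched as follows. The function name `checked_residue_window` is hypothetical, and the rounding rule is an assumption; the code only reproduces the documented example, in which half of the allowed fraction is trimmed from each terminus.

```python
def checked_residue_window(ref_seq_length, allow_missing_on_termini):
    """Return (first, last) residue numbers bounding the region checked for modifications.

    Half of the allowed fraction is trimmed from each terminus, matching the
    documented example (0.1 on a 100 AA sequence gives residues 5 to 95).
    """
    trim = int(round(ref_seq_length * allow_missing_on_termini / 2))
    return trim, ref_seq_length - trim


window = checked_residue_window(100, 0.1)          # documented example
default_window = checked_residue_window(100, 0.2)  # default value of the parameter
```

With the default of 0.2, the same 100 AA reference sequence would be checked over a narrower window than in the documented 0.1 example.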
-
structures_dir
¶ str – Directory where all structures are stored.
-
uniprot_mapping_and_metadata
(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]¶ Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.
Parameters: - model_gene_source (str) – The database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:
  - Ensembl Genomes - ENSEMBLGENOME_ID (e.g. E. coli b-numbers)
  - Entrez Gene (GeneID) - P_ENTREZGENEID
  - RefSeq Protein - P_REFSEQ_AC
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
-
write_representative_sequences_file
(outname, outdir=None, set_ids_from_model=True)[source]¶ Write all the model’s sequences as a single FASTA file. By default, sets IDs to model gene IDs.
Parameters: - outname (str) – Name of the output FASTA file without the extension
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_ids_from_model (bool) – If the gene ID source should be the model gene IDs, not the original sequence ID
ssbio.databases
¶
PDBProp¶
-
class
ssbio.databases.pdb.
PDBProp
(ident, description=None, chains=None, mapped_chains=None, structure_path=None, file_type=None)[source]¶ Store information about a protein structure from the Protein Data Bank.
Extends the StructProp class to allow initialization of the structure by its PDB ID, and then enabling downloads of the structure file as well as parsing its metadata.
Parameters: - ident (str) –
- description (str) –
- chains (str) –
- mapped_chains (str) –
- structure_path (str) –
- file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
-
biological_assemblies
= None¶ DictList – A list for storing Bioassembly objects related to this PDB ID
-
download_structure_file
(outdir, file_type=None, load_header_metadata=True, force_rerun=False)[source]¶ Download a structure file from the PDB, specifying an output directory and a file type. Optionally download the mmCIF header file and parse data from it to store within this object.
Parameters: - outdir (str) – Path to output directory
- file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- load_header_metadata (bool) – If header metadata should be loaded into this object, fastest with mmtf files
- force_rerun (bool) – If structure file should be downloaded even if it already exists
-
ssbio.databases.pdb.
best_structures
(uniprot_id, outname=None, outdir=None, seq_ident_cutoff=0.0, force_rerun=False)[source]¶ Use the PDBe REST service to query for the best PDB structures for a UniProt ID.
More information can be found here: https://www.ebi.ac.uk/pdbe/api/doc/sifts.html Link used to retrieve results: https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/:accession The service returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the same, by resolution.
Here is the ranking algorithm described by the PDB paper: https://nar.oxfordjournals.org/content/44/D1/D385.full
“Finally, a single quality indicator is also calculated for each entry by taking the harmonic average of all the percentile scores representing model and model-data-fit quality measures and then subtracting 10 times the numerical value of the resolution (in Angstrom) of the entry to ensure that resolution plays a role in characterising the quality of a structure. This single empirical ‘quality measure’ value is used by the PDBe query system to sort results and identify the ‘best’ structure in a given context. At present, entries determined by methods other than X-ray crystallography do not have similar data quality information available and are not considered as ‘best structures’.”
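The quoted quality measure can be written as a short calculation. This is a sketch of the formula as stated in the quotation, not the PDBe implementation; the function name `quality_measure` and the percentile values are made up for illustration.

```python
from statistics import harmonic_mean


def quality_measure(percentile_scores, resolution):
    """Harmonic average of the percentile scores minus 10 times the resolution (Angstrom)."""
    return harmonic_mean(percentile_scores) - 10 * resolution


# Hypothetical percentile scores for two entries that differ only in resolution
score_low_res = quality_measure([40, 60], 3.0)
score_high_res = quality_measure([40, 60], 2.0)
```

With identical percentile scores, the entry with the numerically smaller (better) resolution gets the higher quality measure.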
Parameters: - uniprot_id (str) – UniProt Accession ID
- outname (str) – Basename of the output file of JSON results
- outdir (str) – Path to output directory of JSON results
- seq_ident_cutoff (float) – Cutoff results based on percent coverage (in decimal form)
- force_rerun (bool) – Obtain best structures mapping ignoring previously downloaded results
Returns: - Rank-ordered list of dictionaries representing chain-specific PDB entries. Keys are:
- pdb_id: the PDB ID which maps to the UniProt ID
- chain_id: the specific chain of the PDB which maps to the UniProt ID
- coverage: the percent coverage of the entire UniProt sequence
- resolution: the resolution of the structure
- start: the structure residue number which maps to the start of the mapped sequence
- end: the structure residue number which maps to the end of the mapped sequence
- unp_start: the sequence residue number which maps to the structure start
- unp_end: the sequence residue number which maps to the structure end
- experimental_method: type of experiment used to determine structure
- tax_id: taxonomic ID of the protein’s original organism
Return type: list
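A minimal sketch of how such a list of dictionaries might be filtered and ordered, using only the coverage and resolution keys listed above. The function name `filter_and_rank`, the inclusive cutoff, and the records (including the placeholder PDB IDs) are assumptions for illustration; the real service already returns its results rank-ordered.

```python
def filter_and_rank(best_structures, coverage_cutoff=0.0):
    """Keep entries at or above the coverage cutoff, then sort them.

    Sorting is by highest coverage first; ties are broken by the
    numerically lowest (best) resolution.
    """
    kept = [s for s in best_structures if s['coverage'] >= coverage_cutoff]
    return sorted(kept, key=lambda s: (-s['coverage'], s['resolution']))


# Hypothetical records with placeholder PDB IDs
records = [
    {'pdb_id': 'hyp1', 'chain_id': 'A', 'coverage': 0.95, 'resolution': 2.5},
    {'pdb_id': 'hyp2', 'chain_id': 'A', 'coverage': 0.95, 'resolution': 1.8},
    {'pdb_id': 'hyp3', 'chain_id': 'B', 'coverage': 0.40, 'resolution': 1.2},
]
ranked = filter_and_rank(records, coverage_cutoff=0.5)
ranked_ids = [r['pdb_id'] for r in ranked]
```

Here the low-coverage entry is dropped despite its better resolution, and the two remaining entries are ordered by resolution because their coverage is equal.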
-
ssbio.databases.pdb.
blast_pdb
(seq, outfile='', outdir='', evalue=0.0001, seq_ident_cutoff=0.0, link=False, force_rerun=False)[source]¶ Returns a list of BLAST hits of a sequence to available structures in the PDB.
Parameters: - seq (str) – Your sequence, in string format
- outfile (str) – Name of output file
- outdir (str, optional) – Path to output directory. Default is the current directory.
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- link (bool, optional) – Set to True if a link to the HTML results should be displayed
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
Returns: Rank ordered list of BLAST hits in dictionaries.
Return type: list
-
ssbio.databases.pdb.
download_mmcif_header
(pdb_id, outdir='', force_rerun=False)[source]¶ Download a mmCIF header file from the RCSB PDB by ID.
Parameters: - pdb_id – PDB ID
- outdir – Optional output directory, default is current working directory
- force_rerun – If the file should be downloaded again even if it exists
Returns: Path to outfile
Return type: str
-
ssbio.databases.pdb.
download_sifts_xml
(pdb_id, outdir='', force_rerun=False)[source]¶ Download the SIFTS file for a PDB ID.
Parameters: - pdb_id (str) – PDB ID
- outdir (str) – Output directory, current working directory if not specified.
- force_rerun (bool) – If the file should be downloaded again even if it exists
Returns: Path to downloaded file
Return type: str
-
ssbio.databases.pdb.
download_structure
(pdb_id, file_type, outdir='', only_header=False, force_rerun=False)[source]¶ Download a structure from the RCSB PDB by ID. Specify the file type desired.
Parameters: - pdb_id – PDB ID
- file_type – pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz
- outdir – Optional output directory
- only_header – If only the header file should be downloaded
- force_rerun – If the file should be downloaded again even if it exists
Returns: Path to outfile
Return type: str
Deprecated since version 1.0: This will be removed in 2.0. Use Biopython’s PDBList.retrieve_pdb_file function instead
-
ssbio.databases.pdb.
get_bioassembly_info
(pdb_id, biomol_num, cache=False, outdir=None, force_rerun=False)[source]¶ Get metadata about a bioassembly from the RCSB PDB’s REST API.
See: https://www.rcsb.org/pdb/rest/bioassembly/bioassembly?structureId=1hv4&nr=1 The API returns an XML file containing the information on a biological assembly that looks like this:
<bioassembly structureId="1HV4" assemblyNr="1" method="PISA" desc="author_and_software_defined_assembly">
  <transformations operator="1" chainIds="A,B,C,D">
    <transformation index="1">
      <matrix m11="1.00000000" m12="0.00000000" m13="0.00000000" m21="0.00000000" m22="1.00000000" m23="0.00000000" m31="0.00000000" m32="0.00000000" m33="1.00000000"/>
      <shift v1="0.00000000" v2="0.00000000" v3="0.00000000"/>
    </transformation>
  </transformations>
</bioassembly>
Parameters: - pdb_id (str) – PDB ID
- biomol_num (int) – Biological assembly number you are interested in
- cache (bool) – If the XML file should be downloaded
- outdir (str) – If cache, then specify the output directory
- force_rerun (bool) – If cache, and if file exists, specify if API should be queried again
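The XML shown above can be read with the standard library. The sketch below parses that exact example into plain Python structures; the function name `parse_bioassembly` and the layout of the returned dictionary are assumptions for illustration and are not necessarily what get_bioassembly_info returns.

```python
import xml.etree.ElementTree as ET

# The example response from the documentation above
BIOASSEMBLY_XML = """<bioassembly structureId="1HV4" assemblyNr="1" method="PISA" desc="author_and_software_defined_assembly">
  <transformations operator="1" chainIds="A,B,C,D">
    <transformation index="1">
      <matrix m11="1.00000000" m12="0.00000000" m13="0.00000000" m21="0.00000000" m22="1.00000000" m23="0.00000000" m31="0.00000000" m32="0.00000000" m33="1.00000000"/>
      <shift v1="0.00000000" v2="0.00000000" v3="0.00000000"/>
    </transformation>
  </transformations>
</bioassembly>"""


def parse_bioassembly(xml_text):
    """Extract chain IDs, rotation matrices, and shift vectors from the XML."""
    root = ET.fromstring(xml_text)
    groups = []
    for transformations in root.findall('transformations'):
        operations = []
        for t in transformations.findall('transformation'):
            m = t.find('matrix').attrib
            s = t.find('shift').attrib
            # Rebuild the 3x3 matrix from attributes m11..m33
            matrix = [[float(m['m{}{}'.format(r, c)]) for c in (1, 2, 3)]
                      for r in (1, 2, 3)]
            shift = [float(s['v{}'.format(i)]) for i in (1, 2, 3)]
            operations.append({'index': int(t.get('index')),
                               'matrix': matrix, 'shift': shift})
        groups.append({'chains': transformations.get('chainIds').split(','),
                       'operations': operations})
    return {'pdb_id': root.get('structureId'),
            'assembly': int(root.get('assemblyNr')),
            'method': root.get('method'),
            'transformations': groups}


info = parse_bioassembly(BIOASSEMBLY_XML)
```

For this example, the single transformation is the identity matrix with a zero shift applied to chains A through D.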
-
ssbio.databases.pdb.
get_num_bioassemblies
(pdb_id, cache=False, outdir=None, force_rerun=False)[source]¶ Check if there are bioassemblies using the PDB REST API, and if there are, get the number of bioassemblies available.
See: https://www.rcsb.org/pages/webservices/rest, section ‘List biological assemblies’
Not all PDB entries have biological assemblies available and some have multiple. Details that are necessary to recreate a biological assembly from the asymmetric unit can be accessed from the following requests.
- Number of biological assemblies associated with a PDB entry
- Access the transformation information needed to generate a biological assembly (nr=0 will return information for the asymmetric unit, nr=1 will return information for the first assembly, etc.)
A query of https://www.rcsb.org/pdb/rest/bioassembly/nrbioassemblies?structureId=1hv4 returns this:
<nrBioAssemblies structureId="1HV4" hasAssemblies="true" count="2"/>
Parameters: - pdb_id (str) – PDB ID
- cache (bool) – If the XML file should be downloaded
- outdir (str) – If cache, then specify the output directory
- force_rerun (bool) – If cache, and if file exists, specify if API should be queried again
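Reading the count out of the response shown above takes a few lines with the standard library. The function name `count_bioassemblies` is hypothetical, and returning 0 when hasAssemblies is not "true" is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET


def count_bioassemblies(xml_text):
    """Return the number of biological assemblies reported in the XML."""
    root = ET.fromstring(xml_text)
    if root.get('hasAssemblies') != 'true':
        return 0  # assumed fallback when no assemblies are reported
    return int(root.get('count'))


# The example response from the documentation above
n = count_bioassemblies(
    '<nrBioAssemblies structureId="1HV4" hasAssemblies="true" count="2"/>'
)
```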
-
ssbio.databases.pdb.
get_release_date
(pdb_id)[source]¶ Quick way to get the release date of a PDB ID using the table of results from the REST service
Returns None if the release date is not available.
Returns: Release date of a PDB ID
Return type: str
-
ssbio.databases.pdb.
get_resolution
(pdb_id)[source]¶ Quick way to get the resolution of a PDB ID using the table of results from the REST service
Returns infinity if the resolution is not available.
Returns: resolution of a PDB ID in Angstroms
Return type: float
Todo
- Unit test
-
ssbio.databases.pdb.
map_uniprot_resnum_to_pdb
(uniprot_resnum, chain_id, sifts_file)[source]¶ Map a UniProt residue number to its corresponding PDB residue number.
This function requires that the SIFTS file be downloaded, and also a chain ID (as different chains may have different mappings).
Parameters: - uniprot_resnum (int) – integer of the residue number you’d like to map
- chain_id (str) – string of the PDB chain to map to
- sifts_file (str) – Path to the SIFTS XML file
Returns: tuple containing:
- mapped_resnum (int): Mapped residue number
- is_observed (bool): Indicates if the 3D structure actually shows the residue
Return type: (tuple)
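The idea of a residue-level lookup can be sketched on a tiny XML fragment. The fragment below is a simplified, namespace-free layout loosely modeled on SIFTS residue cross-references; real SIFTS files are namespaced and considerably richer, so the element layout, the function name `map_resnum`, and the residue numbers are all assumptions for illustration rather than ssbio's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a SIFTS file
SIFTS_LIKE_XML = """<entry>
  <residue>
    <crossRefDb dbSource="PDB" dbResNum="10" dbChainId="A"/>
    <crossRefDb dbSource="UniProt" dbResNum="25"/>
  </residue>
  <residue>
    <crossRefDb dbSource="PDB" dbResNum="11" dbChainId="A"/>
    <crossRefDb dbSource="UniProt" dbResNum="26"/>
    <residueDetail property="Annotation">Not_Observed</residueDetail>
  </residue>
</entry>"""


def map_resnum(uniprot_resnum, chain_id, xml_text):
    """Return (mapped_resnum, is_observed) for a UniProt residue number on one chain."""
    root = ET.fromstring(xml_text)
    for residue in root.iter('residue'):
        refs = {ref.get('dbSource'): ref for ref in residue.findall('crossRefDb')}
        pdb, unp = refs.get('PDB'), refs.get('UniProt')
        if pdb is None or unp is None:
            continue
        same_chain = pdb.get('dbChainId') == chain_id
        if same_chain and int(unp.get('dbResNum')) == uniprot_resnum:
            details = [d.text for d in residue.findall('residueDetail')]
            return int(pdb.get('dbResNum')), 'Not_Observed' not in details
    return None, False  # no mapping found for this residue and chain


observed = map_resnum(25, 'A', SIFTS_LIKE_XML)
unobserved = map_resnum(26, 'A', SIFTS_LIKE_XML)
```

The chain ID matters because the same UniProt residue can map to different residue numbers on different chains.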
-
ssbio.databases.pdb.
parse_mmcif_header
(infile)[source]¶ Parse a couple important fields from the mmCIF file format with some manual curation of ligands.
If you want full access to the mmCIF file just use the MMCIF2Dict class in Biopython.
Parameters: infile – Path to mmCIF file
Returns: Dictionary of parsed header
Return type: dict
-
ssbio.databases.pdb.
parse_mmtf_header
(infile)[source]¶ Parse an MMTF file and return basic header-like information.
Parameters: infile (str) – Path to MMTF file
Returns: Dictionary of parsed header
Return type: dict
Todo
- Can this be sped up by not parsing the 3D coordinate info somehow?
- OR just store the sequences when this happens since it is already being parsed.
PISA¶
-
ssbio.databases.pisa.
download_pisa_multimers_xml
(pdb_ids, save_single_xml_files=True, outdir=None, force_rerun=False)[source]¶ Download the PISA XML file for multimers.
See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info
- XML description of macromolecular assemblies:
- http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contains XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.
Parameters: - pdb_ids (str, list) – PDB ID or list of IDs
- save_single_xml_files (bool) – If single XML files should be saved per PDB ID. If False, if multiple PDB IDs are provided, then a single, combined XML output file is downloaded
- outdir (str) – Directory to output PISA XML files
- force_rerun (bool) – Redownload files if they already exist
Returns: List of files downloaded
Return type: list
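The batching advice above (20-50 codes per request, comma-separated with no spaces) can be sketched as follows. This is an illustration only: batch_pisa_urls is a hypothetical helper, not an ssbio function, and it only builds the URLs without sending any request.

```python
PISA_MULTIMERS_URL = "http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?"


def batch_pisa_urls(pdb_ids, batch_size=50):
    """Yield one multimers.pisa URL per batch of PDB codes."""
    if isinstance(pdb_ids, str):
        pdb_ids = [pdb_ids]
    cleaned = [pdb_id.strip() for pdb_id in pdb_ids]
    for start in range(0, len(cleaned), batch_size):
        batch = cleaned[start:start + batch_size]
        # Comma-separated with strictly no spaces, as the PISA page requires.
        yield PISA_MULTIMERS_URL + ",".join(batch)


# Placeholder PDB codes, batched two at a time for the demo.
urls = list(batch_pisa_urls(["1abc", "2xyz ", "3def"], batch_size=2))
```

A caller would fetch these URLs one at a time, waiting for each download to finish before requesting the next.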
-
ssbio.databases.pisa.
parse_pisa_multimers_xml
(pisa_multimers_xml, download_structures=False, outdir=None, force_rerun=False)[source]¶ Retrieve PISA information from an XML results file
See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info
- XML description of macromolecular assemblies:
- http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contains XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.
Parameters: - pisa_multimers_xml (str) – Path to PISA XML output file
- download_structures (bool) – If assembly files should be downloaded
- outdir (str) – Directory to output assembly files
- force_rerun (bool) – Redownload files if they already exist
Returns: Dictionary of parsed PISA information
Return type: dict
-
ssbio.databases.pisa.
pdb_chain_stoichiometry_biomolone
(pdbid)[source]¶ Get the stoichiometry of the chains in biological assembly 1 as a dictionary.
Steps taken are:
1) Download PDB and parse header, make biomolecule if provided
2) Count how many times each chain appears in biomolecule #1
3) Convert chain ID to UniProt ID
4) Return final dictionary
Parameters: pdbid (str) – 4 character PDB ID Returns: {(ChainID, UniProtID): # occurrences} Return type: dict
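Steps 2 to 4 amount to a counting exercise. The sketch below assumes the chain list and the chain-to-UniProt mapping are already available; both inputs are hypothetical placeholders, and chain_stoichiometry is not an ssbio function.

```python
from collections import Counter

# Hypothetical inputs: chains present in biological assembly 1 (a chain that
# is copied by a symmetry operation appears more than once) and a chain to
# UniProt mapping. The accessions are placeholders.
biomol_chains = ["A", "A", "B", "B", "C"]
chain_to_uniprot = {"A": "P00001", "B": "P00001", "C": "P00002"}


def chain_stoichiometry(chains, chain_map):
    """Count chain occurrences and key them by (chain ID, UniProt ID)."""
    counts = Counter(chains)
    return {(chain, chain_map.get(chain)): n for chain, n in counts.items()}


stoich = chain_stoichiometry(biomol_chains, chain_to_uniprot)
```

Chains without a known UniProt mapping end up keyed with None in this sketch, which keeps them visible rather than silently dropping them.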
SWISSMODEL¶
-
class
ssbio.databases.swissmodel.
SWISSMODEL
(metadata_dir)[source]¶ Methods to parse through a SWISS-MODEL metadata set.
Download a particular organism’s metadata from SWISS-MODEL here: https://swissmodel.expasy.org/repository
Parameters: metadata_dir (str) – Path to the extracted SWISS-MODEL_Repository folder -
all_models
= None¶ dict – Dictionary of lists, UniProt ID as the keys
-
download_models
(uniprot_acc, outdir='', force_rerun=False)[source]¶ Download all models available for a UniProt accession number.
Parameters: - uniprot_acc (str) – UniProt ACC/ID
- outdir (str) – Path to output directory, uses working directory if not set
- force_rerun (bool) – Force a redownload of the models if they already exist
Returns: Paths to the downloaded models
Return type: list
-
get_model_filepath
(infodict)[source]¶ Get the path to the homology model using information from the index dictionary for a single model.
- Example: use self.get_models(UNIPROT_ID) to get all the models, which returns a list of dictionaries.
- Use one of those dictionaries as input to this function to get the filepath to the model itself.
Parameters: infodict (dict) – Information about a model from get_models Returns: Path to homology model Return type: str
-
get_models
(uniprot_acc)[source]¶ Return all available models for a UniProt accession number.
Parameters: uniprot_acc (str) – UniProt ACC/ID Returns: All available models in SWISS-MODEL for this UniProt entry Return type: dict
-
metadata_dir
= None¶ str – Path to the extracted SWISS-MODEL_Repository folder
-
metadata_index_json
¶ str – Path to the INDEX_JSON file.
-
organize_models
(outdir, force_rerun=False)[source]¶ Organize and rename SWISS-MODEL models to a single folder with a name containing template information.
Parameters: - outdir (str) – New directory to copy renamed models to
- force_rerun (bool) – If models should be copied again even if they already exist
Returns: Dictionary of lists, UniProt IDs as the keys and new file paths as the values
Return type: dict
-
uniprots_modeled
¶ list – Return all UniProt accession numbers with at least one model
-
-
ssbio.databases.swissmodel.
get_oligomeric_state
(swiss_model_path)[source]¶ Parse the oligomeric prediction in a SWISS-MODEL repository file
As of 2018-02-26, works on all E. coli models. Untested on other pre-made organism models.
Parameters: swiss_model_path (str) – Path to SWISS-MODEL PDB file Returns: Information parsed about the oligomeric state Return type: dict
-
ssbio.databases.swissmodel.
translate_ostat
(ostat)[source]¶ Translate the OSTAT field to an integer.
As of 2018-02-26, works on all E. coli models. Untested on other pre-made organism models.
Parameters: ostat (str) – Predicted oligomeric state of the PDB file Returns: Oligomeric state as an integer Return type: int
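A translation of this kind is essentially a table lookup. The sketch below uses a hypothetical table, since the labels SWISS-MODEL actually writes may differ, and translate_ostat_sketch is an illustrative name rather than the ssbio implementation.

```python
# Hypothetical lookup table; the labels SWISS-MODEL actually writes may differ.
OSTAT_TO_INT = {
    "monomer": 1,
    "homo-dimer": 2,
    "homo-trimer": 3,
    "homo-tetramer": 4,
}


def translate_ostat_sketch(ostat):
    """Translate an oligomeric-state label to an integer, or None if unknown."""
    return OSTAT_TO_INT.get(ostat.strip().lower())
```

Returning None for unknown labels makes untested organism models fail visibly instead of being assigned a wrong state.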
UniProtProp¶
-
class
ssbio.databases.uniprot.
UniProtProp
(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, xml_path=None, gff_path=None)[source]¶ Generic class to store information on a UniProt entry, extended from a SeqProp object.
The main utilities of this class are to:
- Download and/or parse UniProt text or xml files
- Store extra parsed information in attributes
-
uniprot
¶ str – Main UniProt accession code
-
alt_uniprots
¶ list – Alternate accession codes that point to the main one
-
file_type
¶ str – Metadata file type
-
reviewed
¶ bool – If this entry is a “reviewed” entry. If None, then status is unknown.
-
ec_number
¶ str – EC number
-
pfam
¶ list – PFAM IDs
-
entry_version
¶ str – Date of last update of the UniProt entry
-
seq_version
¶ str – Date of last update of the UniProt sequence
-
features
¶ list – Get the features from the feature file, metadata file, or in memory
-
ranking_score
()[source]¶ Provide a score for this UniProt ID based on reviewed (True=1, False=0) + number of PDBs
Returns: Scoring for this ID Return type: int
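The scoring rule described above is simple enough to sketch directly. The function name and the candidate entries below are hypothetical and only illustrate how such a score could be used to pick between several mapped IDs.

```python
def ranking_score_sketch(reviewed, pdb_ids):
    """Score = 1 if reviewed else 0, plus the number of mapped PDB IDs."""
    return int(bool(reviewed)) + len(pdb_ids)


# Hypothetical candidates: (label, reviewed flag, mapped PDB IDs).
candidates = [
    ("ENTRY_A", True, []),
    ("ENTRY_B", False, ["1abc", "2xyz"]),
    ("ENTRY_C", None, []),  # unknown review status counts as 0 here
]
best = max(candidates, key=lambda c: ranking_score_sketch(c[1], c[2]))
```

Treating an unknown review status as 0 is an assumption of this sketch.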
-
seq
¶ Seq – Get the Seq object from the sequence file, metadata file, or in memory
-
ssbio.databases.uniprot.
blast_uniprot
(seq_str, seq_ident=1, evalue=0.0001, reviewed_only=True, organism=None)[source]¶ BLAST the UniProt db to find what IDs match the sequence input
Parameters: - seq_str – Sequence string
- seq_ident – Percent identity to match
- evalue – E-value of BLAST hit
Returns:
-
ssbio.databases.uniprot.
download_uniprot_file
(uniprot_id, filetype, outdir='', force_rerun=False)[source]¶ Download a UniProt file for a UniProt ID/ACC
Parameters: - uniprot_id – Valid UniProt ID
- filetype – txt, fasta, xml, rdf, or gff
- outdir – Directory to download the file
Returns: Absolute path to file
Return type: str
-
ssbio.databases.uniprot.
get_fasta
(uniprot_id)[source]¶ Get the protein sequence for a UniProt ID as a string.
Parameters: uniprot_id – Valid UniProt ID Returns: String of the protein (amino acid) sequence Return type: str
-
ssbio.databases.uniprot.
is_valid_uniprot_id
(instring)[source]¶ Check if a string is a valid UniProt ID.
See regex from: http://www.uniprot.org/help/accession_numbers
Parameters: instring – any string identifier Returns: True if the string is a valid UniProt ID
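A check of this kind can be sketched with the accession pattern published on the UniProt help page linked above. Whether ssbio applies exactly this pattern is an assumption, and looks_like_uniprot_acc is a hypothetical name.

```python
import re

# Accession pattern as published on the UniProt accession numbers help page.
UNIPROT_ACC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_acc(instring):
    """Return True if the whole string matches the accession pattern."""
    return UNIPROT_ACC.fullmatch(instring) is not None
```

Note that a pattern match only says the string is well formed, not that the accession exists in UniProt.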
-
ssbio.databases.uniprot.
old_parse_uniprot_txt_file
(infile)[source]¶ Adapted from the boscoh/uniprot GitHub repository. Parses the text of metadata retrieved from uniprot.org.
Only a few fields have been parsed, but this provides a template for the other fields.
A single description is generated from joining alternative descriptions.
Returns a dictionary with the main UNIPROT ACC as keys.
-
ssbio.databases.uniprot.
parse_uniprot_txt_file
(infile)[source]¶ Parse a raw UniProt metadata file and return a dictionary.
Parameters: infile – Path to metadata file Returns: Metadata dictionary Return type: dict
-
ssbio.databases.uniprot.
parse_uniprot_xml_metadata
(sr)[source]¶ Load relevant attributes and dbxrefs from a parsed UniProt XML file in a SeqRecord.
Returns: All parsed information Return type: dict
-
ssbio.databases.uniprot.
uniprot_ec
(uniprot_id)[source]¶ Retrieve the EC number annotation for a UniProt ID.
Parameters: uniprot_id – Valid UniProt ID Returns:
-
ssbio.databases.uniprot.
uniprot_reviewed_checker
(uniprot_id)[source]¶ Check if a single UniProt ID is reviewed or not.
Parameters: uniprot_id – Returns: If the entry is reviewed Return type: bool
-
ssbio.databases.uniprot.
uniprot_reviewed_checker_batch
(uniprot_ids)[source]¶ Batch check if uniprot IDs are reviewed or not
Parameters: uniprot_ids – UniProt ID or list of UniProt IDs Returns: A dictionary of {UniProtID: Boolean} Return type: dict
-
ssbio.databases.uniprot.
uniprot_sites
(uniprot_id)[source]¶ Retrieve a list of UniProt sites parsed from the feature file
Sites are defined here: http://www.uniprot.org/help/site and here: http://www.uniprot.org/help/function_section
Parameters: uniprot_id – Valid UniProt ID Returns:
KEGGProp¶
-
class
ssbio.databases.kegg.
KEGGProp
(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, txt_path=None, gff_path=None)[source]¶
-
ssbio.databases.kegg.
download_kegg_aa_seq
(gene_id, outdir=None, force_rerun=False)[source]¶ Download a FASTA sequence of a protein from the KEGG database and return the path.
Parameters: - gene_id – the gene identifier
- outdir – optional path to output directory
Returns: Path to FASTA file
-
ssbio.databases.kegg.
download_kegg_gene_metadata
(gene_id, outdir=None, force_rerun=False)[source]¶ Download the KEGG flatfile for a KEGG ID and return the path.
Parameters: - gene_id – KEGG gene ID (with organism code), e.g. “eco:1244”
- outdir – optional output directory of metadata
Returns: Path to metadata file
-
ssbio.databases.kegg.
map_kegg_all_genes
(organism_code, target_db)[source]¶ Map all of an organism’s gene IDs to the target database.
This is faster than supplying a specific list of genes to map, plus there seems to be a limit on the number you can map with a manual REST query anyway.
Parameters: - organism_code – the three letter KEGG code of your organism
- target_db – ncbi-proteinid | ncbi-geneid | uniprot
Returns: Dictionary of ID mapping
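Turning a whole-organism mapping response into a dictionary might look like the sketch below. It assumes the response is two-column, tab-separated text with "db:" prefixes on each identifier. The sample text is made up, and parse_kegg_conv is a hypothetical helper rather than part of ssbio.

```python
def parse_kegg_conv(text):
    """Parse tab-separated 'source<TAB>target' lines into a dict."""
    mapping = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        source, target = line.split("\t")
        # Drop the 'db:' prefixes, e.g. 'eco:b0001' -> 'b0001'.
        gene_id = source.split(":", 1)[1]
        target_id = target.split(":", 1)[1]
        mapping[gene_id] = target_id
    return mapping


# Made-up response text in the two-column layout assumed above.
sample = "eco:b0001\tuniprot:ACC0001\neco:b0002\tuniprot:ACC0002\n"
id_map = parse_kegg_conv(sample)
```

If a gene maps to several target IDs, this sketch keeps only the last one. A dictionary of lists would preserve all of them.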
ssbio.protein.structure.utils
¶
CleanPDB¶
-
ssbio.protein.structure.utils.cleanpdb.
clean_pdb
(pdb_file, out_suffix='_clean', outdir=None, force_rerun=False, remove_atom_alt=True, keep_atom_alt_id='A', remove_atom_hydrogen=True, add_atom_occ=True, remove_res_hetero=True, keep_chemicals=None, keep_res_only=None, add_chain_id_if_empty='X', keep_chains=None)[source]¶ Clean a PDB file.
Parameters: - pdb_file (str) – Path to input PDB file
- out_suffix (str) – Suffix to append to original filename
- outdir (str) – Path to output directory
- force_rerun (bool) – If structure should be re-cleaned if a clean file exists already
- remove_atom_alt (bool) – Remove alternate positions
- keep_atom_alt_id (str) – If removing alternate positions, which alternate ID to keep
- remove_atom_hydrogen (bool) – Remove hydrogen atoms
- add_atom_occ (bool) – Add atom occupancy fields if not present
- remove_res_hetero (bool) – Remove all HETATMs
- keep_chemicals (str, list) – If removing HETATMs, keep specified chemical names
- keep_res_only (str, list) – Keep ONLY specified resnames, deletes everything else!
- add_chain_id_if_empty (str) – Add a chain ID if not present
- keep_chains (str, list) – Keep only these chains
Returns: Path to cleaned PDB file
Return type: str
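To make a few of these options concrete, here is a minimal line-based sketch of three of the filters (alternate positions, hydrogens, HETATMs) using the fixed-column PDB layout. It is not the ssbio implementation. The function names are hypothetical, hydrogens are detected from the element columns only, and the demo records carry dummy coordinates.

```python
def clean_pdb_lines(lines, keep_alt_id="A", keep_chemicals=()):
    """Filter fixed-column PDB records and return the kept lines."""
    kept = []
    for line in lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            kept.append(line)  # other record types pass through untouched
            continue
        altloc = line[16]               # column 17
        resname = line[17:20].strip()   # columns 18-20
        element = line[76:78].strip()   # columns 77-78
        if record == "HETATM" and resname not in keep_chemicals:
            continue
        if altloc not in (" ", keep_alt_id):
            continue
        if element == "H":
            continue
        # Blank the altLoc column on the conformer that is kept.
        kept.append(line[:16] + " " + line[17:])
    return kept


def atom_line(record, serial, name, altloc, resname, chain, resnum, element):
    """Build a coordinate record with dummy coordinates (for the demo only)."""
    return (
        "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}    "
        "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2}"
    ).format(record, serial, name, altloc, resname, chain, resnum,
             0.0, 0.0, 0.0, 1.0, 0.0, element)


demo = [
    atom_line("ATOM", 1, "CA", "A", "ALA", "X", 1, "C"),
    atom_line("ATOM", 2, "CA", "B", "ALA", "X", 1, "C"),
    atom_line("ATOM", 3, "H", " ", "ALA", "X", 1, "H"),
    atom_line("HETATM", 4, "O", " ", "HOH", "X", 2, "O"),
    atom_line("HETATM", 5, "ZN", " ", "ZN", "X", 3, "ZN"),
]
cleaned = clean_pdb_lines(demo, keep_chemicals=("ZN",))
```

In the demo, only the "A" conformer of the alanine CA atom and the zinc ion survive. The "B" conformer, the hydrogen, and the water are dropped.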
MutatePDB¶
DOCK¶
-
class
ssbio.protein.structure.utils.dock.
DOCK
(structure_id, pdb_file, amb_file, flex1_file, flex2_file, root_dir=None)[source]¶ Class to prepare a structure file for docking with DOCK6.
Attributes:
-
auto_flexdock
(binding_residues, radius, ligand_path=None, force_rerun=False)[source]¶ Run DOCK6 on a PDB file, given its binding residues and a radius around them.
Provide a path to a ligand to dock a ligand to it. If no ligand is provided, DOCK6 preparations will be run on that structure file.
Parameters: - binding_residues (str) – Comma separated string of residues (eg: ‘144,170,199’)
- radius (int, float) – Radius around binding residues to dock to
- ligand_path (str) – Path to ligand (mol2 format) to dock to protein
- force_rerun (bool) – If method should be rerun even if output files exist
-
binding_site_mol2
(residues, force_rerun=False)[source]¶ Create mol2 of only binding site residues from the receptor
This function will take in a .pdb file (preferably the _receptor_noH.pdb file) and a string of residues (eg: ‘144,170,199’) and delete all other residues in the .pdb file. It then saves the coordinates of the selected residues as a .mol2 file. This is necessary for Chimera to select spheres within the radius of the binding site.
Parameters: - residues (str) – Comma separated string of residues (eg: ‘144,170,199’)
- force_rerun (bool) – If method should be rerun even if output file exists
-
dms_maker
(force_rerun=False)[source]¶ Create surface representation (dms file) of receptor
Parameters: force_rerun (bool) – If method should be rerun even if output file exists
-
do_dock6_flexible
(ligand_path, force_rerun=False)[source]¶ Dock a ligand to the protein.
Parameters: - ligand_path (str) – Path to ligand (mol2 format) to dock to protein
- force_rerun (bool) – If method should be rerun even if output file exists
-
dock_dir
¶ str – DOCK folder
-
dockprep
(force_rerun=False)[source]¶ Prepare a PDB file for docking by first converting it to mol2 format.
Parameters: force_rerun (bool) – If method should be rerun even if output file exists
-
grid
(force_rerun=False)[source]¶ Create the scoring grid within the dummy box.
Parameters: force_rerun (bool) – If method should be rerun even if output file exists
-
protein_only_and_noH
(keep_ligands=None, force_rerun=False)[source]¶ Isolate the receptor by stripping everything except protein and specified ligands.
Parameters: - keep_ligands (str, list) – Ligand(s) to keep in PDB file
- force_rerun (bool) – If method should be rerun even if output file exists
-
root_dir
¶ str – Directory where DOCK project folder is located
-
showbox
(force_rerun=False)[source]¶ Create the dummy PDB box around the selected spheres.
Parameters: force_rerun (bool) – If method should be rerun even if output file exists
-
ssbio.protein.structure.properties
¶
Structure Residues¶
-
ssbio.protein.structure.properties.residues.
distance_to_site
(residue_of_interest, residues, model)[source]¶ Calculate the distance between an amino acid and a group of amino acids.
Parameters: - residue_of_interest – Residue number you are interested in (e.g. a mutation)
- residues – List of residue numbers
Returns: Distance (in Angstroms) to the group of residues
Return type: float
-
ssbio.protein.structure.properties.residues.
get_structure_seqrecords
(model)[source]¶ Get a list of SeqRecords for a structure’s sequences.
- Special cases include:
- Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
- HETATMs. Currently written as an “X”, or unknown amino acid.
Parameters: model – Biopython Model object of a Structure Returns: List of SeqRecords Return type: list
-
ssbio.protein.structure.properties.residues.
get_structure_seqs
(pdb_file, file_type)[source]¶ Get a dictionary of a PDB file’s sequences.
- Special cases include:
- Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
- HETATMs. Currently written as an “X”, or unknown amino acid.
Parameters: pdb_file – Path to PDB file Returns: Dictionary of: {chain_id: sequence} Return type: dict
-
ssbio.protein.structure.properties.residues.
hse_output
(pdb_file, file_type)[source]¶ The solvent exposure of an amino acid residue is important for analyzing, understanding and predicting aspects of protein structure and function [73]. A residue’s solvent exposure can be classified into four categories: exposed, partly exposed, buried and deeply buried. Hamelryck et al. [73] established a 2D measure that provides a different view of solvent exposure, half-sphere exposure (HSE). By conceptually dividing the sphere around a residue into two halves, HSE-up and HSE-down, HSE provides a more detailed description of an amino acid residue’s spatial neighborhood. HSE is calculated from a PDB file by the hsexpo module implemented in the BioPython package [74].
http://onlinelibrary.wiley.com/doi/10.1002/prot.20379/abstract
Parameters: pdb_file – Returns:
-
ssbio.protein.structure.properties.residues.
match_structure_sequence
(orig_seq, new_seq, match='X', fill_with='X', ignore_excess=False)[source]¶ Correct a sequence to match inserted X’s in a structure sequence
This is useful for mapping a sequence obtained from structural tools like MSMS or DSSP to the sequence obtained by the get_structure_seqs method.
Examples
>>> structure_seq = 'XXXABCDEF'
>>> prop_list = [4, 5, 6, 7, 8, 9]
>>> match_structure_sequence(structure_seq, prop_list)
['X', 'X', 'X', 4, 5, 6, 7, 8, 9]
>>> match_structure_sequence(structure_seq, prop_list, fill_with=float('Inf'))
[inf, inf, inf, 4, 5, 6, 7, 8, 9]
>>> structure_seq = '---ABCDEF---'
>>> prop_list = ('H','H','H','C','C','C')
>>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-')
('-', '-', '-', 'H', 'H', 'H', 'C', 'C', 'C', '-', '-', '-')
>>> structure_seq = 'ABCDEF---'
>>> prop_list = 'HHHCCC'
>>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-')
'HHHCCC---'
>>> structure_seq = 'AXBXCXDXEXF'
>>> prop_list = ['H', 'H', 'H', 'C', 'C', 'C']
>>> match_structure_sequence(structure_seq, prop_list, match='X', fill_with='X')
['H', 'X', 'H', 'X', 'H', 'X', 'C', 'X', 'C', 'X', 'C']
Parameters: - orig_seq (str, Seq, SeqRecord) – Sequence to match to
- new_seq (str, tuple, list) – Sequence to fill in
- match (str) – What to match
- fill_with – What to fill in when matches are found
- ignore_excess (bool) – If excess sequence on the tail end of new_seq should be ignored
Returns: new_seq which will match the length of orig_seq
Return type: str, tuple, list
-
ssbio.protein.structure.properties.residues.
resname_in_proximity
(resname, model, chains, resnums, threshold=5)[source]¶ Search within the proximity of a defined list of residue numbers and their chains for any specified residue name.
Parameters: - resname (str) – Residue name to search for in proximity of specified chains + resnums
- model – Biopython Model object
- chains (str, list) – Chain ID or IDs to check
- resnums (int, list) – Residue numbers within the chain to check
- threshold (float) – Cutoff in Angstroms for returning True if a RESNAME is near
Returns: True if a RESNAME is within the threshold cutoff
Return type: bool
-
ssbio.protein.structure.properties.residues.
search_ss_bonds
(model, threshold=3.0)[source]¶ Searches for S-S bonds based on distances between atoms in the structure (first model only). The average S-S distance is 2.05 Å; the default threshold is 3 Å. Returns an iterator of tuples of residues.
ADAPTED FROM JOAO RODRIGUES’ BIOPYTHON GSOC PROJECT (http://biopython.org/wiki/GSOC2010_Joao)
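The distance-based search can be sketched without any structure library. The example below assumes the cysteine SG coordinates have already been extracted. The coordinates are made up, find_ss_bonds is a hypothetical name, and treating the threshold as inclusive is an assumption.

```python
import itertools
import math

# Hypothetical SG (cysteine sulfur) coordinates in Angstroms, keyed by residue number.
sg_atoms = {
    10: (0.0, 0.0, 0.0),
    55: (2.05, 0.0, 0.0),   # bonded partner of residue 10
    90: (15.0, 3.0, -4.0),  # free cysteine
}


def find_ss_bonds(sg_coords, threshold=3.0):
    """Yield (resnum_a, resnum_b) pairs whose SG atoms lie within the threshold."""
    pairs = itertools.combinations(sorted(sg_coords.items()), 2)
    for (res_a, xyz_a), (res_b, xyz_b) in pairs:
        if math.dist(xyz_a, xyz_b) <= threshold:
            yield res_a, res_b


bonds = list(find_ss_bonds(sg_atoms))
```

Like the documented function, the sketch is a generator, so pairs are produced lazily as they are found.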
ssbio.protein.sequence.utils
¶
Sequence Alignment¶
-
ssbio.protein.sequence.utils.alignment.
get_alignment_df
(a_aln_seq, b_aln_seq, a_seq_id=None, b_seq_id=None)[source]¶ Summarize two alignment strings in a dataframe.
Parameters: - a_aln_seq (str) – Aligned sequence string
- b_aln_seq (str) – Aligned sequence string
- a_seq_id (str) – Optional ID of a_aln_seq
- b_seq_id (str) – Optional ID of b_aln_seq
Returns: a per-residue level annotation of the alignment
Return type: DataFrame
-
ssbio.protein.sequence.utils.alignment.
get_alignment_df_from_file
(alignment_file, a_seq_id=None, b_seq_id=None)[source]¶ Get a Pandas DataFrame of the Needle alignment results. Contains all positions of the sequences.
Parameters: - alignment_file –
- a_seq_id – Optional specification of the ID of the reference sequence
- b_seq_id – Optional specification of the ID of the aligned sequence
Returns: all positions in the alignment
Return type: Pandas DataFrame
-
ssbio.protein.sequence.utils.alignment.
get_deletions
(aln_df)[source]¶ Get a list of tuples indicating the first and last residues of a deletion region, as well as the length of the deletion.
Examples
# Deletion of residues 1 to 4, length 4
>>> test = {'id_a': {0: 'a', 1: 'a', 2: 'a', 3: 'a'},
...         'id_a_aa': {0: 'M', 1: 'G', 2: 'I', 3: 'T'},
...         'id_a_pos': {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0},
...         'id_b': {0: 'b', 1: 'b', 2: 'b', 3: 'b'},
...         'id_b_aa': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
...         'id_b_pos': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
...         'type': {0: 'deletion', 1: 'deletion', 2: 'deletion', 3: 'deletion'}}
>>> my_alignment = pd.DataFrame.from_dict(test)
>>> get_deletions(my_alignment)
[((1.0, 4.0), 4)]
Parameters: aln_df (DataFrame) – Alignment DataFrame Returns: A list of tuples with the format ((deletion_start_resnum, deletion_end_resnum), deletion_length) Return type: list
-
ssbio.protein.sequence.utils.alignment.
get_insertions
(aln_df)[source]¶ Get a list of tuples indicating the first and last residues of an insertion region, as well as the length of the insertion.
If the first tuple is:
- (-1, 1), the insertion is at the beginning of the original protein
- (X, Inf), where X is the length of the original protein, the insertion is at the end of the protein
Examples
# Insertion at beginning, length 3
>>> test = {'id_a': {0: 'a', 1: 'a', 2: 'a', 3: 'a'},
...         'id_a_aa': {0: np.nan, 1: np.nan, 2: np.nan, 3: 'M'},
...         'id_a_pos': {0: np.nan, 1: np.nan, 2: np.nan, 3: 1.0},
...         'id_b': {0: 'b', 1: 'b', 2: 'b', 3: 'b'},
...         'id_b_aa': {0: 'M', 1: 'M', 2: 'L', 3: 'M'},
...         'id_b_pos': {0: 1, 1: 2, 2: 3, 3: 4},
...         'type': {0: 'insertion', 1: 'insertion', 2: 'insertion', 3: 'match'}}
>>> my_alignment = pd.DataFrame.from_dict(test)
>>> get_insertions(my_alignment)
[((-1, 1.0), 3)]
Parameters: aln_df (DataFrame) – Alignment DataFrame Returns: A list of tuples with the format ((insertion_start_resnum, insertion_end_resnum), insertion_length) Return type: list
-
ssbio.protein.sequence.utils.alignment.
get_mutations
(aln_df)[source]¶ Get a list of residue numbers (in the original sequence’s numbering) that are mutated
Parameters: - aln_df (DataFrame) – Alignment DataFrame
- just_resnums – If only the residue numbers should be returned, instead of a list of tuples of (original_residue, resnum, mutated_residue)
Returns: Residue mutations
Return type: list
-
ssbio.protein.sequence.utils.alignment.
get_percent_identity
(a_aln_seq, b_aln_seq)[source]¶ Get the percent identity between two alignment strings
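Percent identity has several common definitions that differ in the denominator. The sketch below picks one of them, identical non-gap columns divided by the alignment length, and returns a fraction between 0 and 1. Whether ssbio uses this exact convention is an assumption, and the function name is hypothetical.

```python
def percent_identity_sketch(a_aln_seq, b_aln_seq):
    """Fraction of alignment columns holding the same non-gap residue."""
    if len(a_aln_seq) != len(b_aln_seq):
        raise ValueError("Aligned sequences must have the same length")
    if not a_aln_seq:
        return 0.0
    matches = sum(
        1 for a, b in zip(a_aln_seq, b_aln_seq) if a == b and a != "-"
    )
    return matches / len(a_aln_seq)
```

For example, aligning '--ABCDEF' with 'XXABCDEF' gives 6 identical columns out of 8, or 0.75. Multiply by 100 for a percentage.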
-
ssbio.protein.sequence.utils.alignment.
get_unresolved
(aln_df)[source]¶ Get a list of residue numbers (in the original sequence’s numbering) that are unresolved
Parameters: aln_df (DataFrame) – Alignment DataFrame Returns: Residue numbers that are unresolved Return type: list
-
ssbio.protein.sequence.utils.alignment.
map_resnum_a_to_resnum_b
(a_resnum, a_aln, b_aln)[source]¶ Map a residue number in a sequence to the corresponding residue number in an aligned sequence.
Examples:
>>> map_resnum_a_to_resnum_b(5, '--ABCDEF', 'XXABCDEF')
7
Parameters: - a_resnum (int) – Residue number in the first aligned sequence
- a_aln (str, Seq, SeqRecord) – Aligned sequence string
- b_aln (str, Seq, SeqRecord) – Aligned sequence string
Returns: Residue number in the second aligned sequence
Return type: int
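The mapping comes down to walking the alignment columns while keeping a residue counter for each sequence. The sketch below reproduces the documented example. It is not the ssbio source, map_aligned_resnum is a hypothetical name, and returning None for a residue aligned to a gap is a choice made for this sketch.

```python
def map_aligned_resnum(a_resnum, a_aln, b_aln, gap="-"):
    """Map a 1-based residue number in a_aln to its partner in b_aln."""
    a_count = 0
    b_count = 0
    for a_char, b_char in zip(a_aln, b_aln):
        if a_char != gap:
            a_count += 1
        if b_char != gap:
            b_count += 1
        if a_char != gap and a_count == a_resnum:
            # None signals that the residue is aligned to a gap in b.
            return b_count if b_char != gap else None
    raise ValueError("a_resnum is outside the first sequence")
```

In the example above, residue 5 of the first sequence ("E") sits in the seventh column, where the second sequence has already counted two leading "X" residues, giving 7.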
-
ssbio.protein.sequence.utils.alignment.
needle_statistics
(infile)[source]¶ Reads in a needle alignment file and spits out statistics of the alignment.
Parameters: infile (str) – Alignment file name Returns: alignment_properties - a dictionary telling you the number of gaps, identity, etc. Return type: dict
-
ssbio.protein.sequence.utils.alignment.
pairwise_sequence_alignment
(a_seq, b_seq, engine, a_seq_id=None, b_seq_id=None, gapopen=10, gapextend=0.5, outfile=None, outdir=None, force_rerun=False)[source]¶ Run a global pairwise sequence alignment between two sequence strings.
Parameters: - a_seq (str, Seq, SeqRecord, SeqProp) – Reference sequence
- b_seq (str, Seq, SeqRecord, SeqProp) – Sequence to be aligned to reference
- engine (str) – biopython or needle - which pairwise alignment program to use
- a_seq_id (str) – Reference sequence ID. If not set, is “a_seq”
- b_seq_id (str) – Sequence to be aligned ID. If not set, is “b_seq”
- gapopen (int) – Only for needle - Gap open penalty is the score taken away when a gap is created
- gapextend (float) – Only for needle - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
- outfile (str) – Only for needle - name of output file. If not set, is {id_a}_{id_b}_align.txt
- outdir (str) – Only for needle - Path to output directory. Default is the current directory.
- force_rerun (bool) – Only for needle - Default False, set to True if you want to rerun the alignment if outfile exists.
Returns: Biopython object to represent an alignment
Return type: MultipleSeqAlignment
-
ssbio.protein.sequence.utils.alignment.
run_needle_alignment
(seq_a, seq_b, gapopen=10, gapextend=0.5, outdir=None, outfile=None, force_rerun=False)[source]¶ Run the needle alignment program for two strings and return the raw alignment result.
More info: EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84 Using strings as input: https://www.biostars.org/p/91124/
Parameters: - id_a – ID of reference sequence
- seq_a (str, Seq, SeqRecord) – Reference sequence
- id_b – ID of sequence to be aligned
- seq_b (str, Seq, SeqRecord) – String representation of sequence to be aligned
- gapopen – Gap open penalty is the score taken away when a gap is created
- gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
- outdir (str, optional) – Path to output directory. Default is the current directory.
- outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt
- force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns: Raw alignment result of the needle alignment in srspair format.
Return type: str
-
ssbio.protein.sequence.utils.alignment.
run_needle_alignment_on_files
(id_a, faa_a, id_b, faa_b, gapopen=10, gapextend=0.5, outdir='', outfile='', force_rerun=False)[source]¶ Run the needle alignment program for two fasta files and return the raw alignment result.
More info: EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84
Parameters: - id_a – ID of reference sequence
- faa_a – File path to reference sequence
- id_b – ID of sequence to be aligned
- faa_b – File path to sequence to be aligned
- gapopen – Gap open penalty is the score taken away when a gap is created
- gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
- outdir (str, optional) – Path to output directory. Default is the current directory.
- outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt
- force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns: Raw alignment result of the needle alignment in srspair format.
Return type: str
Sequence BLAST¶
-
ssbio.protein.sequence.utils.blast.
calculate_bbh
(blast_results_1, blast_results_2, r_name=None, g_name=None, outdir='')[source]¶ Calculate the best bidirectional BLAST hits (BBH) and save a dataframe of results.
Parameters: - blast_results_1 (str) – BLAST results for reference vs. other genome
- blast_results_2 (str) – BLAST results for other vs. reference genome
- r_name – Name of reference genome
- g_name – Name of other genome
- outdir – Directory where BLAST results are stored.
Returns: Path to Pandas DataFrame of the BBH results.
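The idea behind best bidirectional hits can be sketched with plain dictionaries. The example assumes hits are ranked by bitscore, which is an assumption since ranking by E-value is also common. The input rows are made up and the function names are hypothetical.

```python
def best_hits(hits):
    """Keep the highest-bitscore subject per query from (query, subject, bitscore) rows."""
    best = {}
    for query, subject, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(ref_vs_other, other_vs_ref):
    """Return {ref_gene: other_gene} pairs that are each other's best hit."""
    forward = best_hits(ref_vs_other)
    backward = best_hits(other_vs_ref)
    return {
        ref_gene: other_gene
        for ref_gene, other_gene in forward.items()
        if backward.get(other_gene) == ref_gene
    }


# Made-up BLAST rows: (query, subject, bitscore).
ref_vs_other = [("r1", "g1", 500.0), ("r1", "g2", 80.0), ("r2", "g2", 300.0)]
other_vs_ref = [("g1", "r1", 490.0), ("g2", "r1", 310.0), ("g2", "r2", 305.0)]
bbh = bidirectional_best_hits(ref_vs_other, other_vs_ref)
```

Here r1 and g1 are reciprocal best hits, while r2 is dropped because its best hit g2 prefers r1 in the reverse search.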
-
ssbio.protein.sequence.utils.blast.
create_orthology_matrix
(r_name, genome_to_bbh_files, pid_cutoff=None, bitscore_cutoff=None, evalue_cutoff=None, filter_condition='OR', outname='', outdir='', force_rerun=False)[source]¶ Create an orthology matrix using best bidirectional BLAST hits (BBH) outputs.
Parameters: - r_name (str) – Name of the reference genome
- genome_to_bbh_files (dict) – Mapping of genome names to the BBH csv output from the calculate_bbh() method
- pid_cutoff (float) – Minimum percent identity between BLAST hits to filter for in the range [0, 100]
- bitscore_cutoff (float) – Minimum bitscore allowed between BLAST hits
- evalue_cutoff (float) – Maximum E-value allowed between BLAST hits
- filter_condition (str) – ‘OR’ or ‘AND’, how to combine cutoff filters. ‘OR’ gives more results since it is less stringent, as you will be filtering for hits with (>80% PID or >30 bitscore or <0.0001 evalue).
- outname – Name of output file of orthology matrix
- outdir – Path to output directory
- force_rerun (bool) – Force recreation of the orthology matrix even if the outfile exists
Returns: Path to orthologous genes matrix.
Return type: str
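The difference between the 'OR' and 'AND' filter conditions can be illustrated with a small predicate. This is a sketch under assumptions: the hit is represented as a dictionary with hypothetical keys, cutoffs are treated as inclusive, and passes_cutoffs is not an ssbio function.

```python
def passes_cutoffs(hit, pid_cutoff=None, bitscore_cutoff=None,
                   evalue_cutoff=None, filter_condition="OR"):
    """Apply the cutoffs that are set and combine them with OR or AND."""
    checks = []
    if pid_cutoff is not None:
        checks.append(hit["pid"] >= pid_cutoff)
    if bitscore_cutoff is not None:
        checks.append(hit["bitscore"] >= bitscore_cutoff)
    if evalue_cutoff is not None:
        checks.append(hit["evalue"] <= evalue_cutoff)
    if not checks:
        return True  # nothing to filter on
    if filter_condition == "OR":
        return any(checks)
    if filter_condition == "AND":
        return all(checks)
    raise ValueError("filter_condition must be 'OR' or 'AND'")


# Made-up hit: strong percent identity, weak bitscore and E-value.
hit = {"pid": 85.0, "bitscore": 25.0, "evalue": 1e-3}
cutoffs = dict(pid_cutoff=80, bitscore_cutoff=30, evalue_cutoff=1e-4)
```

With these cutoffs the hit is kept under 'OR' (it passes the percent identity check) and rejected under 'AND' (it fails the other two), which is why 'OR' returns more results.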
-
ssbio.protein.sequence.utils.blast.
print_run_bidirectional_blast
(reference, other_genome, dbtype, outdir)[source]¶ Write torque submission files for running bidirectional blast on a server and print execution command.
Parameters: - reference (str) – Path to “reference” genome, aka your “base strain”
- other_genome (str) – Path to other genome which will be BLASTed to the reference
- dbtype (str) – “nucl” or “prot” - what format your genome files are in
- outdir (str) – Path to folder where Torque scripts should be placed
-
ssbio.protein.sequence.utils.blast.
run_bidirectional_blast
(reference, other_genome, dbtype, outdir='')[source]¶ BLAST a genome against another, and vice versa.
This function requires BLAST to be installed. On Debian or Ubuntu, install it by running: sudo apt install ncbi-blast+
Parameters: - reference (str) – path to “reference” genome, aka your “base strain”
- other_genome (str) – path to other genome which will be BLASTed to the reference
- dbtype (str) – “nucl” or “prot” - what format your genome files are in
- outdir (str) – path to folder where BLAST outputs should be placed
Returns: Paths to BLAST output files. (reference_vs_othergenome.out, othergenome_vs_reference.out)
-
ssbio.protein.sequence.utils.blast.
run_makeblastdb
(infile, dbtype, outdir='')[source]¶ Make the BLAST database for a genome file.
Parameters: - infile (str) – path to genome FASTA file
- dbtype (str) – “nucl” or “prot” - what format your genome files are in
- outdir (str) – path to directory to output database files (default is original folder)
Returns: Paths to BLAST databases.
ssbio.protein.sequence.properties
¶
Thermostability¶
This module provides functions to predict thermostability parameters (specifically the free energy of unfolding dG) of an amino acid sequence.
These methods are adapted from:
- Oobatake, M., & Ooi, T. (1993). ‘Hydration and heat stability effects on protein unfolding’, Progress in biophysics and molecular biology, 59/3: 237–84.
- Dill, K. A., Ghosh, K., & Schmit, J. D. (2011). ‘Physical limits of cells and proteomes’, Proceedings of the National Academy of Sciences of the United States of America, 108/44: 17876–82. DOI: 10.1073/pnas.1114477108
For an example of usage of these parameters in a genome-scale model:
- Chen, K., Gao, Y., Mih, N., O’Brien, E., Yang, L., Palsson, B.O. (2017). ‘Thermo-sensitivity of growth is determined by chaperone-mediated proteome re-allocation.’, Submitted to PNAS.
-
ssbio.protein.sequence.properties.thermostability.
calculate_dill_dG
(seq_len, temp)[source]¶ Get free energy of unfolding (dG) using Dill method in units J/mol.
Parameters: - seq_len (int) – Length of amino acid sequence
- temp (float) – Temperature in degrees C
Returns: Free energy of unfolding dG (J/mol)
Return type: float
-
ssbio.protein.sequence.properties.thermostability.
calculate_oobatake_dG
(seq, temp)[source]¶ Get free energy of unfolding (dG) using Oobatake method in units cal/mol.
Parameters: - seq (str, Seq, SeqRecord) – Amino acid sequence
- temp (float) – Temperature in degrees C
Returns: Free energy of unfolding dG (cal/mol)
Return type: float
-
ssbio.protein.sequence.properties.thermostability.
calculate_oobatake_dH
(seq, temp)[source]¶ Get dH using Oobatake method in units cal/mol.
Parameters: - seq (str, Seq, SeqRecord) – Amino acid sequence
- temp (float) – Temperature in degrees C
Returns: dH in units cal/mol
Return type: float
-
ssbio.protein.sequence.properties.thermostability.
calculate_oobatake_dS
(seq, temp)[source]¶ Get dS using Oobatake method in units cal/(mol·K).
Parameters: - seq (str, Seq, SeqRecord) – Amino acid sequence
- temp (float) – Temperature in degrees C
Returns: dS in units cal/(mol·K)
Return type: float
-
ssbio.protein.sequence.properties.thermostability.
get_dG_at_T
(seq, temp)[source]¶ Predict dG at temperature T, using best predictions from Dill or Oobatake methods.
Parameters: - seq (str, Seq, SeqRecord) – Amino acid sequence
- temp (float) – Temperature in degrees C
Returns: tuple containing:
- dG (float): Free energy of unfolding dG (cal/mol)
- keq (float): Equilibrium constant Keq
- method (str): Method used to calculate
Return type: tuple
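The equilibrium constant follows from dG through the standard relation Keq = exp(-dG / (R·T)). The sketch below applies it with dG in cal/mol and temperature in degrees C, matching the units documented above. The function name and the rounded gas constant are choices made for this sketch, not values taken from the ssbio source.

```python
import math

R_CAL = 1.987  # gas constant in cal/(mol*K), rounded


def keq_from_dG(dG_cal_per_mol, temp_celsius):
    """Equilibrium constant for unfolding: Keq = exp(-dG / (R * T))."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(-dG_cal_per_mol / (R_CAL * temp_kelvin))
```

Because dG here is the free energy of unfolding, a positive dG gives Keq below 1 (the folded state is favored) and a negative dG gives Keq above 1.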