ssbio: A Framework for Structural Systems Biology

Introduction

This Python package provides a collection of tools for people with questions in the realm of structural systems biology. The main goals of this package are to:

  1. Provide an easy way to map hundreds or thousands of genes to their encoded protein sequences and structures
  2. Directly link protein structures to genome-scale metabolic models
  3. Demonstrate fully-featured Python scientific analysis environments in Jupyter notebooks

Example questions you can (start to) answer with this package:

  • How can I determine the number of protein structures available for my list of genes?
  • What is the best, representative structure for my protein?
  • Where, in a metabolic network, do these proteins work?
  • Where do popular mutations show up on a protein?
  • How can I compare the structural features of entire proteomes?
  • How do structural properties correlate with my experimental datasets?
  • How can I improve the contents of my metabolic model with structural data?

Try it without installing


Note

Binder notebooks are still in beta, but they mostly work! Third-party programs are also preinstalled in the Binder notebooks except I-TASSER and TMHMM due to licensing restrictions.

Installation

First install NGLview using pip, then install ssbio:

pip install nglview
jupyter-nbextension enable nglview --py --sys-prefix
pip install ssbio

Updating

pip install ssbio --upgrade

Uninstalling

pip uninstall ssbio

Dependencies

See: Software for a list of external programs to install, along with the functionality that they add. Most of these additional programs are used to predict or calculate properties of proteins, and are only required if you desire to calculate the described properties.

Tutorials

Check out some Jupyter notebook tutorials for a single Protein and/or for many proteins in a GEM-PRO model. See a list of all Tutorials.

Citation

The manuscript for the ssbio package can be found and cited at [1].

[1] Mih N, Brunk E, Chen K, Catoiu E, Sastry A, Kavvas E, Monk JM, Zhang Z, Palsson BO. 2018. ssbio: A Python Framework for Structural Systems Biology. Bioinformatics. https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty077/4850940.

Table of Contents

Getting Started

Introduction

This section will give a quick outline of the design of ssbio and the scientific topics behind it. If you would like to read a pre-print version of the manuscript, please see [1].

ssbio general outline

Overview of the design and functionality of ssbio. Underlined fixed-width text in blue indicates added functionality to COBRApy for a genome-scale model loaded using ssbio. A) A simplified schematic showing the addition of a Protein to the core objects of COBRApy (fixed-width text in gray). A gene is directly associated with a protein, which can act as a monomeric enzyme or form an active complex with itself or other proteins (the asterisk denotes that methods for complexes are currently under development). B) Summary of properties and functions available for a protein sequence and structure. C) Uses of a GEM-PRO, from the bottom-up and the top-down. Once all protein sequences and structures are mapped to a genome-scale model, the resulting GEM-PRO has uses in multiple areas of study.

The basics

ssbio was developed with simplicity in mind: we wanted to make it as easy as possible to work with protein sequences and structures. We also wanted to avoid reinventing the wheel, so systems models are treated as a direct extension of COBRApy, and Biopython classes and modules are used wherever possible. To best explain the utility of the package, we will outline its features from two different viewpoints: as a systems biologist used to looking at the “big picture”, and as a structural biologist for whom the “devil is in the details”.

From a systems perspective

Systems biology is broadly concerned with the modeling and understanding of complex biological systems. What you may be taught in biochemistry 101 at this level will usually be reflected in a kind of interaction map, such as the metabolic map shown here:

Metabolic Metro Map

A “metabolic metro map”. By Dctrzl, changed work of Chakazul [CC BY-SA 4.0], via Wikimedia Commons 1

This map details the reactions needed to sustain the metabolic function of a cell. Typically, nodes will represent enzymes, and edges the metabolites they act upon (this is reversed in some graphical representations). There can be hundreds or thousands of reactions being modeled at once, in silico. These models can be stored in a single file, in a format such as the Systems Biology Markup Language (SBML). ssbio can load SBML models, and so far we have mainly used it in the further annotation of genome-scale metabolic models, or GEMs. The goal of GEMs is to provide a comprehensive annotation of all the metabolic enzymes encoded within a genome, along with generating a computable model (such as at a steady state, using constraint-based modeling methods, a.k.a. COBRA). That brings us to our first class: the GEMPRO object.

The objectives of the GEM-PRO pipeline (genome-scale models integrated with protein structures) have previously been detailed [2]. A GEM-PRO directly integrates structural information within a curated GEM, and streamlines identifier mapping, representative object selection, and property calculation for a set of proteins. The pipeline provided in ssbio functions with an input of a GEM (or any other kind of network model that can be loaded with COBRApy), but if this is unfamiliar to you, do not fret! A GEM-PRO can be built simply from a list of gene/protein IDs, and can simply be treated as a way to easily analyze a large number of proteins at once.

See GEMPRO for a detailed explanation of this object, Jupyter notebook tutorials of prospective use cases, and an explanation of select functions.

From a structures perspective

Structural biology is broadly concerned with elucidating and understanding the structure and function of proteins and other macromolecules. Ribbons, molecules, and chemical interactions are the name of the game here:

Serpin protein conformational change

A protein undergoing conformational changes. By Thomas Shafee (Own work) [CC BY 4.0], via Wikimedia Commons 2

An abundance of information is stored within structural data, and we believe that it should not be ignored even when looking at thousands of proteins at once within a systems model. To that end, the Protein object aims to integrate analyses on the level of a single protein’s sequence (and related sequences) along with its available structures.

A Protein is representative of its associated gene’s translated polypeptide chain (in other words, we are only considering monomers at this point). The object holds related amino acid sequences and structures, allowing for a single representative sequence and structure to be set from these. Multiple available structures such as those from PDB or homology models can be subjected to QC/QA based on set cutoffs such as sequence coverage and X-ray resolution. Proteins with no structures available can be prepared for homology modeling through the I-TASSER platform. Biopython representations of sequences (SeqRecord objects) and structures (Structure objects) are utilized to allow access to analysis functions available for their respective objects.

See Protein for a detailed explanation of this object, Jupyter notebook tutorials of prospective use cases, and an explanation of select functions.

Modules & submodules

ssbio is organized into the following submodules for defined purposes. Please see the Python API for function documentation.

  • ssbio.databases: modules that heavily depend on the Bioservices package [3] and custom code to enable pulling information from web services such as UniProt, KEGG, and the PDB, and to directly convert that information into sequence and structure objects to load into a protein.
  • ssbio.protein.sequence: modules which allow a user to execute and parse sequence-based utilities such as sequence alignment algorithms or structural feature predictors.
  • ssbio.protein.structure: modules that mirror the sequence module but instead work with structural information to calculate properties, and also to streamline the generation of homology models as well as to prepare structures for molecular modeling tools such as docking or molecular dynamics.
  • ssbio.pipeline.gempro: a pipeline that simplifies the execution of these tools per protein while placing them into the context of a genome-scale model.

References

[1] Mih N, Brunk E, Chen K, Catoiu E, Sastry A, Kavvas E, et al. ssbio: A Python Framework for Structural Systems Biology. bioRxiv. 2017. p. 165506. doi:10.1101/165506
[2] Brunk E, Mih N, Monk J, Zhang Z, O’Brien EJ, Bliven SE, et al. Systems biology of the structural proteome. BMC Syst Biol. 2016;10: 26. doi:10.1186/s12918-016-0271-6
[3] Cokelaer T, Pultz D, Harder LM, Serra-Musach J, Saez-Rodriguez J. BioServices: a common Python package to access biological Web Services programmatically. Bioinformatics. 2013;29(24): 3241–3242. doi:10.1093/bioinformatics/btt547

The GEM-PRO Pipeline

Introduction

The GEM-PRO pipeline is focused on annotating genome-scale models with protein structure information. Any SBML model can be used as input to the pipeline, although a model is not required. Here are the possible starting points for using the pipeline:

  1. A model in SBML (.sbml, .xml) or MATLAB (.mat) format
  2. A list of gene IDs (['b0001', 'b0002', ...])
  3. A dictionary of gene IDs and their sequences ({'b0001':'MSAVEVEEAP..', 'b0002':'AERAPLS', ...})

A GEM-PRO object can be thought of, at a high level, as simply an annotation project. Creating a new project with any of the above starting points will create a new folder into which protein sequences and structures will be downloaded.

Tutorials

GEM-PRO - Calculating Protein Properties

This notebook gives an example of how to calculate protein properties for a list of proteins. The main features demonstrated are:

  1. Information retrieval from UniProt and linking residue numbering sites to structure
  2. Calculating or predicting global protein sequence and structure properties
  3. Calculating or predicting local protein sequence and structure properties
Input: List of gene IDs
Output: Representative protein structures and properties associated with them
Imports
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
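
To see what choosing a level does in practice, here is a minimal sketch using only the standard logging module. The logger name demo.pipeline, the helper capture_messages, and the messages themselves are made up for illustration and are not part of ssbio.

```python
import io
import logging


def capture_messages(level):
    """Return the messages that get through when the logger is set to `level`."""
    stream = io.StringIO()
    demo_logger = logging.getLogger("demo.pipeline")  # hypothetical logger name
    demo_logger.handlers = [logging.StreamHandler(stream)]
    demo_logger.propagate = False  # keep the demo output away from the root logger
    demo_logger.setLevel(level)

    # One message per level, from most to least verbose
    demo_logger.debug("per-residue details")
    demo_logger.info("2/2 genes mapped")
    demo_logger.warning("empty dataframe")
    demo_logger.error("download failed")
    return stream.getvalue().splitlines()


info_lines = capture_messages(logging.INFO)
warning_lines = capture_messages(logging.WARNING)
print(info_lines)     # DEBUG message is filtered out
print(warning_lines)  # DEBUG and INFO messages are filtered out
```

Each level lets through its own messages and everything more severe, which is why DEBUG produces the most output.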

Initialization

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROJECT will be created
  • PROJECT
    • Your project name
  • LIST_OF_GENES
    • Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
Note: Methods for protein complexes and metabolites are still in development.
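
The layout above can be mimicked with a few lines of pathlib. This is only a sketch of the folder structure, not the code ssbio runs when it creates a project; make_project_tree and demo_project are hypothetical names.

```python
import tempfile
from pathlib import Path


def make_project_tree(root_dir, project, gene_ids):
    """Create folders resembling the GEM-PRO layout (illustrative only)."""
    base = Path(root_dir) / project
    for name in ("data", "model", "genes", "reactions", "metabolites"):
        (base / name).mkdir(parents=True, exist_ok=True)
    # Each gene gets its own protein/sequences and protein/structures folders
    for gene_id in gene_ids:
        for sub in ("sequences", "structures"):
            (base / "genes" / gene_id / "protein" / sub).mkdir(parents=True, exist_ok=True)
    return base


with tempfile.TemporaryDirectory() as tmp:
    base = make_project_tree(tmp, "demo_project", ["b1276", "b0118"])
    created = sorted(p.relative_to(base).as_posix() for p in base.rglob("sequences"))
    top_level = sorted(p.name for p in base.iterdir())

print(top_level)
print(created)
```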
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'ssbio_protein_properties'
LIST_OF_GENES = ['b1276', 'b0118']
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type='pdb')
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: /tmp/ssbio_protein_properties: GEM-PRO project location
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2: number of genes
Mapping gene ID –> sequence

First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

Note: You only need to map gene IDs using one service. However, you can run both if some genes don’t map in one service and do map in another!

GEMPRO.uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)

Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.

Parameters:
  • model_gene_source (str) –

    the database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping

    Common model gene sources are:

    • Ensembl Genomes - ENSEMBLGENOME_ID (e.g. E. coli b-numbers)
    • Entrez Gene (GeneID) - P_ENTREZGENEID
    • RefSeq Protein - P_REFSEQ_AC
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
In [8]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2018-02-05 16:52] [root] INFO: getUserAgent: Begin
[2018-02-05 16:52] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 16:52] [root] INFO: getUserAgent: End

[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2/2: number of genes mapped to UniProt
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Missing UniProt mapping:  []
Out[8]:
gene   uniprot  reviewed  gene_name  kegg                  refseq                      pdbs  pfam                     description            entry_date  entry_version  seq_date    seq_version  sequence_file  metadata_file
b0118  P36683   False     acnB       ecj:JW0114;eco:b0118  NP_414660.1;WP_001307570.1  1L5J  PF00330;PF06434;PF11791  Aconitate hydratase B  2018-01-31  165            1997-11-01  3            P36683.fasta   P36683.xml
b1276  P25516   False     acnA       ecj:JW1268;eco:b1276  NP_415792.1;WP_000099535.1  NaN   PF00330;PF00694          Aconitate hydratase A  2018-01-31  153            2008-01-15  3            P25516.fasta   P25516.xml
GEMPRO.set_representative_sequence(force_rerun=False)

Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.

Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.

Parameters: force_rerun (bool) – Set to True to recheck stored sequences
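
The precedence rules described above can be written out as a small function. This is a sketch of the stated rules only, not the ssbio implementation; pick_representative and the dictionary layout (an 'id' plus a 'pdbs' list) are assumptions made for illustration.

```python
def pick_representative(manual=None, uniprot=None, kegg=None):
    """Choose one sequence record following the stated precedence.

    Each argument is None or a dict with 'id' and 'pdbs' keys
    (a hypothetical layout chosen for this example).
    """
    # A manually set sequence overrides every mapping
    if manual is not None:
        return manual
    if uniprot is not None and kegg is not None:
        # KEGG wins only when it has PDBs associated and UniProt does not
        if kegg["pdbs"] and not uniprot["pdbs"]:
            return kegg
        return uniprot
    # Only one mapping (or none) is available
    return uniprot if uniprot is not None else kegg


uniprot_rec = {"id": "P25516", "pdbs": []}
kegg_with_pdb = {"id": "eco:b1276", "pdbs": ["xxxx"]}  # placeholder PDB ID
kegg_no_pdb = {"id": "eco:b1276", "pdbs": []}

print(pick_representative(uniprot=uniprot_rec, kegg=kegg_no_pdb)["id"])
print(pick_representative(uniprot=uniprot_rec, kegg=kegg_with_pdb)["id"])
```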
In [9]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()

[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative sequence
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.
Missing a representative sequence:  []
Out[9]:
gene   uniprot  kegg                  pdbs  sequence_file  metadata_file
b0118  P36683   ecj:JW0114;eco:b0118  1L5J  P36683.fasta   P36683.xml
b1276  P25516   ecj:JW1268;eco:b1276  NaN   P25516.fasta   P25516.xml
Mapping representative sequence –> structure

These are the ways to map sequence to structure:

  1. Use the UniProt ID and their automatic mappings to the PDB
  2. BLAST the sequence to the PDB
  3. Make homology models
  4. Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.

GEMPRO.map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)

Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.

The “Best structures” API (https://www.ebi.ac.uk/pdbe/api/doc/sifts.html) returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, where coverage is equal, by resolution.

Parameters:
  • seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
  • outdir (str) – Output directory to cache JSON results of search
  • force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns:

A rank-ordered list of PDBProp objects that map to the UniProt ID

Return type:

list
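
The ordering described above (highest coverage first, ties broken by the best resolution) can be sketched as a plain sort. The records, their field names, and the helper rank_structures are hypothetical and do not reproduce the PDBe response format.

```python
# Hypothetical records; the IDs and values are placeholders
records = [
    {"pdb_id": "pdb1", "chain": "A", "coverage": 0.80, "resolution": 1.5},
    {"pdb_id": "pdb2", "chain": "A", "coverage": 1.00, "resolution": 2.4},
    {"pdb_id": "pdb3", "chain": "B", "coverage": 1.00, "resolution": 1.9},
]


def rank_structures(records):
    """Sort by coverage (descending), then resolution (ascending), and add a rank."""
    ordered = sorted(records, key=lambda r: (-r["coverage"], r["resolution"]))
    return [dict(r, rank=i) for i, r in enumerate(ordered, start=1)]


ranked = rank_structures(records)
print([(r["pdb_id"], r["rank"]) for r in ranked])
```

A lower resolution value means a better-resolved structure, so it sorts ahead of a higher value when coverage is equal.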

In [10]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 16:52] [root] INFO: getUserAgent: Begin
[2018-02-05 16:52] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 16:52] [root] INFO: getUserAgent: End

[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 1/2: number of genes with at least one experimental structure
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
Out[10]:
gene   pdb_id  pdb_chain_id  uniprot  experimental_method  resolution  coverage  start  end  unp_start  unp_end  rank
b0118  1l5j    A             P36683   X-ray diffraction    2.4         1         1      865  1          865      1
b0118  1l5j    B             P36683   X-ray diffraction    2.4         1         1      865  1          865      2
GEMPRO.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)

BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).

Parameters:
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
  • display_link (bool, optional) – Set to True if links to the HTML results should be displayed
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
In [11]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.7, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)

[2018-02-05 16:53] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 16:53] [ssbio.pipeline.gempro] INFO: 0: number of genes with additional structures added from BLAST
[2018-02-05 16:53] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[11]:

Below, we are mapping to previously generated homology models for E. coli. If you are running this as a tutorial, they won’t exist on your computer, so you can skip these steps.

GEMPRO.get_manual_homology_models(input_dict, outdir=None, clean=True, force_rerun=False)

Copy homology models to the GEM-PRO project.

Requires an input of a dictionary formatted like so:

{
    model_gene: {
                    homology_model_id1: {
                                            'model_file': '/path/to/homology/model.pdb',
                                            'file_type': 'pdb',
                                            'additional_info': info_value
                                        },
                    homology_model_id2: {
                                            'model_file': '/path/to/homology/model.pdb',
                                            'file_type': 'pdb'
                                        }
                }
}
Parameters:
  • input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • clean (bool) – If homology files should be cleaned and saved as a new PDB file
  • force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
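
As a sketch of how a dictionary in this format might be assembled, the snippet below groups several models under the same gene with dict.setdefault. The rows, the column names (m_gene, model_id, model_file), and the model_dir path are placeholders invented for the example.

```python
import os.path as op

# Placeholder rows, e.g. as read from a CSV file with one line per homology model
rows = [
    {"m_gene": "b0001", "model_id": "MODEL_A", "model_file": "model_a.pdb"},
    {"m_gene": "b0001", "model_id": "MODEL_B", "model_file": "model_b.pdb"},
    {"m_gene": "b0002", "model_id": "MODEL_C", "model_file": "model_c.pdb"},
]
model_dir = "/path/to/homology/models"  # placeholder directory

input_dict = {}
for row in rows:
    # setdefault keeps earlier models for the same gene instead of overwriting them
    input_dict.setdefault(row["m_gene"], {})[row["model_id"]] = {
        "model_file": op.join(model_dir, row["model_file"]),
        "file_type": "pdb",
    }

print(sorted(input_dict))
print(sorted(input_dict["b0001"]))
```

Assigning a fresh dictionary to each gene key would instead keep only the last model listed for that gene.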
In [15]:
import pandas as pd
import os.path as op
In [16]:
# Creating manual mapping dictionary for ECOLI I-TASSER models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/zhang/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/zhang_data/160804-ZHANG_INFO.csv')
tmp = homology_models_df[['zhang_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]

homology_model_dict = {}

for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['zhang_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                        'file_type':'pdb'}}

my_gempro.get_manual_homology_models(homology_model_dict)

[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.
In [17]:
# Creating manual mapping dictionary for ECOLI SUNPRO models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/sunpro/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/sunpro_data/160609-SUNPRO_INFO.csv')
tmp = homology_models_df[['sunpro_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]

homology_model_dict = {}

for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['sunpro_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                         'file_type':'pdb'}}

my_gempro.get_manual_homology_models(homology_model_dict)

[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.
Downloading and ranking structures
GEMPRO.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)

Download ALL mapped experimental structures to each protein’s structures directory.

Parameters:
  • outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
  • pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
  • force_rerun (bool) – If files should be re-downloaded if they already exist
Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [18]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)

[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Saved 1 structures total
Out[18]:
gene   pdb_id  pdb_title                                  description                          experimental_method  mapped_chains  resolution  chemicals  taxonomy_name     structure_file
b0118  1l5j    CRYSTAL STRUCTURE OF E. COLI ACONITASE B.  Aconitate hydratase 2 (E.C.4.2.1.3)  X-ray diffraction    A;B            2.4         F3S;TRA    Escherichia coli  1l5j.pdb
GEMPRO.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)

Set the representative structure for each protein from the structures in its structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

  • Homology model(s)
  • Ranked PDBs
  • BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
  • struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
  • clean (bool) – If structures should be cleaned
  • force_rerun (bool) – If sequence to structure alignment should be rerun

Todo

  • Remedy large structure representative setting
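
The allow_missing_on_termini example given in the parameter list (0.1 of a 100 AA sequence leaves residues 5 to 95 to be checked) can be reproduced with simple arithmetic. This is a sketch of that calculation under the assumption that the ignored fraction is split evenly between the two termini; checked_window is a hypothetical helper, and the exact boundary handling in ssbio may differ.

```python
def checked_window(seq_length, allow_missing_on_termini):
    """Return (start, end) of the region checked for modifications.

    Assumes the ignored fraction is split evenly between both termini.
    """
    trim = int(seq_length * allow_missing_on_termini / 2)
    return trim, seq_length - trim


print(checked_window(100, 0.1))  # the example from the parameter description
print(checked_window(865, 0.2))  # default cutoff applied to an 865-residue sequence
```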
In [19]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[19]:
gene   id               is_experimental  file_type  structure_file
b0118  REP-1l5j         True             pdb        1l5j-A_clean.pdb
b1276  REP-ACON1_ECOLI  False            pdb        ACON1_ECOLI_model1_clean-X_clean.pdb

Computing and storing protein properties
GEMPRO.get_sequence_properties(representatives_only=True)

Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of all protein sequences. Results are stored in the protein’s respective SeqProp objects at .annotations.

Parameters: representatives_only (bool) – If analysis should only be run on the representative sequences
In [20]:
# Requires EMBOSS "pepstats" program
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
# Install using:
# sudo apt-get install emboss
my_gempro.get_sequence_properties()

GEMPRO.get_scratch_predictions(path_to_scratch, results_dir, scratch_basename='scratch', num_cores=1, exposed_buried_cutoff=25, custom_gene_mapping=None)

Run and parse SCRATCH results to predict secondary structure and solvent accessibility. Annotations are stored in the protein’s representative sequence at:

  • .annotations
  • .letter_annotations
Parameters:
  • path_to_scratch (str) – Path to SCRATCH executable
  • results_dir (str) – Path to SCRATCH results folder, which will have the files (scratch.ss, scratch.ss8, scratch.acc, scratch.acc20)
  • scratch_basename (str) – Basename of the SCRATCH results (‘scratch’ is default)
  • num_cores (int) – Number of cores to use to parallelize SCRATCH run
  • exposed_buried_cutoff (int) – Cutoff of exposed/buried for the acc20 predictions
  • custom_gene_mapping (dict) – Default parsing of SCRATCH output files is to look for the model gene IDs. If your output files contain IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
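
To illustrate what an exposed/buried cutoff does, the sketch below turns per-residue relative accessibility values (in percent) into a two-letter labeling. The helper label_exposure, the 'e'/'b' labels, and the choice of >= for the comparison are assumptions made for this example; they are not taken from ssbio or SCRATCH.

```python
def label_exposure(accessibility_percent, exposed_buried_cutoff=25):
    """Label each residue 'e' (exposed) or 'b' (buried).

    Assumption: a residue at or above the cutoff counts as exposed.
    """
    return "".join(
        "e" if value >= exposed_buried_cutoff else "b"
        for value in accessibility_percent
    )


# Hypothetical predictions for a six-residue stretch
predicted = [0, 10, 25, 40, 95, 20]
labels = label_exposure(predicted)
print(labels)
print(label_exposure(predicted, exposed_buried_cutoff=50))
```

Raising the cutoff classifies more residues as buried, so the value chosen changes any downstream count of surface residues.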
In [ ]:
# Requires a SCRATCH installation; replace path_to_scratch with the path to your own script
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_scratch_predictions(path_to_scratch='scratch',
                                  results_dir=my_gempro.data_dir,
                                  num_cores=4)
GEMPRO.find_disulfide_bridges(representatives_only=True)

Run Biopython’s disulfide bridge finder and store found bridges.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']

Parameters: representatives_only (bool) – If analysis should only be run on the representative structure
In [22]:
my_gempro.find_disulfide_bridges(representatives_only=False)

GEMPRO.get_dssp_annotations(representatives_only=True, force_rerun=False)

Run DSSP on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']

Parameters:
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
In [23]:
# Requires DSSP installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_dssp_annotations()
A Jupyter Widget

GEMPRO.get_msms_annotations(representatives_only=True, force_rerun=False)[source]

Run MSMS on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-msms']

Parameters:
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
In [24]:
# Requires MSMS installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_msms_annotations()
A Jupyter Widget


Additional annotations
Loading feature files to the representative sequence

“Features” are currently loaded directly from UniProt, but if another feature file is available for each protein, it can be loaded manually.

In [25]:
# for g in my_gempro.genes_with_a_representative_sequence:
#     g.protein.representative_sequence.feature_path = '/path/to/new/feature/file.gff'
Adding more properties

Additional global or local properties can be loaded after loading the saved GEM-PRO.

Make sure to add 'seq_hydrophobicity-kd' to the list of columns to be returned later on!

Example with hydrophobicity
In [26]:
# Kyte-Doolittle scale for hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
       'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
       'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
       'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }
In [27]:
# Use Biopython to calculate hydrophobicity using a set sliding window length
from Bio.SeqUtils.ProtParam import ProteinAnalysis

window = 7

for g in my_gempro.genes_with_a_representative_sequence:
    # Create a ProteinAnalysis object -- see http://biopython.org/wiki/ProtParam
    my_seq = g.protein.representative_sequence.seq_str
    analysed_seq = ProteinAnalysis(my_seq)

    # Calculate scale
    hydrophobicity = analysed_seq.protein_scale(param_dict=kd, window=window)

    # Correct list length by prepending and appending "inf" (result needs to be same length as sequence)
    for i in range(window//2):
        hydrophobicity.insert(0, float("Inf"))
        hydrophobicity.append(float("Inf"))

    # Add new annotation to the representative sequence's "letter_annotations" dictionary
    g.protein.representative_sequence.letter_annotations['hydrophobicity-kd'] = hydrophobicity
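
The length bookkeeping in the cell above can be checked without Biopython. The sketch below is a plain-Python stand-in, not Biopython's implementation: sliding_window_mean and pad_to_sequence_length are hypothetical helpers, the mean is unweighted, and the short test sequence is invented. It shows that a window of 7 yields len(seq) - 6 scores, and that padding window // 2 values on each side restores the sequence length.

```python
def sliding_window_mean(seq, scale, window):
    """Unweighted mean of the scale values over every full window of the sequence."""
    values = [scale[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pad_to_sequence_length(scores, window, fill=float('inf')):
    """Pad both ends with `fill` so that each score lines up with a residue."""
    half = window // 2
    return [fill] * half + list(scores) + [fill] * half


# A few values taken from the Kyte-Doolittle table above
kd_subset = {'A': 1.8, 'R': -4.5, 'L': 3.8, 'K': -3.9}

seq = 'ALKRALKRA'  # invented 9-residue sequence
window = 7

scores = sliding_window_mean(seq, kd_subset, window)
padded = pad_to_sequence_length(scores, window)
print(len(seq), len(scores), len(padded))
```

Note that this padding only matches the sequence length for odd window sizes. With an even window, window // 2 values on each side add up to one value too many.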

Global protein properties

Properties of the entire protein sequence/structure are stored in:

  1. The representative_sequence annotations field
  2. The representative_structure’s representative chain SeqRecord

These properties describe aspects of the entire protein, such as its molecular weight, the percentage of amino acids in a particular secondary structure, etc.

In [28]:
# Printing all global protein properties

from pprint import pprint

# Only looking at 2 genes for now, remove [:2] to gather properties for all
for g in my_gempro.genes_with_a_representative_sequence[:2]:
    repseq = g.protein.representative_sequence
    repstruct = g.protein.representative_structure
    repchain = g.protein.representative_chain

    print('Gene: {}'.format(g.id))
    print('Number of structures: {}'.format(g.protein.num_structures))
    print('Representative sequence: {}'.format(repseq.id))
    print('Representative structure: {}'.format(repstruct.id))

    print('----------------------------------------------------------------')
    print('Global properties of the representative sequence:')
    pprint(repseq.annotations)

    print('----------------------------------------------------------------')
    print('Global properties of the representative structure:')
    pprint(repstruct.chains.get_by_id(repchain).seq_record.annotations)

    print('****************************************************************')
    print('****************************************************************')
    print('****************************************************************')

Gene: b1276
Number of structures: 3
Representative sequence: P25516
Representative structure: REP-ACON1_ECOLI
----------------------------------------------------------------
Global properties of the representative sequence:
{'amino_acids_percent-biop': {'A': 0.08641975308641975,
                              'C': 0.007856341189674524,
                              'D': 0.06397306397306397,
                              'E': 0.06172839506172839,
                              'F': 0.025813692480359147,
                              'G': 0.08754208754208755,
                              'H': 0.020202020202020204,
                              'I': 0.04826038159371493,
                              'K': 0.04826038159371493,
                              'L': 0.09427609427609428,
                              'M': 0.028058361391694726,
                              'N': 0.037037037037037035,
                              'P': 0.05611672278338945,
                              'Q': 0.030303030303030304,
                              'R': 0.05723905723905724,
                              'S': 0.05723905723905724,
                              'T': 0.06060606060606061,
                              'V': 0.0819304152637486,
                              'W': 0.014590347923681257,
                              'Y': 0.03254769921436588},
 'aromaticity-biop': 0.07295173961840629,
 'instability_index-biop': 36.28239057239071,
 'isoelectric_point-biop': 5.59344482421875,
 'molecular_weight-biop': 97676.06830000057,
 'monoisotopic-biop': False,
 'percent_acidic-pepstats': 0.1257,
 'percent_aliphatic-pepstats': 0.31089,
 'percent_aromatic-pepstats': 0.09315,
 'percent_basic-pepstats': 0.1257,
 'percent_charged-pepstats': 0.2514,
 'percent_helix_naive-biop': 0.29741863075196406,
 'percent_non-polar-pepstats': 0.56341,
 'percent_polar-pepstats': 0.43659,
 'percent_small-pepstats': 0.53872,
 'percent_strand_naive-biop': 0.27048260381593714,
 'percent_tiny-pepstats': 0.29966000000000004,
 'percent_turn_naive-biop': 0.2379349046015713}
----------------------------------------------------------------
Global properties of the representative structure:
{'percent_B-dssp': 0.010101010101010102,
 'percent_C-dssp': 0.2222222222222222,
 'percent_E-dssp': 0.1739618406285073,
 'percent_G-dssp': 0.03928170594837262,
 'percent_H-dssp': 0.345679012345679,
 'percent_I-dssp': 0.005611672278338945,
 'percent_S-dssp': 0.09427609427609428,
 'percent_T-dssp': 0.10886644219977554}
****************************************************************
****************************************************************
****************************************************************
Gene: b0118
Number of structures: 4
Representative sequence: P36683
Representative structure: REP-1l5j
----------------------------------------------------------------
Global properties of the representative sequence:
{'amino_acids_percent-biop': {'A': 0.11213872832369942,
                              'C': 0.011560693641618497,
                              'D': 0.06358381502890173,
                              'E': 0.06589595375722543,
                              'F': 0.03352601156069364,
                              'G': 0.08786127167630058,
                              'H': 0.017341040462427744,
                              'I': 0.05433526011560694,
                              'K': 0.056647398843930635,
                              'L': 0.10173410404624278,
                              'M': 0.026589595375722544,
                              'N': 0.035838150289017344,
                              'P': 0.06242774566473988,
                              'Q': 0.028901734104046242,
                              'R': 0.04508670520231214,
                              'S': 0.04161849710982659,
                              'T': 0.05433526011560694,
                              'V': 0.06705202312138728,
                              'W': 0.009248554913294798,
                              'Y': 0.024277456647398842},
 'aromaticity-biop': 0.06705202312138728,
 'instability_index-biop': 32.79631213872841,
 'isoelectric_point-biop': 5.23931884765625,
 'molecular_weight-biop': 93497.01500000065,
 'monoisotopic-biop': False,
 'percent_acidic-pepstats': 0.12948,
 'percent_aliphatic-pepstats': 0.33526000000000006,
 'percent_aromatic-pepstats': 0.08439,
 'percent_basic-pepstats': 0.11907999999999999,
 'percent_charged-pepstats': 0.24855,
 'percent_helix_naive-biop': 0.29017341040462424,
 'percent_non-polar-pepstats': 0.59075,
 'percent_polar-pepstats': 0.40924999999999995,
 'percent_small-pepstats': 0.53642,
 'percent_strand_naive-biop': 0.3063583815028902,
 'percent_tiny-pepstats': 0.30751,
 'percent_turn_naive-biop': 0.22774566473988442}
----------------------------------------------------------------
Global properties of the representative structure:
{'percent_B-dssp': 0.016241299303944315,
 'percent_C-dssp': 0.20765661252900233,
 'percent_E-dssp': 0.14037122969837587,
 'percent_G-dssp': 0.03480278422273782,
 'percent_H-dssp': 0.3805104408352668,
 'percent_I-dssp': 0.0,
 'percent_S-dssp': 0.08236658932714618,
 'percent_T-dssp': 0.13805104408352667}
****************************************************************
****************************************************************
****************************************************************
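
The annotation keys in the output above appear to follow a <property>-<tool> pattern (for example -biop, -pepstats, -dssp). Assuming that pattern holds, a small helper can group the properties by the tool that produced them. group_by_tool is a hypothetical helper, not an ssbio function, and the dictionary below copies only a few values from the output.

```python
from collections import defaultdict


def group_by_tool(annotations):
    """Group annotation entries by the suffix after the last '-' in each key."""
    grouped = defaultdict(dict)
    for key, value in annotations.items():
        prop, sep, tool = key.rpartition('-')
        if not sep:
            # Key without a '-' suffix: keep it under a separate label
            prop, tool = key, 'unknown'
        grouped[tool][prop] = value
    return dict(grouped)


example = {
    'aromaticity-biop': 0.07295173961840629,
    'percent_non-polar-pepstats': 0.56341,
    'percent_polar-pepstats': 0.43659,
    'percent_H-dssp': 0.345679012345679,
}

by_tool = group_by_tool(example)
print(sorted(by_tool))
```

Splitting on the last hyphen keeps property names such as percent_non-polar intact.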

Local protein properties

Properties of specific residues are stored in:

  1. The representative_sequence’s letter_annotations attribute
  2. The representative_structure’s representative chain SeqRecord

Specific sites, like metal or metabolite binding sites, can be found in the representative_sequence’s features attribute. This information is retrieved from UniProt. The examples below extract features for the metal binding sites.

The properties related to those sites can be retrieved using the function get_residue_annotations.

UniProt contains more information than just “sites”

In [29]:
# Looking at all features
for g in my_gempro.genes_with_a_representative_sequence[:2]:
    g.id

    # UniProt features
    [x for x in g.protein.representative_sequence.features]

    # Catalytic site atlas features
    for s in g.protein.structures:
        if s.structure_file:
            for c in s.mapped_chains:
                if s.chains.get_by_id(c).seq_record:
                    if s.chains.get_by_id(c).seq_record.features:
                        [x for x in s.chains.get_by_id(c).seq_record.features]
Out[29]:
'b1276'
Out[29]:
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1)), type='initiator methionine'),
 SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(891)), type='chain', id='PRO_0000076661'),
 SeqFeature(FeatureLocation(ExactPosition(434), ExactPosition(435)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(500), ExactPosition(501)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(503), ExactPosition(504)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(521), ExactPosition(522)), type='sequence conflict')]
Out[29]:
'b0118'
Out[29]:
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(865)), type='chain', id='PRO_0000076675'),
 SeqFeature(FeatureLocation(ExactPosition(243), ExactPosition(246)), type='region of interest'),
 SeqFeature(FeatureLocation(ExactPosition(413), ExactPosition(416)), type='region of interest'),
 SeqFeature(FeatureLocation(ExactPosition(709), ExactPosition(710)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(771), ExactPosition(772)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(190), ExactPosition(191)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(497), ExactPosition(498)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(790), ExactPosition(791)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(796)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='mutagenesis site'),
 SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(14)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(35)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(41), ExactPosition(51)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(58), ExactPosition(72)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(82), ExactPosition(90)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(98), ExactPosition(104)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(104), ExactPosition(107)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(108), ExactPosition(111)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(111), ExactPosition(120)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(127), ExactPosition(137)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(140), ExactPosition(151)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(153), ExactPosition(157)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(165), ExactPosition(178)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(178), ExactPosition(182)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(184), ExactPosition(190)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(193), ExactPosition(197)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(197), ExactPosition(200)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(213), ExactPosition(216)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(219), ExactPosition(227)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(232), ExactPosition(243)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(247), ExactPosition(257)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(257), ExactPosition(261)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(263), ExactPosition(269)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(271), ExactPosition(278)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(279), ExactPosition(288)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(291), ExactPosition(295)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(305), ExactPosition(310)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(310), ExactPosition(314)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(314), ExactPosition(318)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(318), ExactPosition(321)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(323), ExactPosition(327)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(333), ExactPosition(341)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(343), ExactPosition(360)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(383), ExactPosition(391)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(391), ExactPosition(394)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(408), ExactPosition(413)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(414), ExactPosition(417)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(417), ExactPosition(427)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(437), ExactPosition(440)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(443), ExactPosition(448)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(450), ExactPosition(465)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(465), ExactPosition(468)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(478), ExactPosition(483)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(483), ExactPosition(486)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(491), ExactPosition(497)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(502), ExactPosition(507)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(511), ExactPosition(521)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(530), ExactPosition(538)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(545), ExactPosition(559)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(572), ExactPosition(576)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(576), ExactPosition(583)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(588), ExactPosition(597)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(597), ExactPosition(600)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(600), ExactPosition(603)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(604), ExactPosition(609)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(612), ExactPosition(632)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(637), ExactPosition(653)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(666), ExactPosition(673)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(673), ExactPosition(676)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(680), ExactPosition(683)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(690), ExactPosition(693)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(693), ExactPosition(696)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(696), ExactPosition(699)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(703), ExactPosition(707)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(713), ExactPosition(726)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(731), ExactPosition(737)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(741), ExactPosition(750)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(752), ExactPosition(760)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(769), ExactPosition(772)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(774), ExactPosition(777)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(783), ExactPosition(790)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(798)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(801), ExactPosition(805)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(807), ExactPosition(817)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(822), ExactPosition(834)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(836), ExactPosition(840)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(845), ExactPosition(848)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(849), ExactPosition(857)), type='helix')]
Protein.get_residue_annotations(seq_resnum, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Get all residue-level annotations stored in the SeqProp letter_annotations field for a given residue number.

Uses the representative sequence, structure, and chain ID stored by default. If other properties from other structures are desired, input the proper IDs. An alignment for the given sequence to the structure must be present in the sequence_alignments list.

Parameters:
  • seq_resnum (int) – Residue number in the sequence
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – ID of the structure’s chain to get annotation from
  • use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
Returns:

All available letter_annotations for this residue number

Return type:

dict

In [30]:
metal_info = []

for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
            res_info['gene_id'] = g.id
            res_info['seq_id'] = g.protein.representative_sequence.id
            res_info['struct_id'] = g.protein.representative_structure.id
            res_info['chain_id'] = g.protein.representative_chain
            metal_info.append(res_info)

cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
        'seq_residue', 'seq_resnum', 'struct_residue','struct_resnum',
        'seq_SS-sspro','seq_SS-sspro8','seq_RSA-accpro','seq_RSA-accpro20',
        'struct_SS-dssp','struct_RSA-dssp', 'struct_ASA-dssp',
        'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']

pd.DataFrame.from_records(metal_info, columns=cols).set_index(['gene_id', 'seq_id', 'struct_id', 'chain_id', 'seq_resnum'])
Out[30]:
gene_id  seq_id  struct_id        chain_id  seq_resnum  seq_residue  struct_residue  struct_resnum  seq_SS-sspro  seq_SS-sspro8  seq_RSA-accpro  seq_RSA-accpro20  struct_SS-dssp  struct_RSA-dssp  struct_ASA-dssp  struct_PHI-dssp  struct_PSI-dssp  struct_CA_DEPTH-msms  struct_RES_DEPTH-msms
b1276    P25516  REP-ACON1_ECOLI  X         435         C            C               435            NaN           NaN            NaN             NaN               H               0.059259         8.0              -61.1            -26.6            2.656722              2.813536
b1276    P25516  REP-ACON1_ECOLI  X         501         C            C               501            NaN           NaN            NaN             NaN               S               0.088889         12.0             -61.0            -50.0            1.999713              2.409119
b1276    P25516  REP-ACON1_ECOLI  X         504         C            C               504            NaN           NaN            NaN             NaN               G               0.259259         35.0             -56.0            -45.6            1.999634              1.961484
b0118    P36683  REP-1l5j         A         710         C            C               710            NaN           NaN            NaN             NaN               T               0.118519         16.0             -67.1            -7.2             10.148960             10.009109
b0118    P36683  REP-1l5j         A         769         C            C               769            NaN           NaN            NaN             NaN               -               0.088889         12.0             -67.8            -28.3            8.296585              8.049832
b0118    P36683  REP-1l5j         A         772         C            C               772            NaN           NaN            NaN             NaN               G               0.081481         11.0             -50.2            -38.0            8.282292              8.239369
Column definitions
  • gene_id: Gene ID used in GEM-PRO project
  • seq_id: Representative protein sequence ID
  • struct_id: Representative protein structure ID, with REP- prepended to it. Four-character structure IDs are experimental structures from the PDB; others are homology models
  • chain_id: Representative chain ID in the representative structure
  • seq_resnum: Residue number of the amino acid in the representative sequence
  • site_name: Name of the feature as defined in UniProt
  • seq_residue: Amino acid in the representative sequence at the residue number
  • struct_residue: Amino acid in the representative structure at the residue number
  • struct_resnum: Residue number of the amino acid in the representative structure
  • seq_SS-sspro: Predicted secondary structure, 3 definitions (from the SCRATCH program)
  • seq_SS-sspro8: Predicted secondary structure, 8 definitions (SCRATCH)
  • seq_RSA-accpro: Predicted exposed (e) or buried (-) residue (SCRATCH)
  • seq_RSA-accpro20: Predicted exposed/buried, 0 to 100 scale (SCRATCH)
  • struct_SS-dssp: Secondary structure (DSSP program)
  • struct_RSA-dssp: Relative solvent accessibility (DSSP)
  • struct_ASA-dssp: Solvent accessibility, absolute value (DSSP)
  • struct_PHI-dssp: Phi angle measure (DSSP)
  • struct_PSI-dssp: Psi angle measure (DSSP)
  • struct_RES_DEPTH-msms: Calculated residue depth averaged for all atoms in the residue (MSMS program)
  • struct_CA_DEPTH-msms: Calculated residue depth for the carbon alpha atom (MSMS)
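
The metal-binding features printed earlier are stored as Biopython locations, which are 0-based and end-exclusive (like Python slices). This is why the cell above passes f.location.end as the 1-based residue number for single-residue sites. The sketch below illustrates the conversion with plain tuples; residue_numbers is a hypothetical helper, and the spans are copied from the b1276 and b0118 feature output.

```python
def residue_numbers(start, end):
    """1-based residue numbers covered by a 0-based, end-exclusive span."""
    return list(range(start + 1, end + 1))


# (start, end) pairs of the metal ion-binding sites listed for b1276
metal_sites_b1276 = [(434, 435), (500, 501), (503, 504)]
site_resnums = [residue_numbers(start, end) for start, end in metal_sites_b1276]
print(site_resnums)

# A multi-residue feature ("region of interest" in b0118) spans several residues
region_resnums = residue_numbers(243, 246)
print(region_resnums)
```

For features that span more than one residue, using only the end position would select the last residue of the span, so iterating over the whole range may be preferable in that case.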
Visualizing residues
StructProp.view_structure(only_chains=None, opacity=1.0, recolor=False, gui=False)[source]

Use NGLviewer to display a structure in a Jupyter notebook

Parameters:
  • only_chains (str, list) – Chain ID or IDs to display
  • opacity (float) – Opacity of the structure
  • recolor (bool) – If structure should be cleaned and recolored to silver
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

StructProp.add_residues_highlight_to_nglview(view, structure_resnums, chain=None, res_color='red')[source]

Add a residue number or numbers to an NGLWidget view object.

Parameters:
  • view (NGLWidget) – NGLWidget view object
  • structure_resnums (int, list) – Residue number(s) to highlight, structure numbering
  • chain (str, list) – Chain ID or IDs that the residues belong to. If not provided, all chains in the mapped_chains attribute will be used. If that is also empty, an exception is raised.
  • res_color (str) – Color to highlight residues with
In [31]:
for g in my_gempro.genes:

    # Gather residue numbers
    metal_binding_structure_residues = []
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
            metal_binding_structure_residues.append(res_info['struct_resnum'])
    print(metal_binding_structure_residues)

    # Display structure
    view = g.protein.representative_structure.view_structure()
    g.protein.representative_structure.add_residues_highlight_to_nglview(view=view, structure_resnums=metal_binding_structure_residues)

    view
[435, 501, 504]
[2018-02-05 17:00] [ssbio.protein.structure.structprop] INFO: Selection: ( :X ) and not hydrogen and ( 504 or 435 or 501 )
A Jupyter Widget
[710, 769, 772]
[2018-02-05 17:00] [ssbio.protein.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and ( 769 or 772 or 710 )
A Jupyter Widget
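
The INFO lines in the output above show the selection string that is logged when residues are highlighted. As a rough illustration of that format only, such a string could be composed as follows. build_ngl_selection is a hypothetical helper written for this example and is not the ssbio implementation; it sorts the residue numbers, so the order may differ from the logged one.

```python
def build_ngl_selection(chains, resnums):
    """Compose a selection string in the format shown in the log output."""
    chain_part = ' or '.join(':{}'.format(chain) for chain in chains)
    residue_part = ' or '.join(str(resnum) for resnum in sorted(resnums))
    return '( {} ) and not hydrogen and ( {} )'.format(chain_part, residue_part)


selection = build_ngl_selection(['X'], [435, 501, 504])
print(selection)
```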
Comparing features in different structures of the same protein
In [32]:
# Run all sequence to structure alignments
for g in my_gempro.genes:
    for s in g.protein.structures:
        g.protein.align_seqprop_to_structprop(seqprop=g.protein.representative_sequence, structprop=s)
In [33]:
metal_info_compared = []

for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            for s in g.protein.structures:
                for c in s.mapped_chains:
                    res_info = g.protein.get_residue_annotations(seq_resnum=f.location.end,
                                                                 seqprop=g.protein.representative_sequence,
                                                                 structprop=s, chain_id=c,
                                                                 use_representatives=False)
                    res_info['gene_id'] = g.id
                    res_info['seq_id'] = g.protein.representative_sequence.id
                    res_info['struct_id'] = s.id
                    res_info['chain_id'] = c
                    metal_info_compared.append(res_info)

cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
        'seq_residue', 'seq_resnum', 'struct_residue','struct_resnum',
        'seq_SS-sspro','seq_SS-sspro8','seq_RSA-accpro','seq_RSA-accpro20',
        'struct_SS-dssp','struct_RSA-dssp', 'struct_ASA-dssp',
        'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']

pd.DataFrame.from_records(metal_info_compared, columns=cols).sort_values(by=['seq_resnum','struct_id','chain_id']).set_index(['gene_id','seq_id','seq_resnum','seq_residue','struct_id'])
Out[33]:
gene_id  seq_id  seq_resnum  seq_residue  struct_id        chain_id  struct_residue  struct_resnum  seq_SS-sspro  seq_SS-sspro8  seq_RSA-accpro  seq_RSA-accpro20  struct_SS-dssp  struct_RSA-dssp  struct_ASA-dssp  struct_PHI-dssp  struct_PSI-dssp  struct_CA_DEPTH-msms  struct_RES_DEPTH-msms
b1276    P25516  435         C            ACON1_ECOLI      X         C               435            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b1276    P25516  435         C            E01201           X         C               435            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b1276    P25516  435         C            REP-ACON1_ECOLI  X         C               435            NaN           NaN            NaN             NaN               H               0.059259         8.0              -61.1            -26.6            2.656722              2.813536
b1276    P25516  501         C            ACON1_ECOLI      X         C               501            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b1276    P25516  501         C            E01201           X         C               501            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b1276    P25516  501         C            REP-ACON1_ECOLI  X         C               501            NaN           NaN            NaN             NaN               S               0.088889         12.0             -61.0            -50.0            1.999713              2.409119
b1276    P25516  504         C            ACON1_ECOLI      X         C               504            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b1276    P25516  504         C            E01201           X         C               504            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b1276    P25516  504         C            REP-ACON1_ECOLI  X         C               504            NaN           NaN            NaN             NaN               G               0.259259         35.0             -56.0            -45.6            1.999634              1.961484
b0118    P36683  710         C            1l5j             A         C               710            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  710         C            1l5j             B         C               710            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  710         C            ACON2_ECOLI      X         C               710            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  710         C            E00113           X         C               710            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  710         C            REP-1l5j         A         C               710            NaN           NaN            NaN             NaN               T               0.118519         16.0             -67.1            -7.2             10.148960             10.009109
b0118    P36683  769         C            1l5j             A         C               769            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  769         C            1l5j             B         C               769            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  769         C            ACON2_ECOLI      X         C               769            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  769         C            E00113           X         C               769            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  769         C            REP-1l5j         A         C               769            NaN           NaN            NaN             NaN               -               0.088889         12.0             -67.8            -28.3            8.296585              8.049832
b0118    P36683  772         C            1l5j             A         C               772            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  772         C            1l5j             B         C               772            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  772         C            ACON2_ECOLI      X         C               772            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  772         C            E00113           X         C               772            NaN           NaN            NaN             NaN               NaN             NaN              NaN              NaN              NaN              NaN                   NaN
b0118    P36683  772         C            REP-1l5j         A         C               772            NaN           NaN            NaN             NaN               G               0.081481         11.0             -50.2            -38.0            8.282292              8.239369
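
In the table above, only the REP- structures carry DSSP and MSMS values, presumably because those calculations were run with the default representatives_only=True. A quick way to list which structure and chain pairs lack a given annotation is sketched below. structures_missing and is_missing are hypothetical helpers, and the records are a hand-written subset shaped like the res_info dictionaries collected above, not real pipeline output.

```python
import math


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def structures_missing(records, key):
    """Sorted (struct_id, chain_id) pairs whose records lack `key` or hold NaN."""
    missing = set()
    for record in records:
        if is_missing(record.get(key)):
            missing.add((record['struct_id'], record['chain_id']))
    return sorted(missing)


# Hand-written records for illustration
records = [
    {'struct_id': 'ACON1_ECOLI', 'chain_id': 'X', 'seq_resnum': 435},
    {'struct_id': 'REP-ACON1_ECOLI', 'chain_id': 'X', 'seq_resnum': 435,
     'struct_SS-dssp': 'H'},
    {'struct_id': '1l5j', 'chain_id': 'B', 'seq_resnum': 710,
     'struct_SS-dssp': float('nan')},
]

missing = structures_missing(records, 'struct_SS-dssp')
print(missing)
```

Structures reported here would be candidates for rerunning the calculation with representatives_only=False.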
GEM-PRO - Genes & Sequences

This notebook gives an example of how to run the GEM-PRO pipeline with a dictionary of gene IDs and their protein sequences.

Input: Dictionary of gene IDs and protein sequences
Output: GEM-PRO model
Imports
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. DEBUG is the most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
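
To see how the chosen level filters messages, the following standalone sketch uses only the standard library. It sets up a separate logger named logging_level_demo (an arbitrary name chosen for this example) with a small handler that collects messages in a list, so it does not interfere with the root logger configured above.

```python
import logging


class ListHandler(logging.Handler):
    """Collects (level name, message) pairs instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append((record.levelname, record.getMessage()))


demo_logger = logging.getLogger('logging_level_demo')
demo_logger.propagate = False  # keep demo messages away from the root logger
collector = ListHandler()
demo_logger.handlers = [collector]

demo_logger.setLevel(logging.INFO)
demo_logger.debug('hidden at INFO level')
demo_logger.info('shown at INFO level')
demo_logger.warning('shown at INFO level too')

print(collector.messages)
```

Only messages at or above the configured level reach the handler, which is why switching to DEBUG can produce far more output.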
Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROJECT will be created
  • PROJECT
    • Your project name
  • GENES_AND_SEQUENCES
    • Your dictionary of gene IDs and their protein sequences

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
Note: Methods for protein complexes and metabolites are still in development.
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'genes_and_sequences_GP'
GENES_AND_SEQUENCES = {'b0870': 'MIDLRSDTVTRPSRAMLEAMMAAPVGDDVYGDDPTVNALQDYAAELSGKEAAIFLPTGTQANLVALLSHCERGEEYIVGQAAHNYLFEAGGAAVLGSIQPQPIDAAADGTLPLDKVAMKIKPDDIHFARTKLLSLENTHNGKVLPREYLKEAWEFTRERNLALHVDGARIFNAVVAYGCELKEITQYCDSFTICLSKGLGTPVGSLLVGNRDYIKRAIRWRKMTGGGMRQSGILAAAGIYALKNNVARLQEDHDNAAWMAEQLREAGADVMRQDTNMLFVRVGEENAAALGEYMKARNVLINASPIVRLVTHLDVSREQLAEVAAHWRAFLAR',
                       'b3041': 'MNQTLLSSFGTPFERVENALAALREGRGVMVLDDEDRENEGDMIFPAETMTVEQMALTIRHGSGIVCLCITEDRRKQLDLPMMVENNTSAYGTGFTVTIEAAEGVTTGVSAADRITTVRAAIADGAKPSDLNRPGHVFPLRAQAGGVLTRGGHTEATIDLMTLAGFKPAGVLCELTNDDGTMARAPECIEFANKHNMALVTIEDLVAYRQAHERKAS'}
PDB_FILE_TYPE = 'mmtf'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_and_sequences=GENES_AND_SEQUENCES, pdb_file_type=PDB_FILE_TYPE)
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: /tmp/genes_and_sequences_GP: GEM-PRO project location
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: 2: number of genes
Mapping sequence –> structure

Since the sequences have been provided, we just need to BLAST them to the PDB.

Note: These methods do not download any 3D structure files.

Methods
GEMPRO.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]

BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).

Parameters:
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent sequence identity (in decimal form)
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
  • display_link (bool, optional) – Set to True if links to the HTML results should be displayed
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
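To make the two cutoffs concrete, here is a minimal sketch of how such a filter could behave on a list of hits. It assumes that seq_ident_cutoff is compared against the hit's identity fraction and that evalue is an upper bound; the function filter_blast_hits and the PDB IDs are hypothetical, and only the key names are borrowed from the df_pdb_blast columns.

```python
def filter_blast_hits(hits, seq_ident_cutoff=0.0, evalue=0.0001):
    """Keep hits that pass both cutoffs (illustrative, not ssbio's code)."""
    kept = []
    for hit in hits:
        # Assumption: a hit is significant when its E-value is at or below the cutoff
        if hit["hit_evalue"] > evalue:
            continue
        # Assumption: the identity fraction must reach the cutoff (decimal form)
        if hit["hit_percent_ident"] < seq_ident_cutoff:
            continue
        kept.append(hit)
    return kept


# Made-up hits, for illustration only
example_hits = [
    {"pdb_id": "aaaa", "hit_evalue": 0.0, "hit_percent_ident": 1.0},
    {"pdb_id": "bbbb", "hit_evalue": 1e-30, "hit_percent_ident": 0.45},
    {"pdb_id": "cccc", "hit_evalue": 0.01, "hit_percent_ident": 0.95},
]
kept_ids = [
    h["pdb_id"]
    for h in filter_blast_hits(example_hits, seq_ident_cutoff=0.9, evalue=0.00001)
]
```

With these made-up hits, only the first one passes: the second fails the identity cutoff and the third fails the E-value cutoff.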
In [8]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)

[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: 2: number of genes with additional structures added from BLAST
Out[8]:
gene   pdb_id  pdb_chain_id  hit_score  hit_evalue  hit_percent_similar  hit_percent_ident  hit_num_ident  hit_num_similar
b0870  3wlx    A             1713.0     0.0         1.0                  1.0                333            333
b0870  3wlx    B             1713.0     0.0         1.0                  1.0                333            333
Downloading and ranking structures
Methods
GEMPRO.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to each protein’s structures directory.

Parameters:
  • outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
  • pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
  • force_rerun (bool) – If files should be re-downloaded if they already exist
Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [9]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)

[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Saved 11 structures total
Out[9]:
gene   pdb_id  pdb_title                                          description                                        experimental_method  mapped_chains  resolution  chemicals   taxonomy_name     structure_file
b0870  3wlx    Crystal structure of low-specificity L-threoni...  Low specificity L-threonine aldolase (E.C.4.1....  X-RAY DIFFRACTION    A;B            2.51        PLG         Escherichia coli  3wlx.mmtf
b0870  4lnj    Structure of Escherichia coli Threonine Aldola...  Low-specificity L-threonine aldolase (E.C.4.1....  X-RAY DIFFRACTION    A;B            2.10        EPE;MG;PLR  Escherichia coli  4lnj.mmtf
GEMPRO.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]

Set the representative structure for each protein, chosen from the structures in its structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

  • Homology model(s)
  • Ranked PDBs
  • BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
  • struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
  • clean (bool) – If structures should be cleaned
  • force_rerun (bool) – If sequence to structure alignment should be rerun

Todo

  • Remedy large structure representative setting
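The allow_missing_on_termini example above can be worked through in a few lines. This is a sketch under the assumption that the ignored fraction is split evenly between the two termini; the helper checked_residue_window is hypothetical, and whether the returned bounds are treated as inclusive is not specified by the documentation.

```python
def checked_residue_window(seq_len, allow_missing_on_termini):
    """Return (start, end) of the region that is checked for modifications.

    Assumption: half of the ignored fraction is trimmed from each terminus.
    """
    trim = int(round(seq_len * allow_missing_on_termini / 2))
    return trim, seq_len - trim


# The documented example: 0.1 on a 100 AA reference sequence
window_doc_example = checked_residue_window(100, 0.1)
# The default of 0.2 on a 333 AA sequence (the length of the b0870 hits above)
window_default = checked_residue_window(333, 0.2)
```

For the documented case this gives residues 5 to 95, matching the parameter description.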
In [10]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[10]:
gene   id        is_experimental  file_type  structure_file
b0870  REP-3wlx  True             pdb        3wlx-A_clean.pdb
b3041  REP-1iez  True             pdb        1iez-A_clean.pdb
In [11]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b0870').protein.representative_structure
my_gempro.genes.get_by_id('b0870').protein.representative_structure.get_dict()
Out[11]:
<StructProp REP-3wlx at 0x7f9a0ae345f8>
Out[11]:
{'_structure_dir': '/tmp/genes_and_sequences_GP/genes/b0870/b0870_protein/structures',
 'chains': [<ChainProp A at 0x7f99fbd55710>],
 'date': None,
 'description': 'Low specificity L-threonine aldolase (E.C.4.1.2.48)',
 'file_type': 'pdb',
 'id': 'REP-3wlx',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'notes': {},
 'original_structure_id': '3wlx',
 'resolution': 2.51,
 'structure_file': '3wlx-A_clean.pdb',
 'taxonomy_name': 'Escherichia coli'}
Creating homology models

For those proteins with no representative structure, we can create homology models. ssbio contains some built-in functions for easily running I-TASSER locally or on machines with SLURM (e.g. on NERSC) or Torque job scheduling.

Once the I-TASSER models complete, you can load them using get_itasser_models.

Info: Homology modeling can take a long time - about 24-72 hours per protein (highly dependent on the sequence length, as well as if there are available templates).

Methods
In [12]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_modeling('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 0 genes in folder /tmp/genes_and_sequences_GP/data/homology_models
Saving your GEM-PRO

Warning: Saving is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.

GEMPRO.save_json(outfile, compression=False)

Save the object as a JSON file using json_tricks

In [13]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2018-02-05 18:12] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:12] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_and_sequences_GP) to /tmp/genes_and_sequences_GP/model/genes_and_sequences_GP.json
GEM-PRO - List of Gene IDs

This notebook gives an example of how to run the GEM-PRO pipeline with a list of gene IDs.

Input: List of gene IDs
Output: GEM-PRO model
Imports
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. DEBUG is the most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
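To see what a level actually filters, here is a small standard-library sketch. The logger name ssbio_level_demo and the helper messages_shown are made up for illustration, and the sketch uses its own logger so that it doesn't touch the root logger configured in this notebook.

```python
import io
import logging


def messages_shown(level):
    """Emit one DEBUG, one INFO and one WARNING message; return what gets through."""
    stream = io.StringIO()
    demo_logger = logging.getLogger("ssbio_level_demo")  # hypothetical logger name
    demo_logger.propagate = False  # keep the demo out of the root logger's handlers
    demo_logger.handlers = [logging.StreamHandler(stream)]
    demo_logger.setLevel(level)
    demo_logger.debug("debug detail")
    demo_logger.info("info summary")
    demo_logger.warning("warning note")
    return stream.getvalue().splitlines()


shown_at_info = messages_shown(logging.INFO)
shown_at_warning = messages_shown(logging.WARNING)
```

At INFO the debug message is dropped; at WARNING only the warning remains.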
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROJECT will be created
  • PROJECT
    • Your project name
  • LIST_OF_GENES
    • Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
Note: Methods for protein complexes and metabolites are still in development.
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'genes_GP'
LIST_OF_GENES = ['b0761', 'b0889', 'b0995', 'b1013', 'b1014', 'b1040', 'b1130', 'b1187', 'b1221', 'b1299']
PDB_FILE_TYPE = 'mmtf'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type=PDB_FILE_TYPE)
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: /tmp/genes_GP: GEM-PRO project location
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10: number of genes
Mapping gene ID –> sequence

First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

Note: You only need to map gene IDs using one service. However, you can run both if some genes don’t map in one service but do map in the other!
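If you do run both services, the genes that still need attention are the ones missing from both. A minimal sketch, assuming you have two lists like those stored in the missing_kegg_mapping and missing_uniprot_mapping attributes used in this notebook; the helper name and the gene IDs here are made up.

```python
def still_unmapped(missing_kegg, missing_uniprot):
    """Return the genes that failed to map in both services, sorted by ID."""
    return sorted(set(missing_kegg) & set(missing_uniprot))


# Made-up gene IDs, for illustration only
demo_missing_kegg = ["geneB", "geneC"]
demo_missing_uniprot = ["geneA", "geneC"]
unmapped = still_unmapped(demo_missing_kegg, demo_missing_uniprot)
```

Here geneA and geneB are each covered by one of the services, so only geneC is left over.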

Methods
GEMPRO.kegg_mapping_and_metadata(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:
  1. Download all metadata and sequence files to the sequences directory
  2. Create a KEGGProp object in the protein.sequences attribute
  3. Return a Pandas DataFrame of mapping results
Parameters:
  • kegg_organism_code (str) – The three-letter KEGG code of your organism
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
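As an illustration of what custom_gene_mapping expresses, the sketch below translates model gene IDs into the IDs that would be sent to the mapping service. The helper ids_to_query and the ID model_gene_1 are hypothetical, and falling back to the model gene ID when a gene is absent from the dictionary is an assumption of this sketch, not documented ssbio behavior.

```python
def ids_to_query(model_gene_ids, custom_gene_mapping=None):
    """Map each model gene ID to the ID that should be looked up.

    Assumption: genes missing from custom_gene_mapping are queried unchanged.
    """
    mapping = custom_gene_mapping or {}
    return {gene_id: mapping.get(gene_id, gene_id) for gene_id in model_gene_ids}


# 'model_gene_1' is a made-up model ID that should be looked up as b0761
query_ids = ids_to_query(["model_gene_1", "b0889"], {"model_gene_1": "b0761"})
```

The dictionary keys are the model gene IDs, as the parameter description requires.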
In [8]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='eco')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10/10: number of genes mapped to KEGG
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.
Missing KEGG mapping:  []
Out[8]:
gene   kegg       refseq     uniprot  pdbs                                               sequence_file  metadata_file
b0761  eco:b0761  NP_415282  P0A9G8   1B9M;1H9S;1B9N;1O7L;1H9R                           eco-b0761.faa  eco-b0761.kegg
b0889  eco:b0889  NP_415409  P0ACJ0   2GQQ;2L4A                                          eco-b0889.faa  eco-b0889.kegg
b0995  eco:b0995  NP_415515  P38684   1ZGZ                                               eco-b0995.faa  eco-b0995.kegg
b1013  eco:b1013  NP_415533  P0ACU2   4JYK;4XK4;4X1E;3LOC                                eco-b1013.faa  eco-b1013.kegg
b1014  eco:b1014  NP_415534  P09546   3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1...  eco-b1014.faa  eco-b1014.kegg
GEMPRO.uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.

Parameters:
  • model_gene_source (str) – The database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping. Common model gene sources are:
    • Ensembl Genomes - ENSEMBLGENOME_ID (e.g. E. coli b-numbers)
    • Entrez Gene (GeneID) - P_ENTREZGENEID
    • RefSeq Protein - P_REFSEQ_AC
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
In [9]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2018-02-05 18:12] [root] INFO: getUserAgent: Begin
[2018-02-05 18:12] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:12] [root] INFO: getUserAgent: End
[2018-02-05 18:12] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:12] [root] WARNING: Results seems empty...returning empty dictionary.

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 0/10: number of genes mapped to UniProt
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Missing UniProt mapping:  ['b1130', 'b0889', 'b1221', 'b0761', 'b0995', 'b1187', 'b1013', 'b1299', 'b1014', 'b1040']
[2018-02-05 18:12] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[9]:
uniprot reviewed gene_name kegg refseq num_pdbs pdbs ec_number pfam seq_len description entry_date entry_version seq_date seq_version sequence_file metadata_file
gene
GEMPRO.set_representative_sequence(force_rerun=False)[source]

Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.

Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.

Parameters:
  • force_rerun (bool) – Set to True to recheck stored sequences
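The precedence rules described above can be written down as a small decision function. This is a sketch of the stated rules only, not ssbio's implementation: the function pick_representative and the dictionary layout (an entry with a 'pdbs' list) are assumptions made for illustration.

```python
def pick_representative(manual=None, uniprot=None, kegg=None):
    """Return which source wins: 'manual', 'uniprot', 'kegg', or None.

    Each argument is None (not loaded) or a dict with a 'pdbs' list.
    """
    # Manually set sequences override everything else
    if manual is not None:
        return "manual"
    if uniprot is not None and kegg is not None:
        # UniProt wins unless only the KEGG entry has PDBs associated with it
        if kegg.get("pdbs") and not uniprot.get("pdbs"):
            return "kegg"
        return "uniprot"
    if uniprot is not None:
        return "uniprot"
    if kegg is not None:
        return "kegg"
    return None


choice_default = pick_representative(uniprot={"pdbs": []}, kegg={"pdbs": []})
choice_kegg_has_pdbs = pick_representative(uniprot={"pdbs": []}, kegg={"pdbs": ["1ZGZ"]})
```

The second call shows the one exception: the KEGG entry is chosen because it carries a PDB ID and the UniProt entry does not.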
In [10]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 10/10: number of genes with a representative sequence
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.
Missing a representative sequence:  []
Out[10]:
gene   uniprot  kegg       pdbs                                               sequence_file  metadata_file
b0761  P0A9G8   eco:b0761  1B9M;1H9S;1B9N;1O7L;1H9R                           eco-b0761.faa  eco-b0761.kegg
b0889  P0ACJ0   eco:b0889  2GQQ;2L4A                                          eco-b0889.faa  eco-b0889.kegg
b0995  P38684   eco:b0995  1ZGZ                                               eco-b0995.faa  eco-b0995.kegg
b1013  P0ACU2   eco:b1013  4JYK;4XK4;4X1E;3LOC                                eco-b1013.faa  eco-b1013.kegg
b1014  P09546   eco:b1014  3E2Q;4JNZ;3E2R;4JNY;2GPE;4O8A;3E2S;2FZN;1TJ1;1...  eco-b1014.faa  eco-b1014.kegg
Mapping representative sequence –> structure

These are the ways to map a sequence to a structure:

  1. Use the UniProt ID and its automatic mappings to the PDB
  2. BLAST the sequence to the PDB
  3. Make homology models
  4. Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.

Methods
GEMPRO.map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]

Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.

The “Best structures” API is documented at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html. It returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the coverage is the same, by resolution.

Parameters:
  • seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
  • outdir (str) – Output directory to cache JSON results of search
  • force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns:
  A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type:
  list
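The ranking described above (coverage first, resolution as the tie-breaker) can be reproduced with a two-part sort key. This is a minimal sketch on made-up entries; the key names coverage and resolution and the helper rank_best_structures are assumptions for illustration, and lower resolution values are taken to be better.

```python
def rank_best_structures(entries):
    """Sort by coverage (highest first), then by resolution (lowest first)."""
    return sorted(entries, key=lambda e: (-e["coverage"], e["resolution"]))


# Made-up entries, for illustration only
demo_entries = [
    {"pdb_id": "aaaa", "coverage": 0.80, "resolution": 1.5},
    {"pdb_id": "bbbb", "coverage": 1.00, "resolution": 2.5},
    {"pdb_id": "cccc", "coverage": 1.00, "resolution": 1.9},
]
ranked_ids = [e["pdb_id"] for e in rank_best_structures(demo_entries)]
```

The two full-coverage entries come first, ordered by resolution, and the partial-coverage entry comes last even though its resolution is the best.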

In [11]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 18:12] [root] INFO: getUserAgent: Begin
[2018-02-05 18:12] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:12] [root] INFO: getUserAgent: End
[2018-02-05 18:12] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:12] [root] WARNING: Results seems empty...returning empty dictionary.

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 0/10: number of genes with at least one experimental structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[11]:
GEMPRO.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]

BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).

Parameters:
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent sequence identity (in decimal form)
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
  • display_link (bool, optional) – Set to True if links to the HTML results should be displayed
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
In [12]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 5: number of genes with additional structures added from BLAST
Out[12]:
gene   pdb_id  pdb_chain_id  hit_score  hit_evalue     hit_percent_similar  hit_percent_ident  hit_num_ident  hit_num_similar
b0761  1b9n    B             1091.0     5.530720e-119  0.931298             0.931298           244            244
b0761  1o7l    D             1089.0     1.096280e-118  0.931298             0.931298           244            244
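Since the BLAST summary holds several rows per gene, you may want just the top-scoring hit for each one. Here is a plain-Python sketch using the two rows shown above; the helper best_hit_per_gene is hypothetical, and on the real dataframe you would more likely use pandas directly.

```python
def best_hit_per_gene(rows):
    """Return a dict of gene ID -> row with the highest hit_score."""
    best = {}
    for row in rows:
        current = best.get(row["gene"])
        if current is None or row["hit_score"] > current["hit_score"]:
            best[row["gene"]] = row
    return best


# The two rows displayed above, reduced to the columns needed here
blast_rows = [
    {"gene": "b0761", "pdb_id": "1b9n", "pdb_chain_id": "B", "hit_score": 1091.0},
    {"gene": "b0761", "pdb_id": "1o7l", "pdb_chain_id": "D", "hit_score": 1089.0},
]
top_hits = best_hit_per_gene(blast_rows)
```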
Downloading and ranking structures
Methods
GEMPRO.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to each protein’s structures directory.

Parameters:
  • outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
  • pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
  • force_rerun (bool) – If files should be re-downloaded if they already exist
Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [13]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Saved 15 structures total
Out[13]:
gene   chemicals  description                                        experimental_method  mapped_chains  pdb_id  pdb_title                        resolution  structure_file  taxonomy_name
b0761  NI         ModE (MOLYBDATE-DEPENDENT TRANSCRIPTIONAL REGU...  X-RAY DIFFRACTION    A;B            1b9m    REGULATOR FROM ESCHERICHIA COLI  1.75        1b9m.mmtf       Escherichia coli
b0761  NI         MODE (MOLYBDATE DEPENDENT TRANSCRIPTIONAL REGU...  X-RAY DIFFRACTION    A;B            1b9n    REGULATOR FROM ESCHERICHIA COLI  2.09        1b9n.mmtf       Escherichia coli
GEMPRO.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]

Set the representative structure for each protein, chosen from the structures in its structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

  • Homology model(s)
  • Ranked PDBs
  • BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
  • struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
  • clean (bool) – If structures should be cleaned
  • force_rerun (bool) – If sequence to structure alignment should be rerun

Todo

  • Remedy large structure representative setting
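The always_use_homology rule described above can be sketched as follows. Only the stated behavior is modelled (homology models win when the flag is set, ranked by percent sequence coverage); the helper homology_choice, the seq_coverage key, and the model IDs are made up, and the rest of the selection logic is deliberately left out.

```python
def homology_choice(homology_models, always_use_homology):
    """Return the ID of the homology model to use, or None.

    None means this rule does not apply and the remaining selection
    logic (not modelled here) would decide.
    """
    if not (always_use_homology and homology_models):
        return None
    # Several models: rank by percent sequence coverage, highest first
    return max(homology_models, key=lambda m: m["seq_coverage"])["id"]


# Made-up homology models, for illustration only
demo_models = [
    {"id": "model_a", "seq_coverage": 0.85},
    {"id": "model_b", "seq_coverage": 0.97},
]
chosen_with_flag = homology_choice(demo_models, always_use_homology=True)
chosen_without_flag = homology_choice(demo_models, always_use_homology=False)
```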
In [14]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 5/10: number of genes with a representative structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[14]:
gene   id        is_experimental  file_type  structure_file
b0761  REP-1b9n  True             pdb        1b9n-A_clean.pdb
b0889  REP-2gqq  True             pdb        2gqq-A_clean.pdb
b1013  REP-4xk4  True             pdb        4xk4-A_clean.pdb
b1187  REP-1h9t  True             pdb        1h9t-A_clean.pdb
b1221  REP-1rnl  True             pdb        1rnl-A_clean.pdb
In [15]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b1187').protein.representative_structure
my_gempro.genes.get_by_id('b1187').protein.representative_structure.get_dict()
Out[15]:
<StructProp REP-1h9t at 0x7f3880a275c0>
Out[15]:
{'_structure_dir': '/tmp/genes_GP/genes/b1187/b1187_protein/structures',
 'chains': [<ChainProp A at 0x7f38830e1c18>],
 'date': None,
 'description': 'FATTY ACID METABOLISM REGULATOR PROTEIN',
 'file_type': 'pdb',
 'id': 'REP-1h9t',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'notes': {},
 'original_structure_id': '1h9t',
 'resolution': 3.25,
 'structure_file': '1h9t-A_clean.pdb',
 'taxonomy_name': 'ESCHERICHIA COLI'}
Saving your GEM-PRO

Warning: Saving is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.

GEMPRO.save_json(outfile, compression=False)

Save the object as a JSON file using json_tricks

In [16]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2018-02-05 18:12] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:12] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_GP) to /tmp/genes_GP/model/genes_GP.json
GEM-PRO - SBML Model

This notebook gives an example of how to run the GEM-PRO pipeline with an SBML model, in this case iNJ661, the metabolic model of M. tuberculosis.

Input: GEM (in SBML, JSON, or MAT formats)
Output: GEM-PRO model
Imports
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. DEBUG is the most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROJECT will be created
  • PROJECT
    • Your project name
  • GEM_FILE
    • The path to your GEM file (in SBML, JSON, or MAT format)

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
Note: Methods for protein complexes and metabolites are still in development.
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'mtuberculosis_gp'
GEM_FILE = '../../ssbio/test/test_files/models/iNJ661.json'
GEM_FILE_TYPE = 'json'
PDB_FILE_TYPE = 'mmtf'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, gem_file_path=GEM_FILE, gem_file_type=GEM_FILE_TYPE, pdb_file_type=PDB_FILE_TYPE)
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: /tmp/mtuberculosis_gp: GEM-PRO project location
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: iNJ661: loaded model
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 1025: number of reactions
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 720: number of reactions linked to a gene
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 661: number of genes (excluding spontaneous)
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 826: number of metabolites
[2018-02-05 18:13] [ssbio.pipeline.gempro] WARNING: IMPORTANT: All Gene objects have been transformed into GenePro objects, and will be for any new ones
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 661: number of genes
Mapping gene ID –> sequence

First, we need to map these IDs to their protein sequences. Two ID mapping services are provided for this: KEGG and UniProt. The end goal is to map each gene ID to a UniProt ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

Note: You only need to map gene IDs using one service. However, you can run both if some genes don’t map in one service but do map in the other.

You don’t need to use these services at all if you already have the amino acid sequence of each protein: load the sequences manually with the method manual_seq_mapping. Similarly, if you already have the UniProt IDs, load them with the method manual_uniprot_mapping.

Methods
GEMPRO.manual_seq_mapping(gene_to_seq_dict, outdir=None, write_fasta_files=True, set_as_representative=True)[source]

Read a manual input dictionary of model gene IDs –> protein sequences. By default sets them as representative.

Parameters:
  • gene_to_seq_dict (dict) – Mapping of gene IDs to their protein sequence strings
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • write_fasta_files (bool) – If individual protein FASTA files should be written out
  • set_as_representative (bool) – If mapped sequences should be set as representative
In [8]:
gene_to_seq_dict = {'Rv1295': 'MTVPPTATHQPWPGVIAAYRDRLPVGDDWTPVTLLEGGTPLIAATNLSKQTGCTIHLKVEGLNPTGSFKDRGMTMAVTDALAHGQRAVLCASTGNTSASAAAYAARAGITCAVLIPQGKIAMGKLAQAVMHGAKIIQIDGNFDDCLELARKMAADFPTISLVNSVNPVRIEGQKTAAFEIVDVLGTAPDVHALPVGNAGNITAYWKGYTEYHQLGLIDKLPRMLGTQAAGAAPLVLGEPVSHPETIATAIRIGSPASWTSAVEAQQQSKGRFLAASDEEILAAYHLVARVEGVFVEPASAASIAGLLKAIDDGWVARGSTVVCTVTGNGLKDPDTALKDMPSVSPVPVDPVAVVEKLGLA',
                    'Rv2233': 'VSSPRERRPASQAPRLSRRPPAHQTSRSSPDTTAPTGSGLSNRFVNDNGIVTDTTASGTNCPPPPRAAARRASSPGESPQLVIFDLDGTLTDSARGIVSSFRHALNHIGAPVPEGDLATHIVGPPMHETLRAMGLGESAEEAIVAYRADYSARGWAMNSLFDGIGPLLADLRTAGVRLAVATSKAEPTARRILRHFGIEQHFEVIAGASTDGSRGSKVDVLAHALAQLRPLPERLVMVGDRSHDVDGAAAHGIDTVVVGWGYGRADFIDKTSTTVVTHAATIDELREALGV'}
my_gempro.manual_seq_mapping(gene_to_seq_dict)
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences
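If your sequences live in a FASTA file rather than in a hand-written dictionary, a dictionary of the shape expected by manual_seq_mapping can be built with a few lines of plain Python. This is a minimal sketch: the function `fasta_to_dict`, the gene IDs `geneA` and `geneB`, and the short sequences are illustrative assumptions, and it takes the first word of each header as the gene ID.

```python
def fasta_to_dict(fasta_text):
    """Parse FASTA-formatted text into {record_id: sequence}.

    The record ID is the first whitespace-delimited token of each header line.
    """
    seqs = {}
    current_id = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else None
            if current_id is not None:
                seqs[current_id] = ''
        elif current_id is not None:
            # Sequence lines may be wrapped, so append to the current record
            seqs[current_id] += line
    return seqs


# Hypothetical two-record FASTA text
example_fasta = """>geneA hypothetical protein
MTVPPTATHQ
PWPGVIAAYR
>geneB
VSSPRERRPA
"""
gene_to_seq = fasta_to_dict(example_fasta)
print(gene_to_seq)
```

The resulting dictionary could then be passed as gene_to_seq_dict, provided its keys match the gene IDs in your model. For real files, a dedicated parser such as Biopython's SeqIO is the more robust choice.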
GEMPRO.manual_uniprot_mapping(gene_to_uniprot_dict, outdir=None, set_as_representative=True)[source]

Read a manual dictionary of model gene IDs –> UniProt IDs. By default sets them as representative.

This allows for mapping of the missing genes, or overriding of automatic mappings.

Input a dictionary of:

{
    <gene_id1>: <uniprot_id1>,
    <gene_id2>: <uniprot_id2>,
}
Parameters:
  • gene_to_uniprot_dict – Dictionary of mappings as shown above
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
In [9]:
manual_uniprot_dict = {'Rv1755c': 'P9WIA9', 'Rv2321c': 'P71891', 'Rv0619': 'Q79FY3', 'Rv0618': 'Q79FY4', 'Rv2322c': 'P71890'}
my_gempro.manual_uniprot_mapping(manual_uniprot_dict)
my_gempro.df_uniprot_metadata.tail(4)

[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Completed manual ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Out[9]:
uniprot reviewed gene_name kegg refseq pfam description entry_date entry_version seq_date seq_version sequence_file metadata_file
gene
Rv0619 Q79FY3 False galTb NaN NaN PF02744 Probable galactose-1-phosphate uridylyltransfe... 2017-07-05 78 2004-07-05 1 Q79FY3.fasta Q79FY3.xml
Rv1755c P9WIA9 False plcD NaN NaN PF04185 Phospholipase C 4 2017-07-05 18 2014-04-16 1 P9WIA9.fasta P9WIA9.xml
Rv2321c P71891 False rocD2 mtv:RVBD_2321c WP_003411956.1 PF00202 Probable ornithine aminotransferase (C-terminu... 2017-07-05 116 1997-02-01 1 P71891.fasta P71891.xml
Rv2322c P71890 False rocD1 mtv:RVBD_2322c WP_003411957.1 PF00202 Probable ornithine aminotransferase (N-terminu... 2017-06-07 117 1997-02-01 1 P71890.fasta P71890.xml
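Before passing a hand-typed dictionary to manual_uniprot_mapping, it can help to check that every value at least looks like a UniProt accession. The sketch below uses the accession pattern published in the UniProt help pages; the helper `invalid_accessions` and the deliberately broken entry `typo` are assumptions made for this example, and passing the check does not guarantee that the accession exists.

```python
import re

# Accession format as described in the UniProt help pages
UNIPROT_ACC = re.compile(
    r'^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$'
)


def invalid_accessions(gene_to_uniprot):
    """Return {gene: accession} for values that do not match the accession format."""
    return {gene: acc for gene, acc in gene_to_uniprot.items()
            if not UNIPROT_ACC.match(acc)}


candidate = {
    'Rv1755c': 'P9WIA9',
    'Rv2321c': 'P71891',
    'Rv0619': 'Q79FY3',
    'typo': 'P7189',  # hypothetical entry that is one character short
}
bad = invalid_accessions(candidate)
print(bad)
```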
GEMPRO.kegg_mapping_and_metadata(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:
  1. Downloads all metadata and sequence files to the sequences directory
  2. Creates a KEGGProp object in the protein.sequences attribute
  3. Returns a Pandas DataFrame of mapping results
Parameters:
  • kegg_organism_code (str) – The three-letter KEGG code of your organism
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
In [10]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='mtu')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no metadata file available
[2018-02-05 18:14] [ssbio.core.protein] WARNING: Rv2233: representative sequence does not match mapped KEGG sequence.
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0618: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no metadata file available

[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: 655/661: number of genes mapped to KEGG
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.
Missing KEGG mapping:  ['Rv0618', 'Rv2233', 'Rv0619', 'Rv1755c', 'Rv2322c', 'Rv2321c']
Out[10]:
kegg refseq uniprot pdbs sequence_file metadata_file
gene
Rv0013 mtu:Rv0013 YP_177615 I6WX77 NaN mtu-Rv0013.faa mtu-Rv0013.kegg
Rv0032 mtu:Rv0032 NP_214546 I6Y6Q7 NaN mtu-Rv0032.faa mtu-Rv0032.kegg
Rv0046c mtu:Rv0046c NP_214560 I6X8D3 1GR0 mtu-Rv0046c.faa mtu-Rv0046c.kegg
Rv0066c mtu:Rv0066c NP_214580 L0T2B7 NaN mtu-Rv0066c.faa mtu-Rv0066c.kegg
Rv0069c mtu:Rv0069c NP_214583 A0A089S0Y8 NaN mtu-Rv0069c.faa mtu-Rv0069c.kegg
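The "655/661" figure in the log can be reproduced from the list of unmapped genes. The helper `mapping_coverage` below is a hypothetical convenience function, not part of ssbio; it only does the arithmetic.

```python
def mapping_coverage(total_genes, missing):
    """Return (mapped_count, fraction_mapped) given the genes that failed to map."""
    mapped = total_genes - len(set(missing))
    return mapped, mapped / total_genes


# Values copied from the cell output above
missing_kegg = ['Rv0618', 'Rv2233', 'Rv0619', 'Rv1755c', 'Rv2322c', 'Rv2321c']
mapped, fraction = mapping_coverage(661, missing_kegg)
print('{}/{} genes mapped ({:.1%})'.format(mapped, 661, fraction))
```

In a live session the same numbers could be derived from `my_gempro.missing_kegg_mapping` and the number of genes in the model.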
GEMPRO.uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.

Parameters:
  • model_gene_source (str) –

    The database source of your model gene IDs (see http://www.uniprot.org/help/api_idmapping for the full list). Common model gene sources are:

    • Ensembl Genomes - ENSEMBLGENOME_ID (e.g. E. coli b-numbers)
    • Entrez Gene (GeneID) - P_ENTREZGENEID
    • RefSeq Protein - P_REFSEQ_AC
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
In [11]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='TUBERCULIST_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2018-02-05 18:14] [root] INFO: getUserAgent: Begin
[2018-02-05 18:14] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:14] [root] INFO: getUserAgent: End
[2018-02-05 18:15] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:15] [root] WARNING: Results seems empty...returning empty dictionary.

[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 0/661: number of genes mapped to UniProt
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Missing UniProt mapping:  ['Rv2178c', 'Rv2476c', 'Rv3607c', 'Rv1202', 'Rv1030', 'Rv1187', 'Rv2421c', 'Rv2247', 'Rv2701c', 'Rv1662', 'Rv1380', 'Rv2763c', 'Rv2435c', 'Rv0082', 'Rv0189c', 'Rv3757c', 'Rv1133c', 'Rv1257c', 'Rv1623c', 'Rv0958', 'Rv1162', 'Rv3581c', 'Rv1739c', 'Rv1086', 'Rv1077', 'Rv1625c', 'Rv3227', 'Rv2832c', 'Rv3308', 'Rv2858c', 'Rv2193', 'Rv2834c', 'Rv2995c', 'Rv1915', 'Rv3003c', 'Rv1538c', 'Rv2793c', 'Rv2152c', 'Rv3468c', 'Rv1485', 'Rv1079', 'Rv1552', 'Rv1161', 'Rv3307', 'Rv0768', 'Rv1843c', 'Rv1302', 'Rv0780', 'Rv2064', 'Rv2124c', 'Rv1908c', 'Rv1555', 'Rv2754c', 'Rv0346c', 'Rv2467', 'Rv2384', 'Rv3758c', 'Rv1338', 'Rv1311', 'Rv1599', 'Rv1589', 'Rv2930', 'Rv0564c', 'Rv2070c', 'Rv0143c', 'Rv0107c', 'Rv2157c', 'Rv2847c', 'Rv1416', 'Rv0147', 'Rv0432', 'Rv3152', 'Rv0098', 'Rv0548c', 'Rv1408', 'Rv2320c', 'Rv2859c', 'Rv3314c', 'Rv3331', 'Rv3393', 'Rv0509', 'Rv2062c', 'Rv1248c', 'Rv1895', 'Rv2870c', 'Rv1131', 'Rv1294', 'Rv0337c', 'Rv0650', 'Rv1001', 'Rv2211c', 'Rv2471', 'Rv2075c', 'Rv1622c', 'Rv3045', 'Rv0772', 'Rv1607', 'Rv2601', 'Rv2363', 'Rv2483c', 'Rv1553', 'Rv1559', 'Rv3464', 'Rv3010c', 'Rv2552c', 'Rv1031', 'Rv1905c', 'Rv3815c', 'Rv2935', 'Rv3469c', 'Rv1595', 'Rv1293', 'Rv1099c', 'Rv3398c', 'Rv2764c', 'Rv1604', 'Rv2399c', 'Rv1695', 'Rv0253', 'Rv0046c', 'Rv3634c', 'Rv2982c', 'Rv0414c', 'Rv3410c', 'Rv3042c', 'Rv1659', 'Rv0013', 'Rv3582c', 'Rv2051c', 'Rv1082', 'Rv1373', 'Rv1570', 'Rv2236c', 'Rv1093', 'Rv1445c', 'Rv3602c', 'Rv2502c', 'Rv1436', 'Rv1849', 'Rv2382c', 'Rv0952', 'Rv2139', 'Rv0956', 'Rv3283', 'Rv3310', 'Rv0373c', 'Rv2243', 'Rv1612', 'Rv1307', 'Rv3229c', 'Rv3265c', 'Rv2220', 'Rv2786c', 'Rv0382c', 'Rv2583c', 'Rv2678c', 'Rv0896', 'Rv1207', 'Rv0183', 'Rv3913', 'Rv2940c', 'Rv3153', 'Rv1731', 'Rv0973c', 'Rv3247c', 'Rv0728c', 'Rv2987c', 'Rv3846', 'Rv3150', 'Rv0234c', 'Rv3794', 'Rv3436c', 'Rv2427c', 'Rv3257c', 'Rv2539c', 'Rv2965c', 'Rv2156c', 'Rv3791', 'Rv1296', 'Rv2287', 'Rv2899c', 'Rv3275c', 'Rv1837c', 'Rv1285', 'Rv1602', 'Rv2245', 'Rv0423c', 'Rv2208', 
'Rv3801c', 'Rv0436c', 'Rv1492', 'Rv2780', 'Rv1658', 'Rv2674', 'Rv2329c', 'Rv3509c', 'Rv3290c', 'Rv0620', 'Rv1383', 'Rv1850', 'Rv0112', 'Rv2996c', 'Rv3396c', 'Rv2671', 'Rv1653', 'Rv2496c', 'Rv3330', 'Rv1663', 'Rv0820', 'Rv1448c', 'Rv2205c', 'Rv1447c', 'Rv2043c', 'Rv3806c', 'Rv2029c', 'Rv3792', 'Rv3313c', 'Rv1822', 'Rv3470c', 'Rv1609', 'Rv0103c', 'Rv3264c', 'Rv1484', 'Rv3534c', 'Rv1603', 'Rv0803', 'Rv2849c', 'Rv2458', 'Rv3341', 'Rv1617', 'Rv0570', 'Rv3609c', 'Rv1512', 'Rv1672c', 'Rv3759c', 'Rv2201', 'Rv0855', 'Rv3236c', 'Rv1940', 'Rv2163c', 'Rv0374c', 'Rv1563c', 'Rv0375c', 'Rv2455c', 'Rv1310', 'Rv1295', 'Rv2702', 'Rv3356c', 'Rv2445c', 'Rv3273', 'Rv1306', 'Rv3423c', 'Rv2589', 'Rv0317c', 'Rv3303c', 'Rv2439c', 'Rv1304', 'Rv3248c', 'Rv0858c', 'Rv0534c', 'Rv3158', 'Rv1605', 'Rv1127c', 'Rv1692', 'Rv1655', 'Rv2192c', 'Rv1391', 'Rv1122', 'Rv2454c', 'Rv2573', 'Rv2949c', 'Rv2379c', 'Rv1240', 'Rv1098c', 'Rv2196', 'Rv0555', 'Rv1350', 'Rv3302c', 'Rv1309', 'Rv0267', 'Rv0255c', 'Rv2194', 'Rv3051c', 'Rv0524', 'Rv2590', 'Rv2465c', 'Rv0137c', 'Rv1601', 'Rv0437c', 'Rv1554', 'Rv1318c', 'Rv3340', 'Rv1389', 'Rv2934', 'Rv0545c', 'Rv0558', 'Rv1005c', 'Rv1412', 'Rv2127', 'Rv2335', 'Rv2392', 'Rv2981c', 'Rv3709c', 'Rv3754', 'Rv2612c', 'Rv2931', 'Rv0557', 'Rv0252', 'Rv2928', 'Rv1347c', 'Rv1449c', 'Rv1600', 'Rv2207', 'Rv0536', 'Rv0295c', 'Rv3455c', 'Rv3411c', 'Rv0503c', 'Rv3708c', 'Rv0417', 'Rv1704c', 'Rv0489', 'Rv3795', 'Rv0553', 'Rv0573c', 'Rv2498c', 'Rv0470c', 'Rv0773c', 'Rv3048c', 'Rv2933', 'Rv3704c', 'Rv3818', 'Rv1737c', 'Rv0266c', 'Rv0644c', 'Rv2130c', 'Rv3280', 'Rv2400c', 'Rv2182c', 'Rv1018c', 'Rv2158c', 'Rv3149', 'Rv2289', 'Rv0126', 'Rv2967c', 'Rv3214', 'Rv2317', 'Rv3490', 'Rv2881c', 'Rv1916', 'Rv1438', 'Rv1348', 'Rv3808c', 'Rv0334', 'Rv1594', 'Rv3156', 'Rv2072c', 'Rv2344c', 'Rv2210c', 'Rv2281', 'Rv0993', 'Rv2316', 'Rv1406', 'Rv1613', 'Rv1201c', 'Rv0091', 'Rv3318', 'Rv3285', 'Rv0409', 'Rv0974c', 'Rv3316', 'Rv2121c', 'Rv2833c', 'Rv3309c', 'Rv1328', 'Rv3266c', 'Rv1213', 'Rv2697c', 
'Rv0946c', 'Rv1832', 'Rv3793', 'Rv1011', 'Rv3696c', 'Rv2246', 'Rv3858c', 'Rv1620c', 'Rv2386c', 'Rv1618', 'Rv0211', 'Rv1631', 'Rv0512', 'Rv1385', 'Rv1286', 'Rv2122c', 'Rv3319', 'Rv2249c', 'Rv2835c', 'Rv2941', 'Rv2605c', 'Rv2394', 'Rv3145', 'Rv0886', 'Rv3157', 'Rv3737', 'Rv1094', 'Rv1820', 'Rv0500', 'Rv1236', 'Rv1409', 'Rv2136c', 'Rv3800c', 'Rv2984', 'Rv0118c', 'Rv1121', 'Rv2932', 'Rv1826', 'Rv2233', 'Rv1511', 'Rv2540c', 'Rv3838c', 'Rv1305', 'Rv2200c', 'Rv3146', 'Rv0389', 'Rv0848', 'Rv3279c', 'Rv1529', 'Rv0533c', 'Rv0951', 'Rv0884c', 'Rv1308', 'Rv0482', 'Rv2259', 'Rv0727c', 'Rv0261c', 'Rv3281', 'Rv3756c', 'Rv1475c', 'Rv0462', 'Rv1551', 'Rv0805', 'Rv1656', 'Rv2398c', 'Rv1392', 'Rv1170', 'Rv2438c', 'Rv3379c', 'Rv1562c', 'Rv0066c', 'Rv2726c', 'Rv1844c', 'Rv1699', 'Rv2231c', 'Rv3002c', 'Rv3441c', 'Rv1029', 'Rv3001c', 'Rv1415', 'Rv3148', 'Rv2378c', 'Rv2958c', 'Rv2900c', 'Rv0505c', 'Rv3043c', 'Rv1200', 'Rv1185c', 'Rv2531c', 'Rv3772', 'Rv2947c', 'Rv1654', 'Rv0486', 'Rv2436', 'Rv2977c', 'Rv0511', 'Rv1848', 'Rv1237', 'Rv3601c', 'Rv0733', 'Rv3155', 'Rv0753c', 'Rv0694', 'Rv3215', 'Rv3710', 'Rv3606c', 'Rv0254c', 'Rv3315c', 'Rv3713', 'Rv0155', 'Rv0859', 'Rv3317', 'Rv2222c', 'Rv2495c', 'Rv1745c', 'Rv2497c', 'Rv0771', 'Rv1320c', 'Rv3842c', 'Rv1349', 'Rv2773c', 'Rv1164', 'Rv1381', 'Rv0363c', 'Rv2682c', 'Rv3113', 'Rv3154', 'Rv2860c', 'Rv3588c', 'Rv3608c', 'Rv3535c', 'Rv2318', 'Rv0777', 'Rv0729', 'Rv1714', 'Rv0645c', 'Rv0853c', 'Rv0642c', 'Rv0162c', 'Rv2992c', 'Rv3068c', 'Rv1621c', 'Rv0522', 'Rv1568', 'Rv3465', 'Rv2584c', 'Rv2746c', 'Rv0248c', 'Rv0542c', 'Rv2383c', 'Rv0391', 'Rv2332', 'Rv1264', 'Rv2065', 'Rv1652', 'Rv2291', 'Rv1902c', 'Rv1928c', 'Rv0478', 'Rv0467', 'Rv2713', 'Rv1017c', 'Rv0501', 'Rv0422c', 'Rv1023', 'Rv3624c', 'Rv2071c', 'Rv2964', 'Rv0357c', 'Rv2195', 'Rv1647', 'Rv3293', 'Rv2006', 'Rv3826', 'Rv0499', 'Rv0032', 'Rv2988c', 'Rv2155c', 'Rv3372', 'Rv1712', 'Rv1336', 'Rv0306', 'Rv2397c', 'Rv1611', 'Rv3859c', 'Rv0889c', 'Rv2677c', 'Rv0069c', 'Rv2945c', 'Rv1238', 'Rv0408', 
'Rv3276c', 'Rv2361c', 'Rv0510', 'Rv1569', 'Rv2610c', 'Rv0812', 'Rv2962c', 'Rv0649', 'Rv2381c', 'Rv2538c', 'Rv1872c', 'Rv3628', 'Rv0191', 'Rv2920c', 'Rv3777', 'Rv0808', 'Rv2447c', 'Rv2607', 'Rv2611c', 'Rv3790', 'Rv3147', 'Rv2380c', 'Rv2753c', 'Rv1323', 'Rv3255c', 'Rv2202c', 'Rv2443', 'Rv0247c', 'Rv1437', 'Rv3784', 'Rv0156', 'Rv1188', 'Rv0322', 'Rv2957', 'Rv0957', 'Rv3332', 'Rv0157', 'Rv3339c', 'Rv3809c', 'Rv2523c', 'Rv0468', 'Rv0936', 'Rv2501c', 'Rv3106', 'Rv0904c', 'Rv1878', 'Rv1451', 'Rv0794c', 'Rv1885c', 'Rv2524c', 'Rv2153c', 'Rv1606', 'Rv3432c', 'Rv0843', 'Rv0321', 'Rv2883c', 'Rv2334', 'Rv1315', 'Rv0819', 'Rv3667', 'Rv0809', 'Rv1596', 'Rv0084', 'Rv2066', 'Rv2391', 'Rv1319c', 'Rv1163', 'Rv2504c', 'Rv1239c', 'Rv3565', 'Rv2215', 'Rv2855', 'Rv0824c', 'Rv2537c', 'Rv1464', 'Rv2848c', 'Rv2225', 'Rv0070c', 'Rv2241', 'Rv1493', 'Rv1981c', 'Rv2503c', 'Rv2388c', 'Rv1092c', 'Rv1483', 'Rv0788', 'Rv0860']
Out[11]:
uniprot reviewed gene_name kegg refseq pfam description entry_date entry_version seq_date seq_version sequence_file metadata_file
gene
Rv0618 Q79FY4 False galTa mtv:RVBD_0618 WP_003900189.1 PF01087 Probable galactose-1-phosphate uridylyltransfe... 2017-07-05 87 2004-07-05 1 Q79FY4.fasta Q79FY4.xml
Rv0619 Q79FY3 False galTb NaN NaN PF02744 Probable galactose-1-phosphate uridylyltransfe... 2017-07-05 78 2004-07-05 1 Q79FY3.fasta Q79FY3.xml
Rv1755c P9WIA9 False plcD NaN NaN PF04185 Phospholipase C 4 2017-07-05 18 2014-04-16 1 P9WIA9.fasta P9WIA9.xml
Rv2321c P71891 False rocD2 mtv:RVBD_2321c WP_003411956.1 PF00202 Probable ornithine aminotransferase (C-terminu... 2017-07-05 116 1997-02-01 1 P71891.fasta P71891.xml
Rv2322c P71890 False rocD1 mtv:RVBD_2322c WP_003411957.1 PF00202 Probable ornithine aminotransferase (N-terminu... 2017-06-07 117 1997-02-01 1 P71890.fasta P71890.xml
GEMPRO.set_representative_sequence(force_rerun=False)[source]

Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.

Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.

Parameters:
  • force_rerun (bool) – Set to True to recheck stored sequences

If you have mapped with both KEGG and UniProt mappers, then you can set a representative sequence for the gene using this function. If you used just one, this will just set that ID as representative.

  • If any sequences or IDs were provided manually, these will be set as representative first.
  • UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
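The priority rules above can be paraphrased as a small decision function. This is only a sketch of the stated rules, not ssbio's implementation: the function name `choose_representative` and the dictionary shape (an optional `pdbs` list per source) are assumptions made for illustration.

```python
def choose_representative(manual=None, uniprot=None, kegg=None):
    """Pick a sequence source following the priority rules described above.

    Each argument is None (source not available) or a dict that may hold a
    'pdbs' list. Returns 'manual', 'uniprot', 'kegg', or None.
    """
    if manual is not None:
        # Manually provided sequences or IDs always win
        return 'manual'
    if uniprot is not None and kegg is not None:
        # KEGG wins only when it carries PDB IDs and UniProt does not
        if kegg.get('pdbs') and not uniprot.get('pdbs'):
            return 'kegg'
        return 'uniprot'
    if uniprot is not None:
        return 'uniprot'
    if kegg is not None:
        return 'kegg'
    return None


print(choose_representative(uniprot={'pdbs': []}, kegg={'pdbs': ['1GR0']}))
print(choose_representative(uniprot={'pdbs': ['1GR0']}, kegg={'pdbs': ['1GR0']}))
```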
In [12]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()

[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 661/661: number of genes with a representative sequence
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.
Missing a representative sequence:  []
Out[12]:
uniprot kegg pdbs sequence_file metadata_file
gene
Rv0013 I6WX77 mtu:Rv0013 NaN mtu-Rv0013.faa mtu-Rv0013.kegg
Rv0032 I6Y6Q7 mtu:Rv0032 NaN mtu-Rv0032.faa mtu-Rv0032.kegg
Rv0046c I6X8D3 mtu:Rv0046c 1GR0 mtu-Rv0046c.faa mtu-Rv0046c.kegg
Rv0066c L0T2B7 mtu:Rv0066c NaN mtu-Rv0066c.faa mtu-Rv0066c.kegg
Rv0069c A0A089S0Y8 mtu:Rv0069c NaN mtu-Rv0069c.faa mtu-Rv0069c.kegg
Mapping representative sequence –> structure

There are four ways to map a sequence to a structure:

  1. Use the UniProt ID and its automatic mappings to the PDB
  2. BLAST the sequence against the PDB
  3. Make homology models
  4. Map to existing homology models

Option 1 is only available when the representative sequence has a mapped UniProt ID. If it does not, you will have to BLAST your sequence against the PDB or make a homology model. You can also run both the UniProt mapping and BLAST for maximum coverage.

Methods
GEMPRO.map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]

Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.

The “Best structures” API is documented at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html and returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the same, by resolution.

Parameters:
  • seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
  • outdir (str) – Output directory to cache JSON results of search
  • force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns:

A rank-ordered list of PDBProp objects that map to the UniProt ID

Return type:

list
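The ordering described above (coverage first, then resolution) together with an identity cutoff can be sketched with plain dictionaries. The record fields, the placeholder IDs, and the function `rank_structures` are hypothetical; the real API response and the PDBProp objects carry more information.

```python
def rank_structures(hits, seq_ident_cutoff=0.0):
    """Drop hits below the identity cutoff, then sort by coverage (high first)
    and, for equal coverage, by resolution (low values first)."""
    kept = [h for h in hits if h['seq_identity'] >= seq_ident_cutoff]
    return sorted(kept, key=lambda h: (-h['coverage'], h['resolution']))


# Made-up records for illustration only
hits = [
    {'pdb_id': 'aaaa', 'coverage': 0.90, 'resolution': 2.5, 'seq_identity': 1.00},
    {'pdb_id': 'bbbb', 'coverage': 1.00, 'resolution': 3.0, 'seq_identity': 0.95},
    {'pdb_id': 'cccc', 'coverage': 1.00, 'resolution': 1.8, 'seq_identity': 0.98},
    {'pdb_id': 'dddd', 'coverage': 1.00, 'resolution': 1.5, 'seq_identity': 0.20},
]
ranked = rank_structures(hits, seq_ident_cutoff=0.3)
ranked_ids = [h['pdb_id'] for h in ranked]
print(ranked_ids)
```

Here `dddd` is removed by the cutoff even though it has the best resolution, and the two full-coverage hits are ordered by resolution.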

In [13]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 18:15] [root] INFO: getUserAgent: Begin
[2018-02-05 18:15] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:15] [root] INFO: getUserAgent: End
[2018-02-05 18:15] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:15] [root] WARNING: Results seems empty...returning empty dictionary.

[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 0/661: number of genes with at least one experimental structure
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
[2018-02-05 18:15] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[13]:
GEMPRO.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]

BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).

Parameters:
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
  • display_link (bool, optional) – Set to True if links to the HTML results should be displayed
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
In [14]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)

[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 141: number of genes with additional structures added from BLAST
Out[14]:
pdb_id pdb_chain_id hit_score hit_evalue hit_percent_similar hit_percent_ident hit_num_ident hit_num_similar
gene
Rv0046c 1gr0 A 1861.0 0.0 1.000000 1.000000 367 367
Rv0066c 5kvu D 3828.0 0.0 0.981208 0.981208 731 731
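To make the effect of the two cutoffs concrete, the following sketch filters a list of hits by E-value and sequence identity. It reuses the column names shown in df_pdb_blast; the third record and the function `filter_blast_hits` are hypothetical, and whether ssbio treats the thresholds as inclusive is an assumption here.

```python
def filter_blast_hits(hits, seq_ident_cutoff=0.0, evalue=0.0001):
    """Keep hits at or below the E-value threshold and at or above the identity cutoff."""
    return [h for h in hits
            if h['hit_evalue'] <= evalue and h['hit_percent_ident'] >= seq_ident_cutoff]


hits = [
    # First two rows copied from the table above
    {'gene': 'Rv0046c', 'pdb_id': '1gr0', 'hit_evalue': 0.0, 'hit_percent_ident': 1.000000},
    {'gene': 'Rv0066c', 'pdb_id': '5kvu', 'hit_evalue': 0.0, 'hit_percent_ident': 0.981208},
    # Made-up weak hit for illustration
    {'gene': 'geneX', 'pdb_id': 'xxxx', 'hit_evalue': 0.01, 'hit_percent_ident': 0.35},
]
kept = filter_blast_hits(hits, seq_ident_cutoff=0.9, evalue=0.00001)
kept_ids = [h['pdb_id'] for h in kept]
print(kept_ids)
```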
GEMPRO.get_itasser_models(homology_raw_dir, custom_itasser_name_mapping=None, outdir=None, force_rerun=False)[source]

Copy generated I-TASSER models from a directory to the GEM-PRO directory.

Parameters:
  • homology_raw_dir (str) – Root directory of I-TASSER folders.
  • custom_itasser_name_mapping (dict) – Use this if your I-TASSER folder names differ from your model gene names. Input a dict of {model_gene: ITASSER_folder}.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
In [15]:
tb_homology_dir = '/home/nathan/projects_archive/homology_models/MTUBERCULOSIS/'

##### EXAMPLE SPECIFIC CODE #####
# Needed to map to older IDs used in this example
import pandas as pd
import os.path as op
old_gene_to_homology = pd.read_csv(op.join(tb_homology_dir, 'data/161031-old_gene_to_uniprot_mapping.csv'))
gene_to_uniprot = old_gene_to_homology.set_index('m_gene').to_dict()['u_uniprot_acc']
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'), custom_itasser_name_mapping=gene_to_uniprot)
### END EXAMPLE SPECIFIC CODE ###

# Organizing I-TASSER homology models
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'))
my_gempro.df_homology_models.head()

[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Completed copying of 435 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.

[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Completed copying of 9 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.
Out[15]:
id structure_file model_date difficulty top_template_pdb top_template_chain c_score tm_score tm_score_err rmsd rmsd_err
gene
Rv0013 P9WN35 P9WN35_model1.pdb 2018-02-06 easy 1i7s B -0.53 0.65 0.13 6.8 4.0
Rv0032 P9WQ85 P9WQ85_model1.pdb 2018-02-06 easy 3a2b A -2.89 0.39 0.13 15.7 3.3
Rv0066c O53611 O53611_model1.pdb 2018-02-06 easy 1itw A 1.91 0.99 0.04 4.1 2.8
Rv0069c P9WGT5 P9WGT5_model1.pdb 2018-02-06 easy 4rqo A 1.18 0.88 0.07 4.6 3.0
Rv0070c P9WGI7 P9WGI7_model1.pdb 2018-02-06 easy 3h7f B 1.80 0.97 0.05 3.3 2.3
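The c_score column can be used for a quick quality screen. A commonly quoted I-TASSER guideline is that models with a C-score above -1.5 are likely to have the correct overall fold; treating that value as a cutoff is an assumption of this sketch, not something ssbio applies for you. The rows are copied from the table above and the function `confident_models` is hypothetical.

```python
# (gene, c_score) pairs copied from the df_homology_models output above
models = [
    ('Rv0013', -0.53),
    ('Rv0032', -2.89),
    ('Rv0066c', 1.91),
    ('Rv0069c', 1.18),
    ('Rv0070c', 1.80),
]


def confident_models(rows, c_score_cutoff=-1.5):
    """Return the genes whose model C-score is above the cutoff."""
    return [gene for gene, c_score in rows if c_score > c_score_cutoff]


confident = confident_models(models)
print(confident)
```

With a live dataframe, an equivalent filter would be a boolean selection on the `c_score` column.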
GEMPRO.get_manual_homology_models(input_dict, outdir=None, clean=True, force_rerun=False)[source]

Copy homology models to the GEM-PRO project.

Requires an input of a dictionary formatted like so:

{
    model_gene: {
                    homology_model_id1: {
                                            'model_file': '/path/to/homology/model.pdb',
                                            'file_type': 'pdb',
                                            'additional_info': info_value
                                        },
                    homology_model_id2: {
                                            'model_file': '/path/to/homology/model.pdb',
                                            'file_type': 'pdb'
                                        }
                }
}
Parameters:
  • input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • clean (bool) – If homology files should be cleaned and saved as a new PDB file
  • force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
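A nested dictionary of this shape is tedious to write by hand for many models. The sketch below builds one from a flat list of (gene, model ID, path) tuples; the helper `build_homology_input`, the gene and model IDs, and the paths are all hypothetical, and the file type is simply guessed from the file extension.

```python
import os.path as op


def build_homology_input(model_files):
    """Build {gene: {model_id: {'model_file': ..., 'file_type': ...}}}.

    model_files is an iterable of (gene_id, model_id, path) tuples.
    """
    input_dict = {}
    for gene_id, model_id, path in model_files:
        # Guess the file type from the extension, e.g. '.pdb' -> 'pdb'
        file_type = op.splitext(path)[1].lstrip('.').lower()
        input_dict.setdefault(gene_id, {})[model_id] = {
            'model_file': path,
            'file_type': file_type,
        }
    return input_dict


# Made-up entries for illustration only
model_files = [
    ('geneA', 'model1', '/path/to/geneA_model1.pdb'),
    ('geneA', 'model2', '/path/to/geneA_model2.pdb'),
    ('geneB', 'model1', '/path/to/geneB_model1.PDB'),
]
homology_input = build_homology_input(model_files)
print(sorted(homology_input))
```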
In [16]:
homology_model_dict = {}
my_gempro.get_manual_homology_models(homology_model_dict)

[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Updated homology model information for 0 genes.
Downloading and ranking structures
Methods
GEMPRO.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to each protein’s structures directory.

Parameters:
  • outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
  • pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
  • force_rerun (bool) – If files should be re-downloaded if they already exist
Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [ ]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
GEMPRO.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]

Set a representative structure for each protein, chosen from the structures in its structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

  • Homology model(s)
  • Ranked PDBs
  • BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
  • struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
  • clean (bool) – If structures should be cleaned
  • force_rerun (bool) – If sequence to structure alignment should be rerun

Todo

  • Remedy large structure representative setting
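The allow_missing_on_termini example in the parameter list can be worked through numerically. This sketch reads the docstring as trimming half of the given fraction from each end of the reference sequence; that reading, the function name `checked_window`, and the 367-residue example length are assumptions, and ssbio's own rounding may differ.

```python
def checked_window(seq_length, allow_missing_on_termini):
    """Return (first, last) residue positions that are still checked for modifications.

    Half of the ignored fraction is removed from each terminus.
    """
    trim = int(seq_length * allow_missing_on_termini / 2)
    return trim, seq_length - trim


# Docstring example: fraction 0.1 on a 100 AA sequence
print(checked_window(100, 0.1))

# Default fraction of 0.2 on a hypothetical 367 AA sequence
print(checked_window(367, 0.2))
```

A larger fraction tolerates more missing residues at the ends, which is useful for structures with disordered or truncated termini.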
In [17]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2018-02-05 18:16] [ssbio.core.protein] WARNING: Rv0234c: no structures meet quality checks
[2018-02-05 18:16] [ssbio.core.protein] WARNING: Rv0505c: no structures meet quality checks
[2018-02-05 18:17] [ssbio.core.protein] WARNING: Rv2987c: no structures meet quality checks
[2018-02-05 18:18] [ssbio.core.protein] WARNING: Rv2498c: no structures meet quality checks
[2018-02-05 18:18] [ssbio.core.protein] WARNING: Rv3601c: no structures meet quality checks

[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: 553/661: number of genes with a representative structure
[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[17]:
id is_experimental file_type structure_file
gene
Rv0013 REP-P9WN35 False pdb P9WN35_model1-X_clean.pdb
Rv0032 REP-P9WQ85 False pdb P9WQ85_model1-X_clean.pdb
Rv0046c REP-1gr0 True pdb 1gr0-A_clean.pdb
Rv0066c REP-5kvu True pdb 5kvu-A_clean.pdb
Rv0069c REP-P9WGT5 False pdb P9WGT5_model1-X_clean.pdb
In [18]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure.get_dict()
Out[18]:
<StructProp REP-2d1f at 0x7fdec937e4a8>
Out[18]:
{'_structure_dir': '/tmp/mtuberculosis_gp/genes/Rv1295/Rv1295_protein/structures',
 'chains': [<ChainProp A at 0x7fdec8655630>],
 'date': None,
 'description': 'Threonine synthase (E.C.4.2.3.1)',
 'file_type': 'pdb',
 'id': 'REP-2d1f',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'notes': {},
 'original_structure_id': '2d1f',
 'resolution': 2.5,
 'structure_file': '2d1f-A_clean.pdb',
 'taxonomy_name': 'Mycobacterium tuberculosis'}
Creating homology models

For those proteins with no representative structure, we can create homology models. ssbio contains built-in functions for running I-TASSER locally or on machines with SLURM (e.g., on NERSC) or Torque job scheduling.

Once the I-TASSER runs complete, you can load the resulting models with get_itasser_models.

Info: Homology modeling can take a long time, about 24-72 hours per protein (highly dependent on the sequence length and on whether templates are available).

Methods
In [19]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_modeling('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2934: I-TASSER modeling will not run as sequence length (1827) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2932: I-TASSER modeling will not run as sequence length (1538) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2933: I-TASSER modeling will not run as sequence length (2188) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2931: I-TASSER modeling will not run as sequence length (1876) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2380c: I-TASSER modeling will not run as sequence length (1682) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3859c: I-TASSER modeling will not run as sequence length (1527) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2476c: I-TASSER modeling will not run as sequence length (1624) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3800c: I-TASSER modeling will not run as sequence length (1733) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv0107c: I-TASSER modeling will not run as sequence length (1632) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2940c: I-TASSER modeling will not run as sequence length (2111) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv1662: I-TASSER modeling will not run as sequence length (1602) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2524c: I-TASSER modeling will not run as sequence length (3069) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 108 genes in folder /tmp/mtuberculosis_gp/data/homology_models
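
The warnings above show that sequences outside the length range [10, 1500] are skipped. A rough sketch of that kind of pre-filter is given below. The function name is hypothetical and the bounds are treated as inclusive, which is an assumption based on the bracket notation in the log, not a statement about ssbio's code.

```python
MIN_LEN, MAX_LEN = 10, 1500  # range reported in the warnings above


def split_by_modelable_length(seq_lengths, min_len=MIN_LEN, max_len=MAX_LEN):
    """Partition gene IDs into (modelable, skipped) by sequence length."""
    modelable, skipped = [], []
    for gene_id, length in seq_lengths.items():
        # Bounds are treated as inclusive here (an assumption)
        if min_len <= length <= max_len:
            modelable.append(gene_id)
        else:
            skipped.append(gene_id)
    return modelable, skipped


# Lengths for Rv2934 and Rv2932 come from the log; 'geneA' is a made-up entry
lengths = {'Rv2934': 1827, 'Rv2932': 1538, 'geneA': 350}
modelable, skipped = split_by_modelable_length(lengths)
print(modelable, skipped)
```

Running a check like this before preparing folders gives a quick count of how many genes will need another modeling route.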
Saving your GEM-PRO

Finally, you can save your GEM-PRO as a JSON or pickle file, so you don’t have to run the pipeline again.

For most functions, if you rerun them, they will check for existing results saved as files. The only step that takes a long time on a rerun is setting the representative structure, since each structure is rechecked and cleaned. This is where saving helps!

Warning: Saving in JSON format is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.

GEMPRO.save_pickle(outfile, protocol=2)

Save the object as a pickle file

Parameters:
  • outfile (str) – Filename
  • protocol (int) – Pickle protocol to use. Default is 2 to remain compatible with Python 2
Returns:

Path to pickle file

Return type:

str

In [20]:
import os.path as op
my_gempro.save_pickle(op.join(my_gempro.model_dir, '{}.pckl'.format(my_gempro.id)))
GEMPRO.save_json(outfile, compression=False)

Save the object as a JSON file using json_tricks

In [21]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2018-02-05 18:18] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:18] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: mtuberculosis_gp) to /tmp/mtuberculosis_gp/model/mtuberculosis_gp.json
Loading a saved GEM-PRO
In [ ]:
# Loading a pickle file
import pickle
with open('/tmp/mtuberculosis_gp/model/mtuberculosis_gp.pckl', 'rb') as f:
    my_saved_gempro = pickle.load(f)
In [ ]:
# Loading a JSON file
import ssbio.core.io
my_saved_gempro = ssbio.core.io.load_json('/tmp/mtuberculosis_gp/model/mtuberculosis_gp.json', decompression=False)
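
The pickle round trip can be tried without a GEM-PRO object. The sketch below uses a plain dictionary as a stand-in for the real GEMPRO instance and a temporary directory in place of model_dir. Both are assumptions made for illustration. protocol=2 mirrors the default of save_pickle.

```python
import os.path as op
import pickle
import tempfile

# Stand-in for a GEM-PRO object; the keys here are illustrative only
fake_gempro = {'id': 'mtuberculosis_gp', 'genes': ['Rv1295', 'Rv0013']}

with tempfile.TemporaryDirectory() as model_dir:
    outfile = op.join(model_dir, '{}.pckl'.format(fake_gempro['id']))

    # Protocol 2 keeps the file readable from Python 2
    with open(outfile, 'wb') as f:
        pickle.dump(fake_gempro, f, protocol=2)

    # Reload the object from disk, as in the loading cell above
    with open(outfile, 'rb') as f:
        reloaded = pickle.load(f)

print(reloaded == fake_gempro)  # True
```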

Features

  • Automated mapping of gene/protein sequence IDs
  • Consolidating sequence IDs and setting a representative protein sequence
  • Mapping of representative protein sequence –> 3D structures
  • Preparation of sequences for homology modeling (currently for I-TASSER)
  • Running QC/QA on structures and setting a representative protein structure
  • Automation of protein sequence and structure property calculation
  • Creation of Pandas DataFrame summaries directly from downloaded or calculated metadata

COBRApy model additions

The GEM-PRO Class

Let’s take a look at a GEM loaded with ssbio and what additions exist compared to a GEM loaded with COBRApy. In the figure above, the text in grey indicates objects that exist in a COBRApy Model object, and in blue, the attributes added when loading with ssbio. Please note that the Complex object is still under development and currently non-functional.

COBRApy

Under construction…

ssbio

Under construction…

Use cases

Uses of a GEM-PRO

When would you create or use a GEM-PRO? The added context of manually curated network interactions to protein structures enables different scales of analyses. For instance…

From the “top-down”:
  • Global non-variant properties of protein structures such as the distribution of fold types can be compared within or between organisms [1], [2], [3], elucidating adaptations that are reflected in the structural proteome.
  • Multi-strain modeling techniques ([10], [11], [12]) would allow strain-specific changes to be investigated at the molecular level, potentially explaining phenotypic differences or strain adaptations to certain environments.
From the “bottom-up”:
  • Structural properties predicted from sequence or calculated from structure can be utilized to enhance model predictive capabilities [4], [5], [6], [7], [8], [9].

File organization

Files such as sequences, structures, alignment files, and property calculation outputs can optionally be cached on a user’s disk to minimize calls to web services, limit recalculations, and provide direct inputs to common sequence and structure algorithms which often require local copies of the data. For a GEM-PRO project, files are organized in the following fashion once a root directory and project name are set:

<ROOT_DIR>
└── <PROJECT_NAME>
      ├── data  # General directory for pipeline outputs
      ├── model  # SBML and GEM-PRO models are stored in this directory
      └── genes  # Per gene information
            └── <gene_id1>  # Specific gene directory
                  └── <protein_id1>  # Protein directory
                        ├── sequences  # Protein sequence files, alignments, etc.
                        └── structures  # Protein structure files, calculations, etc.
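
As a rough illustration of the layout above, the following sketch builds the same folder tree with pathlib inside a temporary directory. The function name make_project_dirs and the project name are hypothetical. ssbio creates these folders itself when root_dir is set, so this is only meant to make the structure concrete.

```python
import tempfile
from pathlib import Path


def make_project_dirs(root_dir, project_name, gene_to_protein):
    """Create the folder layout shown above and return the project path."""
    base = Path(root_dir) / project_name
    # Top-level folders of the project
    for sub in ('data', 'model', 'genes'):
        (base / sub).mkdir(parents=True, exist_ok=True)
    # One gene folder per gene, holding a protein folder with two subfolders
    for gene_id, protein_id in gene_to_protein.items():
        protein_dir = base / 'genes' / gene_id / protein_id
        for sub in ('sequences', 'structures'):
            (protein_dir / sub).mkdir(parents=True, exist_ok=True)
    return base


with tempfile.TemporaryDirectory() as root:
    base = make_project_dirs(root, 'my_project', {'Rv1295': 'Rv1295_protein'})
    created = sorted(p.relative_to(base).as_posix()
                     for p in base.rglob('*') if p.is_dir())

print(created)
```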

API

GEMPRO
class ssbio.pipeline.gempro.GEMPRO(gem_name, root_dir=None, pdb_file_type='mmtf', gem=None, gem_file_path=None, gem_file_type=None, genes_list=None, genes_and_sequences=None, genome_path=None, write_protein_fasta_files=True, description=None, custom_spont_id=None)[source]

Generic class to represent all information for a GEM-PRO project.

Initialize the GEM-PRO project with a genome-scale model, a list of genes, or a dict of genes and sequences. Specify the name of your project, along with the root directory where a folder with that name will be created.

Main methods provided are:

  1. Automated mapping of sequence IDs

    • With KEGG mapper
    • With UniProt mapper
    • Allowing manual gene ID –> protein sequence entry
    • Allowing manual gene ID –> UniProt ID
  2. Consolidating sequence IDs and setting a representative sequence

    • Currently these are set based on available PDB IDs
  3. Mapping of representative sequence –> structures

    • With UniProt –> ranking of PDB structures
    • BLAST representative sequence –> PDB database
  4. Preparation of files for homology modeling (currently for I-TASSER)

    • Mapping to existing models
    • Preparation for running I-TASSER
    • Parsing I-TASSER runs
  5. Running QC/QA on structures and setting a representative structure

    • Various cutoffs (mutations, insertions, deletions) can be set to filter structures
  6. Automation of protein sequence and structure property calculation

  7. Creation of Pandas DataFrame summaries directly from downloaded metadata

Parameters:
  • gem_name (str) – The name of your GEM or just your project in general. This will be the name of the main folder that is created in root_dir.
  • root_dir (str) – Path to where the folder named after gem_name will be created. If not provided, directories will not be created and output directories need to be specified for some steps.
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • gem (Model) – COBRApy Model object
  • gem_file_path (str) – Path to GEM file
  • gem_file_type (str) – GEM model type - sbml (or xml), mat, or json formats
  • genes_list (list) – List of gene IDs that you want to map
  • genes_and_sequences (dict) – Dictionary of gene IDs and their amino acid sequence strings
  • genome_path (str) – FASTA file of all protein sequences
  • write_protein_fasta_files (bool) – If individual protein FASTA files should be written out
  • description (str) – Description string of your project
  • custom_spont_id (str) – ID of spontaneous genes in a COBRA model which will be ignored for analysis
add_gene_ids(genes_list)[source]

Add gene IDs manually into the GEM-PRO project.

Parameters:genes_list (list) – List of gene IDs as strings.
base_dir

str – GEM-PRO project folder.

blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]

BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).

Parameters:
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
  • display_link (bool, optional) – Set to True if links to the HTML results should be displayed
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
custom_spont_id = None

str – ID of spontaneous genes in a COBRA model which will be ignored for analysis

data_dir

str – Directory where all data are stored.

df_homology_models

DataFrame – Get a dataframe of I-TASSER homology model results

df_kegg_metadata

DataFrame – Pandas DataFrame of KEGG metadata per protein.

df_pdb_blast

DataFrame – Get a dataframe of PDB BLAST results

df_pdb_metadata

DataFrame – Get a dataframe of PDB metadata (PDBs have to be downloaded first).

df_pdb_ranking

DataFrame – Get a dataframe of UniProt -> best structure in PDB results

df_proteins

DataFrame – Get a summary dataframe of all proteins in the project.

df_representative_sequences

DataFrame – Pandas DataFrame of representative sequence information per protein.

df_representative_structures

DataFrame – Get a dataframe of representative protein structure information.

df_uniprot_metadata

DataFrame – Pandas DataFrame of UniProt metadata per protein.

find_disulfide_bridges(representatives_only=True)[source]

Run Biopython’s disulfide bridge finder and store found bridges.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']

Parameters:representatives_only (bool) – If analysis should only be run on the representative structure
find_disulfide_bridges_parallelize(sc, representatives_only=True)[source]

Run Biopython’s disulfide bridge finder and store found bridges.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']

Parameters:
  • sc (SparkContext) – Spark Context to parallelize this function
  • representatives_only (bool) – If analysis should only be run on the representative structure
functional_genes

DictList – All genes with a representative protein structure.

genes = None

DictList – All protein-coding genes in this GEM-PRO project

genes_dir

str – Directory where all gene specific information is stored.

genes_with_a_representative_sequence

DictList – All genes with a representative sequence.

genes_with_a_representative_structure

DictList – All genes with a representative protein structure.

genes_with_experimental_structures

DictList – All genes that have at least one experimental structure.

genes_with_homology_models

DictList – All genes that have at least one homology model.

genes_with_structures

DictList – All genes with any mapped protein structures.

genome_path = None

str – Simple link to the filepath of the FASTA file containing all protein sequences

get_dssp_annotations(representatives_only=True, force_rerun=False)[source]

Run DSSP on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']

Parameters:
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_dssp_annotations_parallelize(sc, representatives_only=True, force_rerun=False)[source]

Run DSSP on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']

Parameters:
  • sc (SparkContext) – Spark Context to parallelize this function
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_freesasa_annotations(include_hetatms=False, representatives_only=True, force_rerun=False)[source]

Run freesasa on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-freesasa']

Parameters:
  • include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_freesasa_annotations_parallelize(sc, include_hetatms=False, representatives_only=True, force_rerun=False)[source]

Run freesasa on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-freesasa']

Parameters:
  • include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
  • sc (SparkContext) – Spark Context to parallelize this function
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_itasser_models(homology_raw_dir, custom_itasser_name_mapping=None, outdir=None, force_rerun=False)[source]

Copy generated I-TASSER models from a directory to the GEM-PRO directory.

Parameters:
  • homology_raw_dir (str) – Root directory of I-TASSER folders.
  • custom_itasser_name_mapping (dict) – Use this if your I-TASSER folder names differ from your model gene names. Input a dict of {model_gene: ITASSER_folder}.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
get_manual_homology_models(input_dict, outdir=None, clean=True, force_rerun=False)[source]

Copy homology models to the GEM-PRO project.

Requires an input of a dictionary formatted like so:

{
    model_gene: {
                    homology_model_id1: {
                                            'model_file': '/path/to/homology/model.pdb',
                                            'file_type': 'pdb',
                                            'additional_info': info_value
                                        },
                    homology_model_id2: {
                                            'model_file': '/path/to/homology/model.pdb',
                                            'file_type': 'pdb'
                                        }
                }
}
Parameters:
  • input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • clean (bool) – If homology files should be cleaned and saved as a new PDB file
  • force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
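
Before passing a dictionary like the one above to get_manual_homology_models, it can help to check that every entry is complete. The sketch below is a hypothetical pre-check, not part of ssbio. It assumes that 'model_file' and 'file_type' are the keys every model entry needs, since those are the two keys present in both example entries.

```python
REQUIRED_KEYS = {'model_file', 'file_type'}  # keys shown in both example entries


def check_homology_input(input_dict):
    """Return a list of (gene, model_id, missing_keys) for incomplete entries."""
    problems = []
    for gene_id, models in input_dict.items():
        for model_id, info in models.items():
            missing = REQUIRED_KEYS - set(info)
            if missing:
                problems.append((gene_id, model_id, sorted(missing)))
    return problems


# Gene ID, model IDs, and paths below are placeholders
input_dict = {
    'gene1': {
        'model_a': {'model_file': '/path/to/homology/model.pdb', 'file_type': 'pdb'},
        'model_b': {'model_file': '/path/to/homology/model.pdb'},
    }
}
print(check_homology_input(input_dict))
```

An empty list means every model entry carries both keys.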
get_msms_annotations(representatives_only=True, force_rerun=False)[source]

Run MSMS on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-msms']

Parameters:
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_msms_annotations_parallelize(sc, representatives_only=True, force_rerun=False)[source]

Run MSMS on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-msms']

Parameters:
  • sc (SparkContext) – Spark Context to parallelize this function
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_scratch_predictions(path_to_scratch, results_dir, scratch_basename='scratch', num_cores=1, exposed_buried_cutoff=25, custom_gene_mapping=None)[source]

Run and parse SCRATCH results to predict secondary structure and solvent accessibility. Annotations are stored in the protein’s representative sequence at:

  • .annotations
  • .letter_annotations
Parameters:
  • path_to_scratch (str) – Path to SCRATCH executable
  • results_dir (str) – Path to SCRATCH results folder, which will have the files (scratch.ss, scratch.ss8, scratch.acc, scratch.acc20)
  • scratch_basename (str) – Basename of the SCRATCH results (‘scratch’ is default)
  • num_cores (int) – Number of cores to use to parallelize SCRATCH run
  • exposed_buried_cutoff (int) – Cutoff of exposed/buried for the acc20 predictions
  • custom_gene_mapping (dict) – Default parsing of SCRATCH output files is to look for the model gene IDs. If your output files contain IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
get_sequence_properties(representatives_only=True)[source]

Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of all protein sequences. Results are stored in the protein’s respective SeqProp objects at .annotations

Parameters:representatives_only (bool) – If analysis should only be run on the representative sequences
get_tmhmm_predictions(tmhmm_results, custom_gene_mapping=None)[source]

Parse TMHMM results and store in the representative sequences.

This is a basic function to parse pre-run TMHMM results. Run TMHMM from the web service (http://www.cbs.dtu.dk/services/TMHMM/) by doing the following:

  1. Write all representative sequences in the GEM-PRO using the function write_representative_sequences_file
  2. Upload the file to http://www.cbs.dtu.dk/services/TMHMM/ and choose “Extensive, no graphics” as the output
  3. Copy and paste the results (ignoring the top header and above “HELP with output formats”) into a file and save it
  4. Run this function on that file
Parameters:
  • tmhmm_results (str) – Path to TMHMM results (long format)
  • custom_gene_mapping (dict) – Default parsing of TMHMM output is to look for the model gene IDs. If your output file contains IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
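
The direction of custom_gene_mapping is easy to get backwards: keys are model gene IDs and values are the IDs found in the results file. A minimal sketch of the lookup is shown below. The helper name and the result-file ID 'seq_0001' are made up, and falling back to the gene ID when no mapping is given is an assumption based on the "default parsing" described above.

```python
def result_file_id(gene_id, custom_gene_mapping=None):
    """Return the ID to look for in a results file for a given model gene."""
    mapping = custom_gene_mapping or {}
    # Fall back to the model gene ID itself when no custom mapping exists
    return mapping.get(gene_id, gene_id)


# Keys are model gene IDs; the value is a made-up results-file ID
custom_gene_mapping = {'Rv1295': 'seq_0001'}
ids = [result_file_id(g, custom_gene_mapping) for g in ('Rv1295', 'Rv0013')]
print(ids)  # ['seq_0001', 'Rv0013']
```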
kegg_mapping_and_metadata(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:
  1. Downloads all metadata and sequence files to the sequences directory
  2. Creates a KEGGProp object in the protein.sequences attribute
  3. Returns a Pandas DataFrame of mapping results
Parameters:
  • kegg_organism_code (str) – The three-letter KEGG code of your organism
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
kegg_mapping_and_metadata_parallelize(sc, kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:
  1. Downloads all metadata and sequence files to the sequences directory
  2. Creates a KEGGProp object in the protein.sequences attribute
  3. Returns a Pandas DataFrame of mapping results
Parameters:
  • sc (SparkContext) – Spark Context to parallelize this function
  • kegg_organism_code (str) – The three-letter KEGG code of your organism
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
load_cobra_model(model)[source]

Load a COBRApy Model object into the GEM-PRO project.

Parameters:model (Model) – COBRApy Model object
manual_seq_mapping(gene_to_seq_dict, outdir=None, write_fasta_files=True, set_as_representative=True)[source]

Read a manual input dictionary of model gene IDs –> protein sequences. By default sets them as representative.

Parameters:
  • gene_to_seq_dict (dict) – Mapping of gene IDs to their protein sequence strings
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • write_fasta_files (bool) – If individual protein FASTA files should be written out
  • set_as_representative (bool) – If mapped sequences should be set as representative
manual_uniprot_mapping(gene_to_uniprot_dict, outdir=None, set_as_representative=True)[source]

Read a manual dictionary of model gene IDs –> UniProt IDs. By default sets them as representative.

This allows for mapping of the missing genes, or overriding of automatic mappings.

Input a dictionary of:

{
    <gene_id1>: <uniprot_id1>,
    <gene_id2>: <uniprot_id2>,
}
Parameters:
  • gene_to_uniprot_dict (dict) – Dictionary of mappings as shown above
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
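
A manual mapping dictionary can be sanity-checked before it is passed to manual_uniprot_mapping. The sketch below is a hypothetical pre-check, not part of ssbio. It uses the accession pattern published in the UniProt help pages. The Rv0013 and Rv0032 pairings are read off the representative structure table in the tutorial (where the homology models are named after these IDs) and are illustrative only.

```python
import re

# Accession pattern as published in the UniProt help pages
UNIPROT_ACCESSION = re.compile(
    r'^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]'
    r'|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$'
)


def genes_with_bad_accessions(gene_to_uniprot_dict):
    """Return gene IDs whose mapped value does not look like a UniProt accession."""
    return [gene for gene, acc in gene_to_uniprot_dict.items()
            if not UNIPROT_ACCESSION.match(acc)]


gene_to_uniprot_dict = {
    'Rv0013': 'P9WN35',
    'Rv0032': 'P9WQ85',
    'geneX': 'not_an_id',  # made-up entry that should be flagged
}
print(genes_with_bad_accessions(gene_to_uniprot_dict))  # ['geneX']
```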
map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]

Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.

The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html. It returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the same, resolution.

Parameters:
  • seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
  • outdir (str) – Output directory to cache JSON results of search
  • force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns:

A rank-ordered list of PDBProp objects that map to the UniProt ID

Return type:

list
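
The ranking rule described above (coverage first, resolution as the tie-breaker) can be sketched as a two-key sort. The hit dictionaries, their field names, and the IDs are made up for illustration, and the sketch assumes that a lower resolution value ranks higher. It does not call the PDBe API or reproduce ssbio's code.

```python
def rank_structures(hits):
    """Sort hits by coverage (highest first), then resolution (lowest first)."""
    return sorted(hits, key=lambda h: (-h['coverage'], h['resolution']))


# Made-up hits; 'bbbb' and 'cccc' tie on coverage and differ in resolution
hits = [
    {'pdb_id': 'aaaa', 'coverage': 0.80, 'resolution': 1.9},
    {'pdb_id': 'bbbb', 'coverage': 0.95, 'resolution': 2.5},
    {'pdb_id': 'cccc', 'coverage': 0.95, 'resolution': 1.7},
]
ranked_ids = [h['pdb_id'] for h in rank_structures(hits)]
print(ranked_ids)  # ['cccc', 'bbbb', 'aaaa']
```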

missing_homology_models

list – List of genes with no mapping to any homology models.

missing_kegg_mapping

list – List of genes with no mapping to KEGG.

missing_pdb_structures

list – List of genes with no mapping to any experimental PDB structure.

missing_representative_sequence

list – List of genes with no mapping to a representative sequence.

missing_representative_structure

list – List of genes with no mapping to a representative structure.

missing_uniprot_mapping

list – List of genes with no mapping to UniProt.

model = None

Model – COBRApy model object

model_dir

str – Directory where original GEMs and GEM-related files are stored.

pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to each protein’s structures directory.

Parameters:
  • outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
  • pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
  • force_rerun (bool) – If files should be re-downloaded if they already exist
pdb_file_type = None

str – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB

prep_itasser_modeling(itasser_installation, itlib_folder, runtype, create_in_dir=None, execute_from_dir=None, all_genes=False, print_exec=False, **kwargs)[source]

Prepare to run I-TASSER homology modeling for genes without structures, or all genes.

Parameters:
  • itasser_installation (str) – Path to I-TASSER folder, e.g. ~/software/I-TASSER4.4
  • itlib_folder (str) – Path to ITLIB folder, e.g. ~/software/ITLIB
  • runtype (str) – How you will be running I-TASSER - local, slurm, or torque
  • create_in_dir (str) – Local directory where folders will be created
  • execute_from_dir (str) – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
  • all_genes (bool) – If all genes should be prepped, or only those without any mapped structures
  • print_exec (bool) – If the execution statement to run modeling should be printed

Todo

  • Document kwargs - extra options for I-TASSER, SLURM or Torque execution
  • Allow modeling of any sequence in sequences attribute, select by ID or provide SeqProp?
root_dir

str – Directory where GEM-PRO project folder named after the attribute base_dir is located.

set_representative_sequence(force_rerun=False)[source]

Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.

Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.

Parameters:force_rerun (bool) – Set to True to recheck stored sequences
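
The precedence rules above can be written out as a small decision function. This is a sketch of the stated rules only, not ssbio's implementation. The function name and the shape of the inputs (None for "no mapping", otherwise a dict with a 'pdbs' list) are hypothetical.

```python
def pick_representative_source(has_manual, uniprot, kegg):
    """Return 'manual', 'uniprot', 'kegg', or None.

    uniprot / kegg are None when no mapping exists, otherwise a dict
    with a 'pdbs' list (a hypothetical structure for this sketch).
    """
    # Manually set sequences override everything else
    if has_manual:
        return 'manual'
    if uniprot is not None and kegg is not None:
        # KEGG wins only when it has PDBs and UniProt has none
        if kegg['pdbs'] and not uniprot['pdbs']:
            return 'kegg'
        return 'uniprot'
    if uniprot is not None:
        return 'uniprot'
    if kegg is not None:
        return 'kegg'
    return None


choice = pick_representative_source(False, {'pdbs': []}, {'pdbs': ['2d1f']})
print(choice)  # kegg
```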
set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]

Set the representative structure for each protein, chosen from the structures in its structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

  • Homology model(s)
  • Ranked PDBs
  • BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
  • struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • skip_large_structures (bool) – Default False. Large structures currently can’t be saved as a PDB file, even if you only want to save a single chain, so Biopython throws an error when trying to do so. As a workaround, if a large structure is selected as representative, the pipeline currently points to it without cleaning it. To skip large structures instead, set this to True.
  • clean (bool) – If structures should be cleaned
  • force_rerun (bool) – If sequence to structure alignment should be rerun

Todo

  • Remedy large structure representative setting
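The homology-model preference described for always_use_homology can be sketched in plain Python. This is an illustration only, not ssbio's implementation: the Candidate record, its field names, and pick_representative are hypothetical, and the fallback to the first experimental structure that meets the identity cutoff is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical stand-in for a structure attached to a protein."""
    ident: str
    is_homology_model: bool
    seq_coverage: float   # fraction of the reference sequence covered
    seq_identity: float   # fraction of identical residues


def pick_representative(candidates: List[Candidate],
                        always_use_homology: bool = False,
                        seq_ident_cutoff: float = 0.5) -> Optional[Candidate]:
    """Return one candidate, or None if nothing qualifies."""
    models = [c for c in candidates if c.is_homology_model]
    if always_use_homology and models:
        # Multiple homology models are ranked by percent sequence coverage
        return max(models, key=lambda c: c.seq_coverage)
    # Assumption: otherwise take the first experimental structure, in the
    # given order, that meets the sequence identity cutoff
    for c in candidates:
        if not c.is_homology_model and c.seq_identity >= seq_ident_cutoff:
            return c
    return None


candidates = [
    Candidate("pdb_a", False, 0.95, 0.40),
    Candidate("pdb_b", False, 0.90, 0.98),
    Candidate("model_1", True, 0.80, 1.00),
    Candidate("model_2", True, 0.99, 1.00),
]
print(pick_representative(candidates).ident)
print(pick_representative(candidates, always_use_homology=True).ident)
```

With the flag off, the sketch returns the first experimental structure above the cutoff; with it on, the homology model with the highest coverage wins.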
structures_dir

str – Directory where all structures are stored.

uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.

Parameters:
  • model_gene_source (str) –

    The database source of your model gene IDs (see http://www.uniprot.org/help/api_idmapping). Common model gene sources are:

    • Ensembl Genomes - ENSEMBLGENOME_ID (e.g. E. coli b-numbers)
    • Entrez Gene (GeneID) - P_ENTREZGENEID
    • RefSeq Protein - P_REFSEQ_AC
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
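As a rough sketch of how a custom_gene_mapping dictionary might be applied before the lookup, the snippet below checks that every key is a model gene and falls back to the gene's own ID when no custom entry exists. The helper name apply_custom_mapping and the fallback behaviour are assumptions for illustration, not part of ssbio.

```python
def apply_custom_mapping(model_genes, custom_gene_mapping=None):
    """Return {model gene ID: ID to submit to the mapping service}."""
    custom_gene_mapping = custom_gene_mapping or {}
    # Dictionary keys must match model genes
    unknown = set(custom_gene_mapping) - set(model_genes)
    if unknown:
        raise KeyError("Keys not found in model genes: {}".format(sorted(unknown)))
    # Assumption: genes without a custom entry are looked up under their own ID
    return {g: custom_gene_mapping.get(g, g) for g in model_genes}


model_genes = ["b0001", "b0002", "gene_x"]  # placeholder IDs
mapping = apply_custom_mapping(model_genes, {"gene_x": "b0003"})
print(mapping)
```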
write_representative_sequences_file(outname, outdir=None, set_ids_from_model=True)[source]

Write all the model’s sequences as a single FASTA file. By default, sets IDs to model gene IDs.

Parameters:
  • outname (str) – Name of the output FASTA file without the extension
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_ids_from_model (bool) – If the gene ID source should be the model gene IDs, not the original sequence ID
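The effect of set_ids_from_model can be pictured with a small standard-library sketch that writes a multi-record FASTA file. The function write_fasta, the record tuples, the placeholder sequences, and the .fasta extension are all assumptions made for this example; ssbio handles this internally.

```python
import os
import tempfile


def write_fasta(records, outname, outdir, set_ids_from_model=True):
    """Write (model_gene_id, original_seq_id, sequence) tuples to one file."""
    outfile = os.path.join(outdir, outname + ".fasta")  # extension is assumed
    with open(outfile, "w") as handle:
        for gene_id, seq_id, seq in records:
            # Choose which identifier becomes the FASTA header
            header = gene_id if set_ids_from_model else seq_id
            handle.write(">{}\n{}\n".format(header, seq))
    return outfile


records = [("geneA", "seqA", "MSKQQ"), ("geneB", "seqB", "MAVMG")]
fasta_path = write_fasta(records, "demo", tempfile.mkdtemp())
with open(fasta_path) as handle:
    fasta_lines = handle.read().splitlines()
print(fasta_lines)
```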

Further reading

For examples in which structures have been integrated into a GEM and utilized on a genome-scale, please see the following:

[1]Zhang Y, Thiele I, Weekes D, Li Z, Jaroszewski L, Ginalski K, et al. Three-dimensional structural view of the central metabolic network of Thermotoga maritima. Science. 2009 Sep 18;325(5947):1544–9. Available from: http://dx.doi.org/10.1126/science.1174671
[2]Brunk E, Mih N, Monk J, Zhang Z, O’Brien EJ, Bliven SE, et al. Systems biology of the structural proteome. BMC Syst Biol. 2016;10: 26. doi:10.1186/s12918-016-0271-6
[3]Monk JM, Lloyd CJ, Brunk E, Mih N, Sastry A, King Z, et al. iML1515, a knowledgebase that computes Escherichia coli traits. Nat Biotechnol. 2017;35: 904–908. doi:10.1038/nbt.3956
[4]Chang RL, Xie L, Xie L, Bourne PE, Palsson BØ. Drug off-target effects predicted using structural analysis in the context of a metabolic network model. PLoS Comput Biol. 2010 Sep 23;6(9):e1000938. Available from: http://dx.doi.org/10.1371/journal.pcbi.1000938
[5]Chang RL, Andrews K, Kim D, Li Z, Godzik A, Palsson BO. Structural systems biology evaluation of metabolic thermotolerance in Escherichia coli. Science. 2013 Jun 7;340(6137):1220–3. Available from: http://dx.doi.org/10.1126/science.1234012
[6]Chang RL, Xie L, Bourne PE, Palsson BO. Antibacterial mechanisms identified through structural systems pharmacology. BMC Syst Biol. 2013 Oct 10;7:102. Available from: http://dx.doi.org/10.1186/1752-0509-7-102
[7]Mih N, Brunk E, Bordbar A, Palsson BO. A Multi-scale Computational Platform to Mechanistically Assess the Effect of Genetic Variation on Drug Responses in Human Erythrocyte Metabolism. PLoS Comput Biol. 2016;12: e1005039. doi:10.1371/journal.pcbi.1005039
[8]Chen K, Gao Y, Mih N, O’Brien EJ, Yang L, Palsson BO. Thermosensitivity of growth is determined by chaperone-mediated proteome reallocation. Proceedings of the National Academy of Sciences. 2017;114: 11548–11553. doi:10.1073/pnas.1705524114
[9]Yang L, Mih N, Yurkovich JT, Park JH, Seo S, Kim D, et al. Multi-scale model of the proteomic and metabolic consequences of reactive oxygen species. bioRxiv. 2017. p. 227892. doi:10.1101/227892

References

[10]Bosi, E, Monk, JM, Aziz, RK, Fondi, M, Nizet, V, & Palsson, BO. (2016). Comparative genome-scale modelling of Staphylococcus aureus strains identifies strain-specific metabolic capabilities linked to pathogenicity. Proceedings of the National Academy of Sciences of the United States of America, 113/26: E3801–9. DOI: 10.1073/pnas.1523199113
[11]Monk, JM, Koza, A, Campodonico, MA, Machado, D, Seoane, JM, Palsson, BO, Herrgård, MJ, et al. (2016). Multi-omics Quantification of Species Variation of Escherichia coli Links Molecular Features with Strain Phenotypes. Cell systems, 3/3: 238–51.e12. DOI: 10.1016/j.cels.2016.08.013
[12]Ong, WK, Vu, TT, Lovendahl, KN, Llull, JM, Serres, MH, Romine, MF, & Reed, JL. (2014). Comparisons of Shewanella strains based on genome annotations, modeling, and experiments. BMC systems biology, 8: 31. DOI: 10.1186/1752-0509-8-31

The Protein Class

Introduction

This section gives an overview of the methods available on the Protein class, which represents a protein as a collection of amino acid sequences and 3D structures.

Tutorials

Protein - Structure Mapping, Alignments, and Visualization

This notebook gives an example of how to map a single protein sequence to its structure, along with conducting sequence alignments and visualizing the mutations.

Input: Protein ID + amino acid sequence + mutated sequence(s)
Output: Representative protein structure, sequence alignments, and visualization of mutations
Imports
In [ ]:
import sys
import logging
In [ ]:
# Import the Protein class
from ssbio.core.protein import Protein
In [ ]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff

Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!

In [ ]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [ ]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
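To see what a given level filters out, here is a small self-contained sketch using only the standard logging module. It writes to an in-memory stream instead of stderr, and the logger name ssbio_demo is a made-up name for this example.

```python
import io
import logging

stream = io.StringIO()
demo_logger = logging.getLogger("ssbio_demo")  # hypothetical logger name
demo_logger.propagate = False  # keep the demo separate from the root logger
demo_handler = logging.StreamHandler(stream)
demo_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
demo_logger.handlers = [demo_handler]

demo_logger.setLevel(logging.INFO)
demo_logger.debug("hidden at INFO level")
demo_logger.info("shown at INFO level")
demo_logger.warning("shown at INFO level too")

print(stream.getvalue())
```

Messages below the configured level are dropped, so only the INFO and WARNING lines reach the stream.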
Initialization of the project

Set these three things:

  • ROOT_DIR
    • The directory where a folder named after your PROTEIN_ID will be created
  • PROTEIN_ID
    • Your protein ID
  • PROTEIN_SEQ
    • Your protein sequence

A directory will be created in ROOT_DIR with your PROTEIN_ID name. The folders are organized like so:

ROOT_DIR
└── PROTEIN_ID
    ├── sequences  # Protein sequence files, alignments, etc.
    └── structures  # Protein structure files, calculations, etc.
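For orientation, the layout above can be reproduced with the standard library alone. ssbio creates these folders itself when a Protein is initialized with a root_dir; the helper make_protein_dirs below is hypothetical and only mirrors the tree.

```python
import os
import tempfile


def make_protein_dirs(root_dir, protein_id):
    """Create ROOT_DIR/PROTEIN_ID/{sequences,structures} and return the paths."""
    protein_dir = os.path.join(root_dir, protein_id)
    subdirs = {name: os.path.join(protein_dir, name)
               for name in ("sequences", "structures")}
    for path in subdirs.values():
        os.makedirs(path, exist_ok=True)
    return protein_dir, subdirs


demo_root = tempfile.mkdtemp()
protein_dir, subdirs = make_protein_dirs(demo_root, "SRR1753782_00918")
print(sorted(os.listdir(protein_dir)))
```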
In [ ]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROTEIN_ID = 'SRR1753782_00918'
PROTEIN_SEQ = 'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
class ssbio.core.protein.Protein(ident, description=None, root_dir=None, pdb_file_type='mmtf')[source]

Store information about a protein, which represents the monomeric translated unit of a gene.

The main utilities of this class are to:

  • Load, parse, and store the same (i.e. from different database sources) or similar (e.g. from different strains) protein sequences as SeqProp objects in the sequences attribute
  • Load, parse, and store multiple experimental or predicted protein structures as StructProp objects in the structures attribute
  • Set a single representative_sequence and representative_structure
  • Calculate, store, and access pairwise sequence alignments to the representative sequence or structure
  • Provide summaries of alignments and mutations seen
  • Map between residue numbers of sequences and structures
Parameters:
  • ident (str) – Unique identifier for this protein
  • description (str) – Optional description for this protein
  • root_dir (str) – Path to where the folder named by this protein’s ID will be created. Default is current working directory.
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB

Todo

  • Implement structural alignment objects with FATCAT
In [ ]:
# Create the Protein object
my_protein = Protein(ident=PROTEIN_ID, root_dir=ROOT_DIR, pdb_file_type='mmtf')
In [ ]:
# Load the protein sequence
# This sets the loaded sequence as the representative one
my_protein.load_manual_sequence(seq=PROTEIN_SEQ, ident='WT', write_fasta_file=True, set_as_representative=True)
Mapping sequence --> structure

Since the sequence has been provided, we just need to BLAST it to the PDB.

Note: These methods do not download any 3D structure files.

Methods
Protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]

BLAST the representative protein sequence to the PDB. Saves a raw BLAST result file (XML file).

Parameters:
  • seq_ident_cutoff (float, optional) – Cutoff for filtering results by percent sequence identity (in decimal form)
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • display_link (bool, optional) – Set to True if links to the HTML results should be displayed
  • outdir (str) – Path to output directory of downloaded XML files, must be set if protein directory was not initialized
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False.
Returns:

List of new PDBProp objects added to the structures attribute

Return type:

list

In [ ]:
# Mapping using BLAST
my_protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0.9, evalue=0.00001)
my_protein.df_pdb_blast.head()
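The two cutoffs used in the call above can be illustrated with a minimal filter over made-up hits. This sketch assumes the identity cutoff is applied to the fraction of identical residues, as the parameter name suggests, and that a cutoff of 0 disables it. The hit dictionaries, their keys, and filter_blast_hits are hypothetical and do not reflect the format of the real BLAST results.

```python
def filter_blast_hits(hits, seq_ident_cutoff=0.0, evalue=0.0001):
    """Keep hits that pass the E-value and sequence identity cutoffs."""
    kept = []
    for hit in hits:
        if hit["evalue"] > evalue:
            continue  # not significant enough
        if seq_ident_cutoff and hit["percent_ident"] < seq_ident_cutoff:
            continue  # too dissimilar; a cutoff of 0 disables this check
        kept.append(hit)
    return kept


hits = [
    {"id": "hit_a", "evalue": 1e-50, "percent_ident": 0.99},
    {"id": "hit_b", "evalue": 1e-20, "percent_ident": 0.62},
    {"id": "hit_c", "evalue": 1e-3, "percent_ident": 0.95},
]
kept_ids = [h["id"] for h in filter_blast_hits(hits, seq_ident_cutoff=0.9, evalue=0.00001)]
print(kept_ids)
```

A lower E-value cutoff and a higher identity cutoff both shrink the list of structures that end up mapped to the protein.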
Downloading and ranking structures
Methods
Protein.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to the protein structures directory.

Parameters:
  • outdir (str) – Path to output directory, if protein structures directory not set or other output directory is desired
  • pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
  • force_rerun (bool) – If files should be re-downloaded if they already exist
Returns:

List of PDB IDs that were downloaded

Return type:

list

Todo

  • Parse mmtf or PDB file for header information, rather than always getting the cif file for header info
Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.
In [ ]:
# Download all mapped PDBs and gather the metadata
my_protein.pdb_downloader_and_metadata()
my_protein.df_pdb_metadata.head(2)
Protein.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, clean=True, keep_chemicals=None, skip_large_structures=False, force_rerun=False)[source]

Set a representative structure from a structure in the structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.
  • Homology model(s)
  • Ranked PDBs
  • BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files, must be set if Protein directory was not created initially
  • struct_outdir (str) – Path to output directory of structure files, must be set if Protein directory was not created initially
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • clean (bool) – If structure should be cleaned
  • keep_chemicals (str, list) – Keep specified chemical names if structure is to be cleaned
  • skip_large_structures (bool) – Default False. Large structures currently can’t be saved as a PDB file, even if you only want to save a single chain, so Biopython throws an error when trying to do so. As a workaround, if a large structure is selected as representative, the pipeline currently points to it without cleaning it. To skip large structures instead, set this to True.
  • force_rerun (bool) – If sequence to structure alignment should be rerun
Returns:

Representative structure from the list of structures. This is not a reference to the original structure; it is a copy of the original one, optionally cleaned.

Return type:

StructProp

Todo

  • Remedy large structure representative setting
In [ ]:
# Set representative structures
my_protein.set_representative_structure()
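The allow_missing_on_termini arithmetic from the parameter description (0.1 on a 100 AA reference leaves residues 5 to 95) can be checked with a short sketch. The function checked_window is hypothetical, and splitting the ignored fraction evenly between the two termini is inferred from that documented example rather than taken from the ssbio source.

```python
def checked_window(seq_length, allow_missing_on_termini=0.2):
    """Return (first, last) residue positions that are still checked."""
    # Half of the ignored fraction is trimmed from each terminus
    trim = int(round(seq_length * allow_missing_on_termini / 2))
    return trim, seq_length - trim


print(checked_window(100, 0.1))  # (5, 95), matching the documented example
print(checked_window(100))       # default of 0.2
```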
Loading and aligning new sequences

You can load additional sequences into this protein object and align them to the representative sequence.

Methods
Protein.load_manual_sequence(seq, ident=None, write_fasta_file=False, outdir=None, set_as_representative=False, force_rewrite=False)[source]

Load a manual sequence given as a string and optionally set it as the representative sequence. Also store it in the sequences attribute.

Parameters:
  • seq (str, Seq, SeqRecord) – Sequence string, Biopython Seq or SeqRecord object
  • ident (str) – Optional identifier for the sequence, required if seq is a string. Also will override existing IDs in Seq or SeqRecord objects if set.
  • write_fasta_file (bool) – If this sequence should be written out to a FASTA file
  • outdir (str) – Path to output directory
  • set_as_representative (bool) – If this sequence should be set as the representative one
  • force_rewrite (bool) – If the FASTA file should be overwritten if it already exists
Returns:

Sequence that was loaded into the sequences attribute

Return type:

SeqProp

In [ ]:
# Input your mutated sequence and load it
mutated_protein1_id = 'N17P_SNP'
mutated_protein1_seq = 'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein1_id, seq=mutated_protein1_seq)
In [ ]:
# Input another mutated sequence and load it
mutated_protein2_id = 'Q4S_N17P_SNP'
mutated_protein2_seq = 'MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein2_id, seq=mutated_protein2_seq)
Protein.pairwise_align_sequences_to_representative(gapopen=10, gapextend=0.5, outdir=None, engine='needle', parse=True, force_rerun=False)[source]

Pairwise align all sequences in the sequences attribute to the representative sequence. Stores the alignments in the sequence_alignments DictList attribute.

Parameters:
  • gapopen (int) – Only for engine='needle' - Gap open penalty is the score taken away when a gap is created
  • gapextend (float) – Only for engine='needle' - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outdir (str) – Only for engine='needle' - Path to output directory. Default is the protein sequence directory.
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
  • force_rerun (bool) – Only for engine='needle' - Default False, set to True if you want to rerun the alignment if outfile exists.
In [ ]:
# Conduct pairwise sequence alignments
my_protein.pairwise_align_sequences_to_representative()
In [ ]:
# View IDs of all sequence alignments
[x.id for x in my_protein.sequence_alignments]

# View the stored information for one of the alignments
my_alignment = my_protein.sequence_alignments.get_by_id('SRR1753782_00918_N17P_SNP')
my_alignment.annotations
str(my_alignment[0].seq)
str(my_alignment[1].seq)
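To make the idea of parsing an alignment for mutations concrete, here is a minimal sketch that walks two already-aligned strings and reports point mutations as (reference residue, position, new residue) tuples. The function find_point_mutations is hypothetical and is not how ssbio stores its annotations. The inputs are the first 20 residues of the tutorial sequences, which need no gaps.

```python
def find_point_mutations(ref_aln, other_aln):
    """Compare two gapped, equal-length aligned strings column by column."""
    if len(ref_aln) != len(other_aln):
        raise ValueError("Aligned sequences must have the same length")
    mutations = []
    ref_pos = 0  # 1-based position in the ungapped reference sequence
    for ref_aa, other_aa in zip(ref_aln, other_aln):
        if ref_aa != "-":
            ref_pos += 1
        if "-" in (ref_aa, other_aa):
            continue  # insertions and deletions are not point mutations
        if ref_aa != other_aa:
            mutations.append((ref_aa, ref_pos, other_aa))
    return mutations


wt = "MSKQQIGVVGMAVMGRNLAL"        # first 20 residues of PROTEIN_SEQ
n17p = "MSKQQIGVVGMAVMGRPLAL"      # first 20 residues of N17P_SNP
q4s_n17p = "MSKSQIGVVGMAVMGRPLAL"  # first 20 residues of Q4S_N17P_SNP

print(find_point_mutations(wt, n17p))
print(find_point_mutations(wt, q4s_n17p))
```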
Protein.sequence_mutation_summary(alignment_ids=None, alignment_type=None)[source]

Summarize all mutations found in the sequence_alignments attribute.

Returns 2 dictionaries, single_counter and fingerprint_counter.

single_counter:

Dictionary of {point mutation: list of genes/strains} Example:

{
    ('A', 24, 'V'): ['Strain1', 'Strain2', 'Strain4'],
    ('R', 33, 'T'): ['Strain2']
}

Here, we report which genes/strains have the single point mutation.

fingerprint_counter:

Dictionary of {mutation group: list of genes/strains} Example:

{
    (('A', 24, 'V'), ('R', 33, 'T')): ['Strain2'],
    (('A', 24, 'V'),): ['Strain1', 'Strain4']
}

Here, we report which genes/strains have the specific combinations (or “fingerprints”) of point mutations.

Parameters:
  • alignment_ids (str, list) – Specified alignment ID or IDs to use
  • alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object, seqalign or structalign are the current types.
Returns:

single_counter, fingerprint_counter

Return type:

dict, dict

In [ ]:
# Summarize all the mutations in all sequence alignments
s,f = my_protein.sequence_mutation_summary(alignment_type='seqalign')
print('Single mutations:')
s
print('---------------------')
print('Mutation fingerprints')
f
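The two dictionaries can be reproduced from per-strain mutation lists with a few lines of standard-library Python. This is only a sketch of the data structure, assuming each strain's mutations are already known. The name summarize_mutations and the input format are hypothetical.

```python
from collections import defaultdict


def summarize_mutations(mutations_by_strain):
    """Build single_counter and fingerprint_counter from {strain: [mutations]}."""
    single_counter = defaultdict(list)
    fingerprint_counter = defaultdict(list)
    for strain, mutations in mutations_by_strain.items():
        if not mutations:
            continue  # strains identical to the reference have no fingerprint
        for mutation in mutations:
            single_counter[mutation].append(strain)
        # The full combination of mutations is the strain's fingerprint
        fingerprint_counter[tuple(mutations)].append(strain)
    return dict(single_counter), dict(fingerprint_counter)


mutations_by_strain = {
    "Strain1": [("A", 24, "V")],
    "Strain2": [("A", 24, "V"), ("R", 33, "T")],
    "Strain4": [("A", 24, "V")],
}
single_counter, fingerprint_counter = summarize_mutations(mutations_by_strain)
print(single_counter)
print(fingerprint_counter)
```

Note that a fingerprint with a single mutation is still a tuple of tuples, so its key carries a trailing comma when written out by hand.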
Some additional methods
Getting binding site/other information from UniProt
In [ ]:
import ssbio.databases.uniprot
In [ ]:
this_examples_uniprot = 'P14062'
sites = ssbio.databases.uniprot.uniprot_sites(this_examples_uniprot)
my_protein.representative_sequence.features = sites
my_protein.representative_sequence.features
Mapping sequence residue numbers to structure residue numbers
Methods
Protein.map_seqprop_resnums_to_structprop_resnums(resnums, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Map a residue number in any SeqProp to the structure’s residue number for a specified chain.

Parameters:
  • resnums (int, list) – Residue numbers in the sequence
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – Chain ID to map to
  • use_representatives (bool) – If the representative sequence and structure should be used. If True, seqprop, structprop, and chain_id do not need to be defined.
Returns:

Mapping of sequence residue numbers to structure residue numbers

Return type:

dict

In [ ]:
# Returns a dictionary mapping sequence residue numbers to structure residue identifiers
# Will warn you if residues are not present in the structure
structure_sites = my_protein.map_seqprop_resnums_to_structprop_resnums(resnums=[1,3,45],
                                                                       use_representatives=True)
structure_sites
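Conceptually, this mapping follows a sequence-to-structure alignment: each aligned column pairs a sequence position with a resolved structure residue. The sketch below shows that idea on toy data. The function map_seq_to_struct_resnums, its arguments, and the residue numbering starting at 101 are invented for illustration and are not the ssbio implementation.

```python
def map_seq_to_struct_resnums(seq_aln, struct_aln, struct_resnums, resnums):
    """Map 1-based sequence positions to structure residue numbers.

    seq_aln and struct_aln are gapped, equal-length aligned strings;
    struct_resnums lists the residue numbers of the structure chain in order.
    """
    wanted = set(resnums)
    mapping = {}
    seq_pos = 0     # position in the ungapped sequence
    struct_idx = 0  # index into struct_resnums
    for seq_aa, struct_aa in zip(seq_aln, struct_aln):
        if seq_aa != "-":
            seq_pos += 1
        if struct_aa != "-":
            if seq_aa != "-" and seq_pos in wanted:
                mapping[seq_pos] = struct_resnums[struct_idx]
            struct_idx += 1
    return mapping  # positions absent from the structure are left out


toy_mapping = map_seq_to_struct_resnums(
    seq_aln="MSKQQIGV",
    struct_aln="--KQQIGV",  # first two residues unresolved in the structure
    struct_resnums=[101, 102, 103, 104, 105, 106],
    resnums=[1, 3, 5],
)
print(toy_mapping)
```

Residue 1 is missing from the result because it has no counterpart in the toy structure, which is the situation the warning mentioned above is meant to flag.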
Viewing structures

The awesome package nglview is utilized as a backend for viewing structures within a Jupyter notebook. ssbio view functions either return an NGLWidget object, just as nglview does in the example below, or act on an existing widget object.

# This is how NGLview usually works - it will load a structure file and return a NGLWidget "view" object.
import nglview
view = nglview.show_structure_file(my_protein.representative_structure.structure_path)
view
Methods
StructProp.view_structure(only_chains=None, opacity=1.0, recolor=False, gui=False)[source]

Use NGLviewer to display a structure in a Jupyter notebook

Parameters:
  • only_chains (str, list) – Chain ID or IDs to display
  • opacity (float) – Opacity of the structure
  • recolor (bool) – If structure should be cleaned and recolored to silver
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

In [ ]:
# View just the structure
view = my_protein.representative_structure.view_structure()
view
Protein.add_mutations_to_nglview(view, alignment_type='seqalign', alignment_ids=None, seqprop=None, structprop=None, chain_id=None, use_representatives=False, grouped=False, color='red', unique_colors=True, opacity_range=(0.8, 1), scale_range=(1, 5))[source]

Add representations to an NGLWidget view object for residues that are mutated in the sequence_alignments attribute.

Parameters:
  • view (NGLWidget) – NGLWidget view object
  • alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object, seqalign or structalign are the current types.
  • alignment_ids (str, list) – Specified alignment ID or IDs to use
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – ID of the structure’s chain to get annotation from
  • use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
  • grouped (bool) – If groups of mutations should be colored and sized together
  • color (str) – Color of the mutations (overridden if unique_colors=True)
  • unique_colors (bool) – If each mutation/mutation group should be colored uniquely
  • opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
  • scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)
In [ ]:
# Map the mutations on the visualization (scale increased) - will show up on the above view
my_protein.add_mutations_to_nglview(view=view, alignment_type='seqalign', scale_range=(4,7),
                                    use_representatives=True)
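The scale_range and opacity_range parameters map how often a mutation appears onto a size or opacity. One simple way to picture this is linear interpolation between the minimum and maximum of the range, as sketched below. The linear rule, the function scale_by_count, and the handling of equal counts are assumptions; the actual scaling in ssbio may differ.

```python
def scale_by_count(counts, out_range=(1, 5)):
    """Linearly rescale occurrence counts into out_range (min, max)."""
    low, high = out_range
    count_min, count_max = min(counts.values()), max(counts.values())
    if count_min == count_max:
        # Assumption: with nothing to rank, every mutation gets the maximum
        return {key: float(high) for key in counts}
    span = count_max - count_min
    return {key: low + (value - count_min) * (high - low) / span
            for key, value in counts.items()}


# In the tutorial, N17P occurs in both mutant sequences and Q4S in one
counts = {("N", 17, "P"): 2, ("Q", 4, "S"): 1}
sizes = scale_by_count(counts, out_range=(4, 7))
print(sizes)
```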
Protein.add_features_to_nglview(view, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Add select features from the selected SeqProp object to an NGLWidget view object.

Currently parsing for:

  • Single residue features (e.g. metal binding sites)
  • Disulfide bonds
Parameters:
  • view (NGLWidget) – NGLWidget view object
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – ID of the structure’s chain to get annotation from
  • use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
In [ ]:
# Add sites as shown above in the table to the view
my_protein.add_features_to_nglview(view=view, use_representatives=True)
Saving
Protein.save_json(outfile, compression=False)

Save the object as a JSON file using json_tricks

In [ ]:
import os.path as op
my_protein.save_json(op.join(my_protein.protein_dir, '{}.json'.format(my_protein.id)))

Features

  • Load, parse, and store the same (i.e. from different database sources) or similar (e.g. from different strains) protein sequences as SeqProp objects in the sequences attribute
  • Load, parse, and store multiple experimental or predicted protein structures as StructProp objects in the structures attribute
  • Set a single representative_sequence and representative_structure
  • Calculate, store, and access pairwise sequence alignments to the representative sequence or structure
  • Provide summaries of alignments and mutations seen
  • Map between residue numbers of sequences and structures

API

Protein
class ssbio.core.protein.Protein(ident, description=None, root_dir=None, pdb_file_type='mmtf')[source]

Store information about a protein, which represents the monomeric translated unit of a gene.

The main utilities of this class are to:

  • Load, parse, and store the same (i.e. from different database sources) or similar (e.g. from different strains) protein sequences as SeqProp objects in the sequences attribute
  • Load, parse, and store multiple experimental or predicted protein structures as StructProp objects in the structures attribute
  • Set a single representative_sequence and representative_structure
  • Calculate, store, and access pairwise sequence alignments to the representative sequence or structure
  • Provide summaries of alignments and mutations seen
  • Map between residue numbers of sequences and structures
Parameters:
  • ident (str) – Unique identifier for this protein
  • description (str) – Optional description for this protein
  • root_dir (str) – Path to where the folder named by this protein’s ID will be created. Default is current working directory.
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB

Todo

  • Implement structural alignment objects with FATCAT
add_features_to_nglview(view, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Add select features from the selected SeqProp object to an NGLWidget view object.

Currently parsing for:

  • Single residue features (e.g. metal binding sites)
  • Disulfide bonds
Parameters:
  • view (NGLWidget) – NGLWidget view object
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – ID of the structure’s chain to get annotation from
  • use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
add_fingerprint_to_nglview(view, fingerprint, seqprop=None, structprop=None, chain_id=None, use_representatives=False, color='red', opacity_range=(0.8, 1), scale_range=(1, 5))[source]

Add representations to an NGLWidget view object for the residues in a single mutation fingerprint (mutation group).

Parameters:
  • view (NGLWidget) – NGLWidget view object
  • fingerprint (dict) – Single mutation group from the sequence_mutation_summary function
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – ID of the structure’s chain to get annotation from
  • use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
  • color (str) – Color of the mutations
  • opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
  • scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)
add_mutations_to_nglview(view, alignment_type='seqalign', alignment_ids=None, seqprop=None, structprop=None, chain_id=None, use_representatives=False, grouped=False, color='red', unique_colors=True, opacity_range=(0.8, 1), scale_range=(1, 5))[source]

Add representations to an NGLWidget view object for residues that are mutated in the sequence_alignments attribute.

Parameters:
  • view (NGLWidget) – NGLWidget view object
  • alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object, seqalign or structalign are the current types.
  • alignment_ids (str, list) – Specified alignment ID or IDs to use
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – ID of the structure’s chain to get annotation from
  • use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
  • grouped (bool) – If groups of mutations should be colored and sized together
  • color (str) – Color of the mutations (overridden if unique_colors=True)
  • unique_colors (bool) – If each mutation/mutation group should be colored uniquely
  • opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
  • scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)
align_seqprop_to_structprop(seqprop, structprop, chains=None, outdir=None, engine='needle', structure_already_parsed=False, parse=True, force_rerun=False, **kwargs)[source]

Run and store alignments of a SeqProp to chains in the mapped_chains attribute of a StructProp.

Alignments are stored in the sequence_alignments attribute, with the IDs formatted as <SeqProp_ID>_<StructProp_ID>-<Chain_ID>. Although it is more intuitive to align to individual ChainProps, StructProps should be loaded as little as possible to reduce run times so the alignment is done to the entire structure.

Parameters:
  • seqprop (SeqProp) – SeqProp object with a loaded sequence
  • structprop (StructProp) – StructProp object with a loaded structure
  • chains (str, list) – Chain ID or IDs to map to. If not specified, mapped_chains attribute is inspected for chains. If no chains there, all chains will be aligned to.
  • outdir (str) – Directory to output sequence alignment files (only if running with needle)
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • structure_already_parsed (bool) – If the structure has already been parsed and the chain sequences are stored. Temporary option until Hadoop sequence file is implemented to reduce number of times a structure is parsed.
  • parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
  • force_rerun (bool) – If alignments should be rerun
  • **kwargs – Other alignment options

Todo

  • Document **kwargs for alignment options
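
The alignment ID format described above, <SeqProp_ID>_<StructProp_ID>-<Chain_ID>, can be sketched with a small helper. The function name alignment_id is hypothetical and is not part of ssbio; the IDs are the ones used in the PDBProp tutorial.

```python
def alignment_id(seqprop_id, structprop_id, chain_id):
    """Build an ID of the form <SeqProp_ID>_<StructProp_ID>-<Chain_ID>."""
    return "{}_{}-{}".format(seqprop_id, structprop_id, chain_id)


# One alignment ID per mapped chain of the structure
ids = [alignment_id("P0ABB0", "5T4Q", chain) for chain in ["A", "B", "C"]]
```
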
blast_representative_sequence_to_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]

BLAST the representative protein sequence to the PDB. Saves a raw BLAST result file (XML file).

Parameters:
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • display_link (bool, optional) – Set to True if links to the HTML results should be displayed
  • outdir (str) – Path to output directory of downloaded XML files, must be set if protein directory was not initialized
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False.
Returns:

List of new PDBProp objects added to the structures attribute

Return type:

list

check_structure_chain_quality(seqprop, structprop, chain_id, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True)[source]

Report if a structure’s chain meets the defined cutoffs for sequence quality.

df_homology_models

DataFrame – Get a dataframe of I-TASSER homology model results

df_pdb_blast

DataFrame – Get a dataframe of PDB BLAST results

df_pdb_metadata

DataFrame – Get a dataframe of PDB metadata (PDBs have to be downloaded first)

df_pdb_ranking

DataFrame – Get a dataframe of UniProt -> best structure in PDB results

download_all_pdbs(outdir=None, pdb_file_type=None, load_metadata=False, force_rerun=False)[source]

Downloads all structures from the PDB. The load_metadata flag sets whether metadata should be parsed and stored in the StructProp; otherwise, file paths are just linked.

filter_sequences(seq_type)[source]

Return a DictList of only specified types in the sequences attribute.

Parameters:seq_type (SeqProp) – Object type
Returns:A filtered DictList of specified object type only
Return type:DictList
find_disulfide_bridges(representative_only=True)[source]

Run Biopython’s disulfide bridge finder and store found bridges.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']

Parameters:representative_only (bool) – If analysis should only be run on the representative structure
find_representative_chain(seqprop, structprop, chains_to_check=None, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True)[source]

Set and return the representative chain based on sequence quality checks to a reference sequence.

Parameters:
  • seqprop (SeqProp) – SeqProp object to compare to chain sequences
  • structprop (StructProp) – StructProp object with chains to compare to in the mapped_chains attribute. If there are none present, chains_to_check can be specified, otherwise all chains are checked.
  • chains_to_check (str, list) – Chain ID or IDs to check for sequence coverage quality
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
Returns:

the best chain ID, if any

Return type:

str
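
The allow_missing_on_termini example above (0.1 on a 100 AA reference leaves residues 5 to 95 to be checked) can be reproduced with a short calculation. This is a sketch that assumes the ignored fraction is split evenly between the two termini, as the example implies; the helper name checked_residue_window is hypothetical.

```python
def checked_residue_window(seq_length, allow_missing_on_termini):
    """Return (start, end) residue numbers that are still checked for modifications."""
    ignored_total = seq_length * allow_missing_on_termini
    # Assumption: half of the ignored residues come from each terminus
    ignored_per_end = int(round(ignored_total / 2))
    return ignored_per_end, seq_length - ignored_per_end


window = checked_residue_window(100, 0.1)
```
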

get_dssp_annotations(representative_only=True, force_rerun=False)[source]

Run DSSP on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']

Parameters:
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists

Todo

  • Some errors arise from storing annotations for nonstandard amino acids, need to run DSSP separately for those
get_experimental_structures()[source]

DictList: Return a DictList of all experimental structures in self.structures

get_freesasa_annotations(include_hetatms=False, representative_only=True, force_rerun=False)[source]

Run freesasa on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-freesasa']

Parameters:
  • include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_homology_models()[source]

DictList: Return a DictList of all homology models in self.structures

get_msms_annotations(representative_only=True, force_rerun=False)[source]

Run MSMS on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-msms']

Parameters:
  • representative_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_residue_annotations(seq_resnum, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Get all residue-level annotations stored in the SeqProp letter_annotations field for a given residue number.

Uses the representative sequence, structure, and chain ID stored by default. If other properties from other structures are desired, input the proper IDs. An alignment for the given sequence to the structure must be present in the sequence_alignments list.

Parameters:
  • seq_resnum (int) – Residue number in the sequence
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – ID of the structure’s chain to get annotation from
  • use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
Returns:

All available letter_annotations for this residue number

Return type:

dict
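Per-residue annotations are stored as one list per annotation key, with one entry per residue. The sketch below shows only the lookup step on a plain dictionary; it skips the sequence-to-structure mapping that the real method performs, and the key names and values are made-up placeholders following the '*-dssp' and '*-freesasa' naming pattern.

```python
# Hypothetical letter_annotations for a 4-residue chain
letter_annotations = {
    "SS-dssp": ["-", "H", "H", "E"],
    "RSA-freesasa": [0.81, 0.12, 0.05, 0.40],
}


def annotations_for_residue(letter_annotations, resnum):
    """Collect every per-residue annotation for a 1-based residue number."""
    index = resnum - 1  # annotation lists are 0-indexed
    return {key: values[index] for key, values in letter_annotations.items()}


residue_2 = annotations_for_residue(letter_annotations, 2)
```
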

get_seqprop_to_structprop_alignment_stats(seqprop, structprop, chain_id)[source]

Get the sequence alignment information for a sequence to a structure’s chain.

get_sequence_properties(representative_only=True)[source]

Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of the protein sequences. Results are stored in the protein’s respective SeqProp objects at .annotations

Parameters:representative_only (bool) – If analysis should only be run on the representative sequence
load_itasser_folder(ident, itasser_folder, organize=False, outdir=None, organize_name=None, set_as_representative=False, representative_chain='X', force_rerun=False)[source]

Load the results folder from an I-TASSER run (local, not from the website) and copy relevant files over to the protein structures directory.

Parameters:
  • ident (str) – I-TASSER ID
  • itasser_folder (str) – Path to results folder
  • organize (bool) – If select files from modeling should be copied to the Protein directory
  • outdir (str) – Path to directory where files will be copied and organized to
  • organize_name (str) – Basename of files to rename results to. If not provided, will use id attribute.
  • set_as_representative (bool) – If this structure should be set as the representative structure
  • representative_chain (str) – If set_as_representative is True, provide the representative chain ID
  • force_rerun (bool) – If the PDB should be reloaded if it is already in the list of structures
Returns:

The object that is now contained in the structures attribute

Return type:

ITASSERProp

load_kegg(kegg_id, kegg_organism_code=None, kegg_seq_file=None, kegg_metadata_file=None, set_as_representative=False, download=False, outdir=None, force_rerun=False)[source]

Load a KEGG ID, sequence, and metadata files into the sequences attribute.

Parameters:
  • kegg_id (str) – KEGG ID
  • kegg_organism_code (str) – KEGG organism code to prepend to the kegg_id if not part of it already. Example: eco:b1244, eco is the organism code
  • kegg_seq_file (str) – Path to KEGG FASTA file
  • kegg_metadata_file (str) – Path to KEGG metadata file (raw KEGG format)
  • set_as_representative (bool) – If this KEGG ID should be set as the representative sequence
  • download (bool) – If the KEGG sequence and metadata files should be downloaded if not provided
  • outdir (str) – Where the sequence and metadata files should be downloaded to
  • force_rerun (bool) – If ID should be reloaded and files redownloaded
Returns:

object contained in the sequences attribute

Return type:

KEGGProp
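The kegg_organism_code behavior described above (prepend the code only when it is not already part of the ID) can be sketched as follows. The helper full_kegg_id is hypothetical, and checking for a colon is an assumption about how "already part of it" might be detected.

```python
def full_kegg_id(kegg_id, kegg_organism_code=None):
    """Prepend the organism code unless the ID already carries one."""
    if kegg_organism_code and ":" not in kegg_id:
        return "{}:{}".format(kegg_organism_code, kegg_id)
    return kegg_id


with_code = full_kegg_id("b1244", "eco")
already_prefixed = full_kegg_id("eco:b1244", "eco")
```
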

load_manual_sequence(seq, ident=None, write_fasta_file=False, outdir=None, set_as_representative=False, force_rewrite=False)[source]

Load a manual sequence given as a string and optionally set it as the representative sequence. Also store it in the sequences attribute.

Parameters:
  • seq (str, Seq, SeqRecord) – Sequence string, Biopython Seq or SeqRecord object
  • ident (str) – Optional identifier for the sequence, required if seq is a string. Also will override existing IDs in Seq or SeqRecord objects if set.
  • write_fasta_file (bool) – If this sequence should be written out to a FASTA file
  • outdir (str) – Path to output directory
  • set_as_representative (bool) – If this sequence should be set as the representative one
  • force_rewrite (bool) – If the FASTA file should be overwritten if it already exists
Returns:

Sequence that was loaded into the sequences attribute

Return type:

SeqProp

load_manual_sequence_file(ident, seq_file, copy_file=False, outdir=None, set_as_representative=False)[source]

Load a manual sequence given as a FASTA file, and optionally set it as the representative sequence. Also store it in the sequences attribute.

Parameters:
  • ident (str) – Sequence ID
  • seq_file (str) – Path to sequence FASTA file
  • copy_file (bool) – If the FASTA file should be copied to the protein’s sequences folder or the outdir, if protein folder has not been set
  • outdir (str) – Path to output directory
  • set_as_representative (bool) – If this sequence should be set as the representative one
Returns:

Sequence that was loaded into the sequences attribute

Return type:

SeqProp

load_pdb(pdb_id, mapped_chains=None, pdb_file=None, file_type=None, is_experimental=True, set_as_representative=False, representative_chain=None, force_rerun=False)[source]

Load a structure ID and optional structure file into the structures attribute.

Parameters:
  • pdb_id (str) – PDB ID
  • mapped_chains (str, list) – Chain ID or list of IDs which you are interested in
  • pdb_file (str) – Path to PDB file
  • file_type (str) – Type of PDB file
  • is_experimental (bool) – If this structure file is experimental
  • set_as_representative (bool) – If this structure should be set as the representative structure
  • representative_chain (str) – If set_as_representative is True, provide the representative chain ID
  • force_rerun (bool) – If the PDB should be reloaded if it is already in the list of structures
Returns:

The object that is now contained in the structures attribute

Return type:

PDBProp

load_uniprot(uniprot_id, uniprot_seq_file=None, uniprot_xml_file=None, download=False, outdir=None, set_as_representative=False, force_rerun=False)[source]

Load a UniProt ID and associated sequence/metadata files into the sequences attribute.

Sequence and metadata files can be provided, or alternatively downloaded with the download flag set to True. Metadata files will be downloaded as XML files.

Parameters:
  • uniprot_id (str) – UniProt ID/ACC
  • uniprot_seq_file (str) – Path to FASTA file
  • uniprot_xml_file (str) – Path to UniProt XML file
  • download (bool) – If sequence and metadata files should be downloaded
  • outdir (str) – Output directory for sequence and metadata files
  • set_as_representative (bool) – If this sequence should be set as the representative one
  • force_rerun (bool) – If files should be redownloaded and metadata reloaded
Returns:

Sequence that was loaded into the sequences attribute

Return type:

UniProtProp

map_seqprop_resnums_to_structprop_resnums(resnums, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Map a residue number in any SeqProp to the structure’s residue number for a specified chain.

Parameters:
  • resnums (int, list) – Residue numbers in the sequence
  • seqprop (SeqProp) – SeqProp object
  • structprop (StructProp) – StructProp object
  • chain_id (str) – Chain ID to map to
  • use_representatives (bool) – If the representative sequence and structure should be used. If True, seqprop, structprop, and chain_id do not need to be defined.
Returns:

Mapping of sequence residue numbers to structure residue numbers

Return type:

dict
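The idea behind mapping sequence residue numbers to structure residue numbers can be sketched by walking a gapped pairwise alignment column by column. This is an illustration under stated assumptions, not ssbio's implementation: the function map_resnums, the toy aligned strings, and the structure residue numbers are all made up.

```python
def map_resnums(aligned_seq, aligned_struct, struct_resnums):
    """Map 1-based sequence positions to structure residue numbers.

    aligned_seq / aligned_struct are equal-length strings with '-' for gaps;
    struct_resnums lists the residue number of each non-gap structure residue.
    """
    mapping = {}
    seq_pos = 0
    struct_idx = 0
    for seq_char, struct_char in zip(aligned_seq, aligned_struct):
        if seq_char != "-":
            seq_pos += 1
        if struct_char != "-":
            if seq_char != "-":
                # Both sides have a residue in this column
                mapping[seq_pos] = struct_resnums[struct_idx]
            struct_idx += 1
    return mapping


# Sequence residue 1 (M) is unresolved; the structure has an extra G
mapping = map_resnums("MKT-AY", "-KTGAY", [10, 11, 12, 13, 14])
```

Sequence positions that fall opposite a gap in the structure are simply absent from the result.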

map_structprop_resnums_to_seqprop_resnums(resnums, structprop=None, chain_id=None, seqprop=None, use_representatives=False)[source]

Map a residue number in any StructProp + chain ID to any SeqProp’s residue number.

Parameters:
  • resnums (int, list) – Residue numbers in the structure
  • structprop (StructProp) – StructProp object
  • chain_id (str) – Chain ID to map from
  • seqprop (SeqProp) – SeqProp object
  • use_representatives (bool) – If the representative sequence and structure should be used. If True, seqprop, structprop, and chain_id do not need to be defined.
Returns:

Mapping of structure residue numbers to sequence residue numbers

Return type:

dict

map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]

Map the representative sequence’s UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to the protein sequences folder.

The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html and returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the coverage is the same, by resolution.

Parameters:
  • seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
  • outdir (str) – Output directory to cache JSON results of search
  • force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns:

A rank-ordered list of PDBProp objects that map to the UniProt ID

Return type:

list
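The ordering described above (coverage first, then resolution as the tie-breaker) can be sketched with a two-part sort key. The PDB IDs, field names, and values below are made-up placeholders, not real API output.

```python
# Hypothetical hits for one UniProt accession
hits = [
    {"pdb_id": "1abc", "coverage": 0.95, "resolution": 2.5},
    {"pdb_id": "2xyz", "coverage": 0.95, "resolution": 1.8},
    {"pdb_id": "3def", "coverage": 0.60, "resolution": 1.2},
]

# Higher coverage first; for equal coverage, lower (better) resolution first
ranked = sorted(hits, key=lambda hit: (-hit["coverage"], hit["resolution"]))
ranked_ids = [hit["pdb_id"] for hit in ranked]
```
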

num_sequences

int – Return the total number of sequences

num_structures

int – Return the total number of structures

num_structures_experimental

int – Return the total number of experimental structures

num_structures_homology

int – Return the total number of homology models

pairwise_align_sequences_to_representative(gapopen=10, gapextend=0.5, outdir=None, engine='needle', parse=True, force_rerun=False)[source]

Pairwise align all sequences in the sequences attribute to the representative sequence. Stores the alignments in the sequence_alignments DictList attribute.

Parameters:
  • gapopen (int) – Only for engine='needle' - Gap open penalty is the score taken away when a gap is created
  • gapextend (float) – Only for engine='needle' - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outdir (str) – Only for engine='needle' - Path to output directory. Default is the protein sequence directory.
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
  • force_rerun (bool) – Only for engine='needle' - Default False, set to True if you want to rerun the alignment if outfile exists.
pairwise_align_sequences_to_representative_parallelize(sc, gapopen=10, gapextend=0.5, outdir=None, engine='needle', parse=True, force_rerun=False)[source]

Pairwise align all sequences in the sequences attribute to the representative sequence, in parallel. Stores the alignments in the sequence_alignments DictList attribute.

Parameters:
  • sc (SparkContext) – Configured spark context for parallelization
  • gapopen (int) – Only for engine='needle' - Gap open penalty is the score taken away when a gap is created
  • gapextend (float) – Only for engine='needle' - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outdir (str) – Only for engine='needle' - Path to output directory. Default is the protein sequence directory.
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
  • force_rerun (bool) – Only for engine='needle' - Default False, set to True if you want to rerun the alignment if outfile exists.
parse_all_stored_structures(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Runs parse_structure for any stored structure with a file available

pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to the protein structures directory.

Parameters:
  • outdir (str) – Path to output directory, if protein structures directory not set or other output directory is desired
  • pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
  • force_rerun (bool) – If files should be re-downloaded if they already exist
Returns:

List of PDB IDs that were downloaded

Return type:

list

Todo

  • Parse mmtf or PDB file for header information, rather than always getting the cif file for header info
pdb_file_type = None

str – pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz - choose a file type for files downloaded from the PDB

prep_itasser_modeling(itasser_installation, itlib_folder, runtype, create_in_dir=None, execute_from_dir=None, print_exec=False, **kwargs)[source]

Prepare to run I-TASSER homology modeling for the representative sequence.

Parameters:
  • itasser_installation (str) – Path to I-TASSER folder, e.g. ~/software/I-TASSER4.4
  • itlib_folder (str) – Path to ITLIB folder, e.g. ~/software/ITLIB
  • runtype (str) – How you will be running I-TASSER - local, slurm, or torque
  • create_in_dir (str) – Local directory where folders will be created
  • execute_from_dir (str) – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
  • all_genes (bool) – If all genes should be prepped, or only those without any mapped structures
  • print_exec (bool) – If the execution statement for running the modeling should be printed

Todo

  • Document kwargs - extra options for I-TASSER, SLURM or Torque execution
  • Allow modeling of any sequence in sequences attribute, select by ID or provide SeqProp?
protein_dir

str – Protein folder

protein_statistics

Get a dictionary of basic statistics describing this protein

representative_chain = None

str – Chain ID in the representative structure which best represents a sequence

representative_chain_seq_coverage = None

float – Percent identity of sequence coverage for the representative chain

representative_sequence = None

SeqProp – Sequence set to represent this protein

representative_structure = None

StructProp – Structure set to represent this protein, usually in monomeric form

root_dir

str – Path to where the folder named by this protein’s ID will be created. Default is current working directory.

sequence_alignments = None

DictList – Pairwise or multiple sequence alignments stored as Bio.Align.MultipleSeqAlignment objects

sequence_dir

str – Directory where sequence related files are stored

sequence_mutation_summary(alignment_ids=None, alignment_type=None)[source]

Summarize all mutations found in the sequence_alignments attribute.

Returns 2 dictionaries, single_counter and fingerprint_counter.

single_counter:

Dictionary of {point mutation: list of genes/strains}. Example:

{
    ('A', 24, 'V'): ['Strain1', 'Strain2', 'Strain4'],
    ('R', 33, 'T'): ['Strain2']
}

Here, we report which genes/strains have the single point mutation.

fingerprint_counter:

Dictionary of {mutation group: list of genes/strains}. Example:

{
    (('A', 24, 'V'), ('R', 33, 'T')): ['Strain2'],
    (('A', 24, 'V'),): ['Strain1', 'Strain4']
}

Here, we report which genes/strains have the specific combinations (or “fingerprints”) of point mutations.

Parameters:
  • alignment_ids (str, list) – Specified alignment ID or IDs to use
  • alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object, seqalign or structalign are the current types.
Returns:

single_counter, fingerprint_counter

Return type:

dict, dict
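The two dictionaries described above can be built from per-strain mutation lists as in the following sketch. It is a minimal illustration that assumes mutations are already available as (wild type, position, mutant) tuples; the function summarize_mutations and the input layout are hypothetical and do not reflect how ssbio parses its alignments.

```python
from collections import defaultdict


def summarize_mutations(mutations_by_strain):
    """Return (single_counter, fingerprint_counter) for per-strain mutation lists."""
    single_counter = defaultdict(list)
    fingerprint_counter = defaultdict(list)
    for strain, mutations in mutations_by_strain.items():
        if not mutations:
            continue
        # A fingerprint is the full, position-sorted set of mutations in one strain
        fingerprint = tuple(sorted(mutations, key=lambda m: m[1]))
        fingerprint_counter[fingerprint].append(strain)
        for mutation in fingerprint:
            single_counter[mutation].append(strain)
    return dict(single_counter), dict(fingerprint_counter)


single_counter, fingerprint_counter = summarize_mutations({
    "Strain1": [("A", 24, "V")],
    "Strain2": [("A", 24, "V"), ("R", 33, "T")],
    "Strain4": [("A", 24, "V")],
})
```

Note that a fingerprint holding a single mutation is still a tuple of tuples, so it is written with a trailing comma: (('A', 24, 'V'),).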

sequences = None

DictList – Stored protein sequences which are related to this protein

set_representative_sequence(force_rerun=False)[source]

Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.

Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.

Parameters:force_rerun (bool) – Set to True to recheck stored sequences
Returns:Which sequence was set as representative
Return type:SeqProp
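
The precedence rules above can be sketched as a small decision function. This is an illustration only: the function pick_representative, the dictionary layout, and the "pdbs" key are hypothetical stand-ins for the real SeqProp objects.

```python
def pick_representative(manual=None, uniprot=None, kegg=None):
    """Apply the stated precedence: manual > UniProt > KEGG, with one exception."""
    if manual is not None:
        return manual
    if uniprot is not None and kegg is not None:
        # Exception: KEGG wins when it has mapped PDBs and UniProt has none
        if kegg["pdbs"] and not uniprot["pdbs"]:
            return kegg
        return uniprot
    return uniprot if uniprot is not None else kegg


uniprot_seq = {"source": "uniprot", "pdbs": []}
kegg_seq = {"source": "kegg", "pdbs": ["5T4Q"]}
chosen = pick_representative(uniprot=uniprot_seq, kegg=kegg_seq)
```
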
set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, clean=True, keep_chemicals=None, skip_large_structures=False, force_rerun=False)[source]

Set a representative structure from a structure in the structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.
  • Homology model(s)
  • Ranked PDBs
  • BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files, must be set if Protein directory was not created initially
  • struct_outdir (str) – Path to output directory of structure files, must be set if Protein directory was not created initially
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • clean (bool) – If structure should be cleaned
  • keep_chemicals (str, list) – Keep specified chemical names if structure is to be cleaned
  • skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
  • force_rerun (bool) – If sequence to structure alignment should be rerun
Returns:

Representative structure from the list of structures. This is not a map to the original structure; it is copied and optionally cleaned from the original one.

Return type:

StructProp

Todo

  • Remedy large structure representative setting
structure_alignments = None

DictList – Pairwise or multiple structure alignments - currently a placeholder

structure_dir

str – Directory where structure related files are stored

structures = None

DictList – Stored protein structures which are related to this protein

write_all_sequences_file(outname, outdir=None)[source]

Write all the stored sequences as a single FASTA file. By default, sets IDs to model gene IDs.

Parameters:
  • outname (str) – Name of the output FASTA file without the extension
  • outdir (str) – Path to output directory for the file, default is the sequences directory
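
Writing several sequences to one FASTA file can be sketched with the standard library alone. The helper write_fasta, the .fasta extension, the line width, and the toy IDs and sequences are assumptions for illustration and do not mirror the ssbio implementation.

```python
import os
import tempfile


def write_fasta(records, outname, outdir, width=60):
    """Write {sequence_id: sequence} pairs to <outdir>/<outname>.fasta."""
    path = os.path.join(outdir, outname + ".fasta")
    with open(path, "w") as handle:
        for seq_id, seq in records.items():
            handle.write(">{}\n".format(seq_id))
            # Wrap long sequences at a fixed line width
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
    return path


# Placeholder IDs and sequences
sequences = {"gene1": "MKTAYIAK", "gene2": "MQLNS"}
fasta_path = write_fasta(sequences, "all_sequences", tempfile.mkdtemp())
```
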

Further reading

For examples in which tools from the Protein class have been used for analysis, please see the following:

[1]Broddrick JT, Rubin BE, Welkie DG, Du N, Mih N, Diamond S, et al. Unique attributes of cyanobacterial metabolism revealed by improved genome-scale metabolic modeling and essential gene analysis. Proc Natl Acad Sci U S A. 2016;113: E8344–E8353. doi:10.1073/pnas.1613446113
[2]Mih N, Brunk E, Bordbar A, Palsson BO. A Multi-scale Computational Platform to Mechanistically Assess the Effect of Genetic Variation on Drug Responses in Human Erythrocyte Metabolism. PLoS Comput Biol. 2016;12: e1005039. doi:10.1371/journal.pcbi.1005039

The StructProp Class

Introduction

This section will give an overview of the methods that can be executed for a single protein structure.

Tutorials

PDBProp - Working With a Single PDB Structure

This notebook gives a tutorial of the PDBProp object, specifically how chains are handled and how to map a sequence to it.

Input: PDB ID
Output: PDBProp object
Imports
In [ ]:
from ssbio.databases.pdb import PDBProp
from ssbio.databases.uniprot import UniProtProp
In [ ]:
import sys
import logging
In [ ]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)  # SET YOUR LOGGING LEVEL HERE #
In [ ]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Basic methods
In [ ]:
my_structure = PDBProp(ident='5T4Q', description='E. coli ATP synthase')
Download the structure

Downloading will:

  • Download the file type of choice to the specified output directory
  • Parse the PDB header file to fill out the metadata fields

In [ ]:
import tempfile
my_structure.download_structure_file(outdir=tempfile.gettempdir(), file_type='mmtf')
View all attributes
In [ ]:
my_structure.get_dict()
Set chains that we are interested in (if any)

The mapped_chains attribute allows us to limit sequence analyses to specified chains (see the later section where we align a sequence to this structure). In this example, ATP synthase is a complex of several protein chains, so if we are interested in the product of a specific gene, we can set the chains that correspond to it.

In [ ]:
# Chains A, B, and C make up ATP synthase subunit alpha - from the gene b3734 (UniProt ID P0ABB0)
my_structure.add_mapped_chain_ids(['A', 'B', 'C'])
Parse the structure to work with the Biopython Structure object

Parsing the structure will parse the sequences of each chain, and store those in the chains attribute. It will also return a Biopython Structure object which opens up all methods available for structures in Biopython.

In [ ]:
parsed_structure = my_structure.parse_structure()
print(type(parsed_structure.structure))
print(type(parsed_structure.first_model))
Clean the structure and save the structure

Cleaning a structure does the following:

  • Add missing chain identifiers to a PDB file
  • Select a single chain if noted
  • Remove alternate atom locations
  • Add atom occupancies
  • Add B (temperature) factors (default Biopython behavior)

In the example below, we will clean the structure so it only includes our mapped chains.

In [ ]:
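# Write to the same temporary directory used for the download above (tempfile was imported there)
cleaned_structure = my_structure.clean_structure(outdir=tempfile.gettempdir(), keep_chains=my_structure.mapped_chains, force_rerun=True)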
cleaned_structure
Viewing the structure
In [ ]:
# The original structure
my_structure.view_structure(recolor=False)
In [ ]:
# The cleaned structure
import nglview
nglview.show_structure_file(cleaned_structure)
FATCAT - Structure Similarity

This notebook shows how to run and parse FATCAT, a structural similarity calculator.

In [ ]:
import ssbio.protein.structure.properties.fatcat as fatcat
In [ ]:
import os
import os.path as op
import tempfile

ROOT_DIR = tempfile.gettempdir()
OUT_DIR = op.join(ROOT_DIR, 'fatcat_testing')
if not op.exists(OUT_DIR):
    os.mkdir(OUT_DIR)
FATCAT_SH = 'fatcat'
Pairwise
In [ ]:
fatcat_outfile = fatcat.run_fatcat(structure_path_1='../../ssbio/test/test_files/structures/12as-A_clean.pdb',
                                   structure_path_2='../../ssbio/test/test_files/structures/1a9x-A_clean.pdb',
                                   outdir=OUT_DIR,
                                   fatcat_sh=FATCAT_SH, print_cmd=True, force_rerun=True)
print('Output file:', fatcat_outfile)
In [ ]:
fatcat.parse_fatcat(fatcat_outfile)
All-by-all
In [ ]:
structs = ['../../ssbio/test/test_files/structures/12as-A_clean.pdb',
           '../../ssbio/test/test_files/structures/1af6-A_clean.pdb',
           '../../ssbio/test/test_files/structures/1a9x-A_clean.pdb']
In [ ]:
tm_scores = fatcat.run_fatcat_all_by_all(structs, fatcat_sh=FATCAT_SH, outdir=OUT_DIR)
tm_scores
In [ ]:
%matplotlib inline
import seaborn as sns
sns.heatmap(tm_scores)
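
An all-by-all comparison scores every pair of inputs and collects the results in a square matrix. The sketch below shows that pattern with a hypothetical stand-in scoring function instead of FATCAT, and it assumes the score is symmetric, which may not hold for every structural similarity measure.

```python
from itertools import combinations


def all_by_all(items, score):
    """Fill a nested-dict matrix with score(a, b) for every pair of items."""
    matrix = {a: {b: None for b in items} for a in items}
    for a in items:
        matrix[a][a] = score(a, a)
    for a, b in combinations(items, 2):
        # Assumption: score(a, b) == score(b, a), so each pair is computed once
        matrix[a][b] = matrix[b][a] = score(a, b)
    return matrix


def length_ratio(a, b):
    """Stand-in score: ratio of the shorter to the longer string length."""
    return min(len(a), len(b)) / max(len(a), len(b))


matrix = all_by_all(["AAAA", "AA", "A"], length_ratio)
```

With n inputs this runs n(n-1)/2 pairwise comparisons, which is why the all-by-all call can take much longer than a single pairwise run.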

Available functions

Sequence & structure-based predictions

| Function | Description | Internal Python class used and functions provided | External software to install | Web server | Alternate external software to install |
|---|---|---|---|---|---|
| Homology modeling | Preparation scripts and parsers for executing homology modeling algorithms | | I-TASSER | | |
| Transmembrane orientation | Prediction of transmembrane domains and orientation in a membrane | opm module | | OPM | |
| Kinetic folding rate | Prediction of protein folding rates from amino acid sequence | kinetic_folding_rate module | | FOLD-RATE | |

Structure-based calculations or functions

| Function | Description | Internal Python class used and functions provided | External software to install | Web server | Alternate external software to install |
|---|---|---|---|---|---|
| Secondary structure | Calculations of secondary structure | | DSSP | | STRIDE |
| Solvent accessibilities | Calculations of per-residue absolute and relative solvent accessibilities | | DSSP | | FreeSASA |
| Residue depths | Calculations of residue depths | | MSMS | | |
| Structural similarity | Pairwise calculations of 3D structural similarity | fatcat module | FATCAT | | |
| Various structure properties | Basic properties of the structure, such as distance measurements between residues or number of disulfide bridges | | | | |
| Quality | Custom functions to allow ranking of structures by percent identity to a defined sequence, structure resolution, and other structure quality metrics | set_representative_structure function | | | |
| Structure cleaning, mutating | Custom functions to allow for the preparation of structure files for molecular modeling, with options to remove hydrogens/waters/heteroatoms, select specific chains, or mutate specific residues | | AmberTools | | |

API

StructProp
class ssbio.protein.structure.structprop.StructProp(ident, description=None, chains=None, mapped_chains=None, is_experimental=False, structure_path=None, file_type=None)[source]

Generic class to represent information for a protein structure.

The main utilities of this class are to:

  • Provide access to the 3D coordinates using a Biopython Structure object through the method parse_structure.
  • Run predictions and computations on the structure
  • Analyze specific chains using the mapped_chains attribute
  • Provide wrapper methods to nglview to view the structure in a Jupyter notebook
Parameters:
  • ident (str) – Unique identifier for this structure
  • description (str) – Optional human-readable description
  • chains (str, list) – Chain ID or list of IDs
  • mapped_chains (str, list) – A chain ID or IDs to indicate what chains should be analyzed
  • is_experimental (bool) – Flag to indicate if structure is an experimental or computational model
  • structure_path (str) – Path to structure file
  • file_type (str) – Type of structure file - pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz
add_chain_ids(chains)[source]

Add chains by ID into the chains attribute

Parameters:chains (str, list) – Chain ID or list of IDs
add_mapped_chain_ids(mapped_chains)[source]

Add chains by ID into the mapped_chains attribute

Parameters:mapped_chains (str, list) – Chain ID or list of IDs
add_residues_highlight_to_nglview(view, structure_resnums, chain=None, res_color='red')[source]

Add a residue number or numbers to an NGLWidget view object.

Parameters:
  • view (NGLWidget) – NGLWidget view object
  • structure_resnums (int, list) – Residue number(s) to highlight, structure numbering
  • chain (str, list) – Chain ID or IDs that the residues belong to. If not provided, all chains in the mapped_chains attribute will be used. If that is also empty, an exception is raised.
  • res_color (str) – Color to highlight residues with
add_scaled_residues_highlight_to_nglview(view, structure_resnums, chain=None, color='red', unique_colors=False, opacity_range=(0.5, 1), scale_range=(0.7, 10))[source]
Add a list of residue numbers (which may contain repeating residues) to a view, or add a dictionary of
residue numbers to counts. Size and opacity of added residues are scaled by counts.
Parameters:
  • view (NGLWidget) – NGLWidget view object
  • structure_resnums (int, list, dict) – Residue number(s) to highlight, or a dictionary of residue number to frequency count
  • chain (str, list) – Chain ID or IDs that the residues belong to. If not provided, all chains in the mapped_chains attribute will be used. If that is also empty, an exception is raised.
  • color (str) – Color to highlight residues with
  • unique_colors (bool) – If each mutation should be colored uniquely (will override color argument)
  • opacity_range (tuple) – Min/max opacity values (residues that have higher frequency counts will be opaque)
  • scale_range (tuple) – Min/max size values (residues that have higher frequency counts will be bigger)
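
As a rough illustration of how counts could be turned into display values, the sketch below maps each residue's frequency onto the opacity and scale ranges by linear interpolation. Linear scaling is an assumption made for this example, and `scale_by_count` is a hypothetical helper, not the ssbio implementation.

```python
from collections import Counter


def scale_by_count(resnums, opacity_range=(0.5, 1), scale_range=(0.7, 10)):
    """Map residue numbers to (opacity, scale) by how often each one occurs."""
    # Accept either a dict of residue -> count or a list with repeats
    counts = dict(resnums) if isinstance(resnums, dict) else Counter(resnums)
    lo, hi = min(counts.values()), max(counts.values())

    def interpolate(count, out_min, out_max):
        # All counts equal: avoid dividing by zero and use the maximum value
        if hi == lo:
            return out_max
        return out_min + (count - lo) / (hi - lo) * (out_max - out_min)

    return {res: (interpolate(c, *opacity_range), interpolate(c, *scale_range))
            for res, c in counts.items()}


# Residue 10 appears three times, 40 twice, 25 once
scaled = scale_by_count([10, 10, 10, 25, 40, 40])
```

The most frequent residue receives the top of both ranges and the least frequent receives the bottom, so frequently mutated positions stand out in the view.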
chains = None

DictList – A DictList of chains that have their sequences stored in them, along with residue-specific

clean_structure(out_suffix='_clean', outdir=None, force_rerun=False, remove_atom_alt=True, keep_atom_alt_id='A', remove_atom_hydrogen=True, add_atom_occ=True, remove_res_hetero=True, keep_chemicals=None, keep_res_only=None, add_chain_id_if_empty='X', keep_chains=None)[source]

Clean the structure file associated with this structure, and save it as a new file. Returns the file path.

Parameters:
  • out_suffix (str) – Suffix to append to original filename
  • outdir (str) – Path to output directory
  • force_rerun (bool) – If structure should be re-cleaned if a clean file exists already
  • remove_atom_alt (bool) – Remove alternate positions
  • keep_atom_alt_id (str) – If removing alternate positions, which alternate ID to keep
  • remove_atom_hydrogen (bool) – Remove hydrogen atoms
  • add_atom_occ (bool) – Add atom occupancy fields if not present
  • remove_res_hetero (bool) – Remove all HETATMs
  • keep_chemicals (str, list) – If removing HETATMs, keep specified chemical names
  • keep_res_only (str, list) – Keep ONLY specified resnames, deletes everything else!
  • add_chain_id_if_empty (str) – Add a chain ID if not present
  • keep_chains (str, list) – Keep only these chains
Returns:

Path to cleaned PDB file

Return type:

str
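
To make the cleaning options more concrete, here is a minimal sketch of the kind of filtering they describe, applied to fixed-width PDB coordinate records. It assumes the standard PDB column layout (alternate location in column 17, chain ID in column 22, element symbol in columns 77-78) and that the element column is filled in. Both function names are hypothetical, and this is not the ssbio implementation.

```python
def make_atom_line(record, serial, atom, altloc, resname, chain, resseq, element):
    """Format one fixed-width PDB coordinate record (coordinates are dummy zeros)."""
    return ('{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}    '
            '{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2}'
            ).format(record, serial, atom, altloc, resname, chain, resseq,
                     0.0, 0.0, 0.0, 1.0, 0.0, element)


def clean_pdb_lines(lines, keep_alt_id='A', remove_hydrogen=True,
                    remove_hetero=True, keep_chains=None):
    """Filter PDB coordinate records; lines of other record types pass through."""
    cleaned = []
    for line in lines:
        record = line[:6].strip()
        if record not in ('ATOM', 'HETATM'):
            cleaned.append(line)
            continue
        altloc, chain, element = line[16], line[21], line[76:78].strip()
        if remove_hetero and record == 'HETATM':
            continue
        if remove_hydrogen and element == 'H':
            continue
        if altloc not in (' ', keep_alt_id):
            continue
        if keep_chains is not None and chain not in keep_chains:
            continue
        # Blank the alternate-location column on the records that are kept
        cleaned.append(line[:16] + ' ' + line[17:])
    return cleaned


# Made-up records: one alternate position, one hydrogen, one water, one extra chain
pdb_lines = [
    make_atom_line('ATOM', 1, 'N', ' ', 'MET', 'A', 1, 'N'),
    make_atom_line('ATOM', 2, 'CA', 'A', 'MET', 'A', 1, 'C'),
    make_atom_line('ATOM', 3, 'CA', 'B', 'MET', 'A', 1, 'C'),
    make_atom_line('ATOM', 4, 'H', ' ', 'MET', 'A', 1, 'H'),
    make_atom_line('HETATM', 5, 'O', ' ', 'HOH', 'A', 101, 'O'),
    make_atom_line('ATOM', 6, 'N', ' ', 'GLY', 'B', 1, 'N'),
]
kept = clean_pdb_lines(pdb_lines, keep_chains=['A'])
```

Only the first two records survive: the backbone nitrogen and the alternate position labeled A, whose alternate-location flag is blanked.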

file_type = None

str – Type of structure file

find_disulfide_bridges(threshold=3.0)[source]

Run Biopython’s search_ss_bonds to find potential disulfide bridges for each chain and store in ChainProp.
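
The idea behind a distance threshold can be sketched without any structure parser: two cysteines are flagged as a potential bridge when their sulfur (SG) atoms lie within the cutoff. The example below assumes the threshold is a distance in angstroms between SG atoms, uses made-up coordinates, and introduces the hypothetical helper `find_close_sg_pairs`. It does not reproduce Biopython's own search.

```python
import math
from itertools import combinations


def find_close_sg_pairs(sg_coords, threshold=3.0):
    """Return pairs of residue numbers whose SG atoms lie within threshold angstroms."""
    pairs = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(sg_coords.items()), 2):
        if math.dist(xyz_a, xyz_b) <= threshold:
            pairs.append((res_a, res_b))
    return pairs


# Made-up coordinates for the SG (sulfur) atoms of three cysteines
sg_atoms = {
    22: (10.0, 10.0, 10.0),
    87: (11.2, 11.2, 11.1),   # roughly 2.0 angstroms from residue 22
    140: (30.0, 5.0, 12.0),   # far from both
}
bridges = find_close_sg_pairs(sg_atoms)
```

`math.dist` requires Python 3.8 or later.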

get_dict_with_chain(chain, only_keys=None, chain_keys=None, exclude_attributes=None, df_format=False)[source]

get_dict method which incorporates attributes found in a specific chain. Does not overwrite any attributes in the original StructProp.

Parameters:
  • chain
  • only_keys
  • chain_keys
  • exclude_attributes
  • df_format
Returns:

attributes of StructProp + the chain specified

Return type:

dict

get_dssp_annotations(outdir, force_rerun=False)[source]

Run DSSP on this structure and store the DSSP annotations in the corresponding ChainProp SeqRecords

Calculations are stored in the ChainProp’s letter_annotations at the following keys:

  • SS-dssp
  • RSA-dssp
  • ASA-dssp
  • PHI-dssp
  • PSI-dssp
Parameters:
  • outdir (str) – Path to where DSSP dataframe will be stored.
  • force_rerun (bool) – If DSSP results should be recalculated

Todo

  • Also parse global properties, like total accessible surface area. Don’t think Biopython parses those?
get_freesasa_annotations(outdir, include_hetatms=False, force_rerun=False)[source]

Run freesasa on this structure and store the calculated properties in the corresponding ChainProps

get_msms_annotations(outdir, force_rerun=False)[source]

Run MSMS on this structure and store the residue depths/ca depths in the corresponding ChainProp SeqRecords

get_structure_seqs(model)[source]

Gather chain sequences and store in their corresponding ChainProp objects in the chains attribute.

Parameters:model (Model) – Biopython Model object of the structure you would like to parse
is_experimental = None

bool – Flag to note if this structure is an experimental model or a homology model

load_structure_path(structure_path, file_type)[source]

Load a structure file and provide pointers to its location

Parameters:
  • structure_path (str) – Path to structure file
  • file_type (str) – Type of structure file
mapped_chains = None

list – A simple list of chain IDs (strings) that will be used to subset analyses

parse_structure(store_in_memory=False)[source]

Read the 3D coordinates of a structure file and return it as a Biopython Structure object. Also create ChainProp objects in the chains attribute for each chain in the first model.

Parameters:store_in_memory (bool) – If the Biopython Structure object should be stored in the attribute structure.
Returns:Biopython Structure object
Return type:Structure
parsed = None

bool – Simple flag to track if this structure has had its structure + chain sequences parsed

structure = None

Structure – Biopython Structure object, only used if store_in_memory option of parse_structure is set to True

structure_file = None

str – Name of the structure file

view_structure(only_chains=None, opacity=1.0, recolor=False, gui=False)[source]

Use NGLviewer to display a structure in a Jupyter notebook

Parameters:
  • only_chains (str, list) – Chain ID or IDs to display
  • opacity (float) – Opacity of the structure
  • recolor (bool) – If structure should be cleaned and recolored to silver
  • gui (bool) – If the NGLview GUI should show up
Returns:

NGLviewer object

The SeqProp Class

Introduction

This section will give an overview of the methods that can be executed for a single protein sequence.

Tutorials

SeqProp - Protein Sequence Properties

This notebook gives an overview of the available calculations for properties of a single protein sequence.

Input: Amino acid sequence
Output: Amino acid sequence properties

Note

See ssbio.protein.sequence.seqprop.SeqProp for a description of all the available attributes and functions.

Imports
In [ ]:
import sys
import logging
import os.path as op
In [ ]:
# Import the SeqProp class
from ssbio.protein.sequence.seqprop import SeqProp
In [ ]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. DEBUG is the most verbose level.

  • CRITICAL
    • Only really important messages shown
  • ERROR
    • Major errors
  • WARNING
    • Warnings that don’t affect running of the pipeline
  • INFO (default)
    • Info such as the number of structures mapped per gene
  • DEBUG
    • Really detailed information that will print out a lot of stuff
In [ ]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
In [ ]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Initialization of the project

Set these two things:

  • PROTEIN_ID
    • Your protein ID
  • PROTEIN_SEQ
    • Your protein sequence
In [ ]:
# SET IDS HERE
PROTEIN_ID = 'YIAJ_ECOLI'
PROTEIN_SEQ = 'MGKEVMGKKENEMAQEKERPAGSQSLFRGLMLIEILSNYPNGCPLAHLSELAGLNKSTVHRLLQGLQSCGYVTTAPAAGSYRLTTKFIAVGQKALSSLNIIHIAAPHLEALNIATGETINFSSREDDHAILIYKLEPTTGMLRTRAYIGQHMPLYCSAMGKIYMAFGHPDYVKSYWESHQHEIQPLTRNTITELPAMFDELAHIRESGAAMDREENELGVSCIAVPVFDIHGRVPYAVSISLSTSRLKQVGEKNLLKPLRETAQAISNELGFTVRDDLGAIT'
In [ ]:
# Create the SeqProp object
my_seq = SeqProp(id=PROTEIN_ID, seq=PROTEIN_SEQ)
SeqProp.write_fasta_file(outfile, force_rerun=False)[source]

Write a FASTA file for the protein sequence; seq will then load directly from this file.

Parameters:
  • outfile (str) – Path to new FASTA file to be written to
  • force_rerun (bool) – If an existing file should be overwritten
In [ ]:
# Write temporary FASTA file for property calculations that require FASTA file as input
import tempfile
ROOT_DIR = tempfile.gettempdir()

my_seq.write_fasta_file(outfile=op.join(ROOT_DIR, 'tmp.fasta'), force_rerun=True)
my_seq.sequence_path
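
The force_rerun behavior described above amounts to "reuse the file if it is already there". A minimal standard-library sketch of that pattern follows. `write_fasta` is a hypothetical helper written for this example and is not the SeqProp method itself; the 60-character line width is likewise an assumption.

```python
import os
import tempfile


def write_fasta(seq_id, seq, outfile, force_rerun=False, width=60):
    """Write a one-record FASTA file; skip writing if it exists and force_rerun is False."""
    if os.path.exists(outfile) and not force_rerun:
        return outfile
    with open(outfile, 'w') as handle:
        handle.write('>{}\n'.format(seq_id))
        # Wrap the sequence at a fixed line width
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + '\n')
    return outfile


fasta_path = os.path.join(tempfile.mkdtemp(), 'tmp.fasta')
write_fasta('YIAJ_ECOLI', 'MGKEVMGKKENEMAQEKERPAGSQSLFRGL', fasta_path, force_rerun=True)
```

Calling the function again with force_rerun=False leaves the existing file untouched, which avoids repeating work across notebook runs.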
Computing and storing protein properties

A SeqProp object is simply an extension of the Biopython SeqRecord object. Global properties which describe or summarize the entire protein sequence are stored in the annotations attribute, while local residue-specific properties are stored in the letter_annotations attribute.
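
The split between global and per-residue properties can be illustrated with a toy container. The class below is a hypothetical stand-in written for this example, not SeqProp or SeqRecord; it only mimics the rule that per-residue annotations must have exactly one value per residue.

```python
class AnnotatedSequence:
    """Toy container mirroring the global vs. per-residue split (not SeqProp itself)."""

    def __init__(self, seq):
        self.seq = seq
        self.annotations = {}          # one value per sequence
        self.letter_annotations = {}   # one value per residue

    def add_letter_annotation(self, key, values):
        # Per-residue data must line up with the sequence, position by position
        if len(values) != len(self.seq):
            raise ValueError('expected {} values, got {}'.format(len(self.seq), len(values)))
        self.letter_annotations[key] = list(values)


toy = AnnotatedSequence('MGKEV')
# Global property: a single number summarizing the whole sequence
toy.annotations['percent_charged'] = 100.0 * sum(aa in 'DEKR' for aa in toy.seq) / len(toy.seq)
# Local property: one flag per residue
toy.add_letter_annotation('is_charged', [aa in 'DEKR' for aa in toy.seq])
```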

Basic global properties
SeqProp.get_biopython_pepstats()[source]

Run Biopython’s built-in ProteinAnalysis module and store statistics in the annotations attribute.

In [ ]:
# Global properties using the Biopython ProteinAnalysis module
my_seq.get_biopython_pepstats()
{k:v for k,v in my_seq.annotations.items() if k.endswith('-biop')}
SeqProp.get_emboss_pepstats()[source]

Run the EMBOSS pepstats program on the protein sequence.

Stores statistics in the annotations attribute. Saves a .pepstats file of the results where the sequence file is located.

In [ ]:
# Global properties from the EMBOSS pepstats program
my_seq.get_emboss_pepstats()
{k:v for k,v in my_seq.annotations.items() if k.endswith('-pepstats')}
SeqProp.get_aggregation_propensity(email, password, cutoff_v=5, cutoff_n=5, run_amylmuts=False, outdir=None)[source]

Run the AMYLPRED2 web server to calculate the aggregation propensity of this protein sequence, which is the number of aggregation-prone segments on the unfolded protein sequence.

Stores statistics in the annotations attribute, under the key aggprop-amylpred.

See ssbio.protein.sequence.properties.aggregation_propensity for instructions and details.

In [ ]:
# Aggregation propensity - the predicted number of aggregation-prone segments on an unfolded protein sequence
my_seq.get_aggregation_propensity(outdir=ROOT_DIR, email='nmih@ucsd.edu', password='ssbiotest', cutoff_v=5, cutoff_n=5, run_amylmuts=False)
{k:v for k,v in my_seq.annotations.items() if k.endswith('-amylpred')}
SeqProp.get_kinetic_folding_rate(secstruct, at_temp=None)[source]

Run the FOLD-RATE web server to calculate the kinetic folding rate given an amino acid sequence and its structural classification (alpha/beta/mixed).

Stores statistics in the annotations attribute, under the key kinetic_folding_rate_<TEMP>-foldrate.

See ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate() for instructions and details.

In [ ]:
# Kinetic folding rate - the predicted rate of folding for this protein sequence
secstruct_class = 'mixed'
my_seq.get_kinetic_folding_rate(secstruct=secstruct_class)
{k:v for k,v in my_seq.annotations.items() if k.endswith('-foldrate')}
SeqProp.get_thermostability(at_temp)[source]

Run the thermostability calculator using either the Dill or Oobatake methods.

Stores calculated (dG, Keq) tuple in the annotations attribute, under the key thermostability_<TEMP>-<METHOD_USED>.

See ssbio.protein.sequence.properties.thermostability.get_dG_at_T() for instructions and details.

In [ ]:
# Thermostability - prediction of free energy of unfolding dG from protein sequence
# Stores (dG, Keq)
my_seq.get_thermostability(at_temp=32.0)
my_seq.get_thermostability(at_temp=37.0)
my_seq.get_thermostability(at_temp=42.0)
{k:v for k,v in my_seq.annotations.items() if k.startswith('thermostability_')}
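
For readers unfamiliar with how a free energy and an equilibrium constant relate, the standard thermodynamic relation is dG = -R*T*ln(K). The sketch below applies it under stated assumptions: dG is given in kcal/mol, the temperature in degrees Celsius, and K is written for the same direction as dG. Whether ssbio uses the same units and sign convention is not confirmed here, and the dG value of 5.0 is made up.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def keq_from_dg(dg_kcal_per_mol, temp_celsius):
    """Equilibrium constant from a free energy change via dG = -R*T*ln(K)."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(-dg_kcal_per_mol / (R_KCAL * temp_kelvin))


# Made-up free energy of 5.0 kcal/mol evaluated at three temperatures
for temp in (32.0, 37.0, 42.0):
    print(temp, keq_from_dg(5.0, temp))
```

A positive dG gives K below 1, and a dG of zero gives K equal to 1.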

Available functions

Sequence-based predictions
Function | Description | Internal Python class used and functions provided | External software to install | Web server | Alternate external software to install
Secondary structure and solvent accessibilities | Predictions of secondary structure and relative solvent accessibilities per residue | scratch module | SCRATCH | - | -
Thermostability | Free energy of unfolding (ΔG), adapted from Oobatake (Oobatake & Ooi 1993) and Dill (Dill et al. 2011) | thermostability module | - | - | -
Transmembrane domains | Prediction of transmembrane domains from sequence | tmhmm module | TMHMM | - | -
Aggregation propensity | Consensus method to predict the aggregation propensity of proteins, specifically the number of aggregation-prone segments on an unfolded protein sequence | aggregation_propensity module | - | AMYLPRED2 | -
Sequence-based calculations
Function | Description | Internal Python class used and functions provided | External software to install | Web server | Alternate external software to install
Various sequence properties | Basic properties of the sequence, such as percent of polar, non-polar, hydrophobic or hydrophilic residues. | - | - | - | EMBOSS pepstats
Sequence alignment | Basic functions to run pairwise or multiple sequence alignments | - | - | - | EMBOSS needle

API

SeqProp
class ssbio.protein.sequence.seqprop.SeqProp(seq, id, name='<unknown name>', description='<unknown description>', sequence_path=None, metadata_path=None, feature_path=None)[source]

Generic class to represent information for a protein sequence.

Extends the Biopython SeqRecord class. The main functionality added is the ability to set and load directly from sequence, metadata, and feature files. Additionally, methods are provided to calculate and store sequence properties in the annotations and letter_annotations fields of a SeqProp. These can then be accessed for a range of residue numbers.

id

str – Unique identifier for this protein sequence

seq

Seq – Protein sequence as a Biopython Seq object

name

str – Optional name for this sequence

description

str – Optional description for this sequence

bigg

str, list – BiGG IDs mapped to this sequence

kegg

str, list – KEGG IDs mapped to this sequence

refseq

str, list – RefSeq IDs mapped to this sequence

uniprot

str, list – UniProt IDs mapped to this sequence

gene_name

str, list – Gene names mapped to this sequence

pdbs

list – PDB IDs mapped to this sequence

go

str, list – GO terms mapped to this sequence

pfam

str, list – PFAMs mapped to this sequence

ec_number

str, list – EC numbers mapped to this sequence

sequence_file

str – FASTA file for this sequence

metadata_file

str – Metadata file (any format) for this sequence

feature_file

str – GFF file for this sequence

features

list – List of protein sequence features, which define regions of the protein

annotations

dict – Annotations of this protein sequence, which summarize global properties

letter_annotations

RestrictedDict – Residue-level annotations, which describe single residue properties

Todo

  • Properly inherit methods from the Object class…
add_point_feature(resnum, feat_type=None, feat_id=None)[source]

Add a feature to the features list describing a single residue.

Parameters:
  • resnum (int) – Protein sequence residue number
  • feat_type (str, optional) – Optional description of the feature type (e.g. ‘catalytic residue’)
  • feat_id (str, optional) – Optional ID of the feature type (e.g. ‘TM1’)
add_region_feature(start_resnum, end_resnum, feat_type=None, feat_id=None)[source]

Add a feature to the features list describing a region of the protein sequence.

Parameters:
  • start_resnum (int) – Start residue number of the protein sequence feature
  • end_resnum (int) – End residue number of the protein sequence feature
  • feat_type (str, optional) – Optional description of the feature type (e.g. ‘binding domain’)
  • feat_id (str, optional) – Optional ID of the feature type (e.g. ‘TM1’)
blast_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]

BLAST this sequence to the PDB

equal_to(seq_prop)[source]

Test if the sequence is equal to another SeqProp’s sequence

Parameters:seq_prop – SeqProp object
Returns:If the sequences are the same
Return type:bool
feature_path_unset()[source]

Copy features to memory and remove the association of the feature file.

features

list – Get the features stored in memory or in the GFF file

get_aggregation_propensity(email, password, cutoff_v=5, cutoff_n=5, run_amylmuts=False, outdir=None)[source]

Run the AMYLPRED2 web server to calculate the aggregation propensity of this protein sequence, which is the number of aggregation-prone segments on the unfolded protein sequence.

Stores statistics in the annotations attribute, under the key aggprop-amylpred.

See ssbio.protein.sequence.properties.aggregation_propensity for instructions and details.

get_biopython_pepstats()[source]

Run Biopython’s built-in ProteinAnalysis module and store statistics in the annotations attribute.

get_dict(only_attributes=None, exclude_attributes=None, df_format=False)[source]

Get a dictionary of this object’s attributes. Optional format for storage in a Pandas DataFrame.

Parameters:
  • only_attributes (str, list) – Attributes that should be returned. If not provided, all are returned.
  • exclude_attributes (str, list) – Attributes that should be excluded.
  • df_format (bool) – If dictionary values should be formatted for a dataframe (everything possible is transformed into strings, int, or float - if something can’t be transformed it is excluded)
Returns:

Dictionary of attributes

Return type:

dict
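
The filtering described by these parameters can be sketched in a few lines. The helper below is hypothetical and written only to illustrate the behavior. In particular, joining list values with a semicolon is an assumption about what "transformed into strings" might look like, not a statement of what ssbio does.

```python
def attributes_to_dict(obj, only_attributes=None, exclude_attributes=None, df_format=False):
    """Collect an object's public attributes, optionally keeping only flat values."""
    def as_list(value):
        if value is None:
            return None
        return [value] if isinstance(value, str) else list(value)

    only, exclude = as_list(only_attributes), as_list(exclude_attributes) or []
    result = {}
    for name, value in vars(obj).items():
        if name.startswith('_') or name in exclude:
            continue
        if only is not None and name not in only:
            continue
        if df_format:
            if isinstance(value, (list, tuple, set)):
                value = ';'.join(str(v) for v in value)  # flatten containers to text
            elif not isinstance(value, (str, int, float)):
                continue  # drop values that cannot be stored in a flat table cell
        result[name] = value
    return result


class Record:
    """Made-up object with a mix of flat and non-flat attributes."""

    def __init__(self):
        self.id = 'P1'
        self.pdbs = ['1abc', '2xyz']  # placeholder IDs, not real entries
        self.seq_len = 100
        self.handle = object()        # something with no flat representation


summary = attributes_to_dict(Record(), exclude_attributes='seq_len', df_format=True)
```

With df_format=True the list is flattened to text and the opaque object is dropped, so each remaining value fits in a single DataFrame cell.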

get_emboss_pepstats()[source]

Run the EMBOSS pepstats program on the protein sequence.

Stores statistics in the annotations attribute. Saves a .pepstats file of the results where the sequence file is located.

get_kinetic_folding_rate(secstruct, at_temp=None)[source]

Run the FOLD-RATE web server to calculate the kinetic folding rate given an amino acid sequence and its structural classification (alpha/beta/mixed).

Stores statistics in the annotations attribute, under the key kinetic_folding_rate_<TEMP>-foldrate.

See ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate() for instructions and details.

get_residue_annotations(start_resnum, end_resnum=None)[source]

Retrieve letter annotations for a residue or a range of residues

Parameters:
  • start_resnum (int) – Residue number
  • end_resnum (int) – Optional residue number, specify if a range is desired
Returns:

Letter annotations for this residue or residues

Return type:

dict
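
Retrieving annotations for a residue range is essentially a slice over every per-residue list. The sketch below assumes residue numbers are 1-based and that the end residue is included; `residue_annotations` is a hypothetical helper and the sample values are made up.

```python
def residue_annotations(letter_annotations, start_resnum, end_resnum=None):
    """Slice per-residue annotations using 1-based, inclusive residue numbers."""
    if end_resnum is None:
        end_resnum = start_resnum
    if start_resnum < 1 or end_resnum < start_resnum:
        raise ValueError('residue numbers must satisfy 1 <= start <= end')
    # Residue 1 is index 0 in Python, so shift the start down by one
    return {key: values[start_resnum - 1:end_resnum]
            for key, values in letter_annotations.items()}


# Made-up per-residue values for a five-residue sequence
per_residue = {
    'SS-dssp': ['H', 'H', 'E', 'E', 'C'],
    'RSA-dssp': [0.1, 0.2, 0.3, 0.4, 0.5],
}
window = residue_annotations(per_residue, 2, 4)
single = residue_annotations(per_residue, 5)
```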

get_thermostability(at_temp)[source]

Run the thermostability calculator using either the Dill or Oobatake methods.

Stores calculated (dG, Keq) tuple in the annotations attribute, under the key thermostability_<TEMP>-<METHOD_USED>.

See ssbio.protein.sequence.properties.thermostability.get_dG_at_T() for instructions and details.

num_pdbs

int – Report the number of PDB IDs stored in the pdbs attribute

seq

Seq – Dynamically loaded Seq object from the sequence file

seq_len

int – Get the sequence length

seq_str

str – Get the sequence formatted as a string

write_fasta_file(outfile, force_rerun=False)[source]

Write a FASTA file for the protein sequence; seq will then load directly from this file.

Parameters:
  • outfile (str) – Path to new FASTA file to be written to
  • force_rerun (bool) – If an existing file should be overwritten
write_gff_file(outfile, force_rerun=False)[source]

Write a GFF file for the protein features; features will then load directly from this file.

Parameters:
  • outfile (str) – Path to new GFF file to be written to
  • force_rerun (bool) – If an existing file should be overwritten

Index: Software

This section provides a simple list of external software that may be required to carry out specific computations on a protein sequence or structure. This list only contains software that is wrapped with ssbio – there may be other programs that carry out these same functions, and do it better (or worse)!

Tables describing functionalities of these software packages in relation to their input, as well as links to internal wrappers and parsers, are found on The SeqProp Class and The StructProp Class pages.


Protein structure predictions

Homology modeling
I-TASSER
Description

I-TASSER (Iterative Threading ASSEmbly Refinement) is a program for protein homology modeling and functional prediction from a protein sequence. The I-TASSER suite provides numerous other tools, such as those for ligand-binding site predictions, model refinement, secondary structure predictions, B-factor estimations, and more. ssbio mainly provides tools to run and parse I-TASSER homology modeling results, as well as COACH consensus binding site predictions (optionally with EC number and GO term predictions). Also, scripts are provided to automate homology modeling on a large scale using TORQUE or Slurm job schedulers in a cluster computing environment.

Installation instructions

Note

These instructions were created on an Ubuntu 17.04 system.

Note

Read the README on the I-TASSER Suite page for the most up-to-date instructions

  1. Make sure you have Java installed and it can be run from the command line with java

  2. Head to the I-TASSER download page and register for a license (academic only) to get a password emailed to you

  3. Log in to the I-TASSER download page and download the archive

  4. Unpack the software archive into a convenient directory - a library should also be downloaded to this directory

  5. Run download_lib.pl to then download the library files - this will take some time:

    /path/to/<I-TASSER_directory>/download_lib.pl -libdir ITLIB
    
  6. Now, I-TASSER can be run according to the README under section 4

  7. To enable GO term predictions…

    1. under construction…
  8. Tip: to update template libraries automatically, add a new entry to your crontab, making sure to replace <USERNAME> with your username:

    0 4 * * 1,5 <USERNAME> /path/to/I-TASSER4.4/download_lib.pl -libdir /path/to/ITLIB
    

    That will run the library update at 4 am every Monday and Friday. The <USERNAME> field belongs in system-wide crontab files such as /etc/crontab; if you edit your personal crontab with crontab -e, leave it out.

Program execution
In the shell

To run the program on its own in the shell…

<code>
With ssbio

To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()

FAQs
  • What is a homology model?

    • A predicted 3D structure model of a protein sequence. Models can be template-based, when they are based on an existing experimental structure; or ab initio, generated without a template. Generally, ab initio models are much less reliable.
  • Can I just run I-TASSER using their web server and parse those results with ssbio?

    • Not yet, but you can manually input the model1.pdb file as a new structure for now.
  • How do I cite I-TASSER?

    • Roy A, Kucukural A & Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5: 725–738 Available at: http://dx.doi.org/10.1038/nprot.2010.5
  • How do I run I-TASSER with TORQUE or Slurm job schedulers?

    • under construction…
  • I’m having issues running I-TASSER…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
class ssbio.protein.structure.homology.itasser.itasserprep.ITASSERPrep(ident, seq_str, root_dir, itasser_path, itlib_path, execute_dir=None, light=True, runtype='local', print_exec=False, java_home=None, binding_site_pred=False, ec_pred=False, go_pred=False, additional_options=None, job_scheduler_header=None)[source]

Prepare a protein sequence for an I-TASSER homology modeling run.

The main utilities of this class are to:

  • Allow for the input of a protein sequence string and paths to I-TASSER to create execution scripts
  • Automate large-scale homology modeling efforts by creating Slurm or TORQUE job scheduling scripts
Parameters:
  • ident – Identifier for your sequence. Will be used as the global ID (folder name, sequence name)
  • seq_str – Sequence in string format
  • root_dir – Local directory where I-TASSER folder will be created
  • itasser_path – Path to I-TASSER folder, e.g. ‘~/software/I-TASSER4.4’
  • itlib_path – Path to ITLIB folder, e.g. ‘~/software/ITLIB’
  • execute_dir – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
  • light – If simulations should be limited to 5 runs
  • runtype – How you will be running I-TASSER - local, slurm, or torque
  • print_exec – If the execution script should be printed out
  • java_home – Path to Java executable
  • binding_site_pred – If binding site predictions should be run
  • ec_pred – If EC number predictions should be run
  • go_pred – If GO term predictions should be run
  • additional_options – Any other additional I-TASSER options, appended to the command
  • job_scheduler_header – Any job scheduling options, prepended as a header to the file
prep_folder(seq)[source]

Take in a sequence string and prepare the folder for the I-TASSER run.

class ssbio.protein.structure.homology.itasser.itasserprop.ITASSERProp(ident, original_results_path, coach_results_folder='model1/coach', model_to_use='model1')[source]

Parse all available information for a local I-TASSER modeling run.

Initializes a class to collect I-TASSER modeling information and optionally copy results to a new directory. SEE: https://zhanglab.ccmb.med.umich.edu/papers/2015_1.pdf for detailed information.

Parameters:
  • ident (str) – ID of I-TASSER modeling run
  • original_results_path (str) – Path to I-TASSER modeling folder
  • coach_results_folder (str) – Path to original COACH results
  • model_to_use (str) – Which I-TASSER model to use. Default is “model1”
copy_results(copy_to_dir, rename_model_to=None, force_rerun=False)[source]

Copy the raw information from I-TASSER modeling to a new folder.

Copies all files in the list _attrs_to_copy.

Parameters:
  • copy_to_dir (str) – Directory to copy the minimal set of results per sequence.
  • rename_model_to (str) – New file name (without extension)
  • force_rerun (bool) – If existing models and results should be overwritten.
get_dict(only_attributes=None, exclude_attributes=None, df_format=False)[source]

Summarize the I-TASSER run in a dictionary containing modeling results and top predictions from COACH

Parameters:
  • only_attributes (str, list) – Attributes that should be returned. If not provided, all are returned.
  • exclude_attributes (str, list) – Attributes that should be excluded.
  • df_format (bool) – If dictionary values should be formatted for a dataframe (everything possible is transformed into strings, int, or float - if something can’t be transformed it is excluded)
Returns:

Dictionary of attributes

Return type:

dict

load_structure_path(structure_path, file_type='pdb')[source]

Load a structure file and provide pointers to its location

Parameters:
  • structure_path (str) – Path to structure file
  • file_type (str) – Type of structure file
ssbio.protein.structure.homology.itasser.itasserprop.parse_bfp_dat(infile)[source]

Parse the B-factor predictions in BFP.dat

Parameters:infile (str) – Path to BFP.dat
Returns:List of B-factor predictions for all residues
Return type:list
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_bsites_inf(infile)[source]

Parse the Bsites.inf output file of COACH and return a list of rank-ordered binding site predictions

Bsites.inf contains the summary of COACH clustering results after all other prediction algorithms have finished. For each site (cluster), there are three lines:

  • Line 1: site number, c-score of coach prediction, cluster size
  • Line 2: algorithm, PDB ID, ligand ID, center of binding site (cartesian coordinates), c-score of the algorithm’s prediction, binding residues from single template
  • Line 3: Statistics of ligands in the cluster

C-score information:

Parameters:infile (str) – Path to Bsites.inf
Returns:Ranked list of dictionaries, keys defined below
  • site_num: cluster which is the consensus binding site
  • c_score: confidence score of the cluster prediction
  • cluster_size: number of predictions within this cluster
  • algorithm: main? algorithm used to make the prediction
  • pdb_template_id: PDB ID of the template used to make the prediction
  • pdb_template_chain: chain of the PDB which has the ligand
  • pdb_ligand: predicted ligand to bind
  • binding_location_coords: centroid of the predicted ligand position in the homology model
  • c_score_method: confidence score for the main algorithm
  • binding_residues: predicted residues to bind the ligand
  • ligand_cluster_counts: number of predictions per ligand
Return type:list
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_ec(infile)[source]

Parse the EC.dat output file of COACH and return a list of rank-ordered EC number predictions

EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues

Parameters:infile (str) – Path to EC.dat
Returns:Ranked list of dictionaries, keys defined below
  • pdb_template_id: PDB ID of the template used to make the prediction
  • pdb_template_chain: chain of the PDB which has the ligand
  • tm_score: TM-score of the template to the model (similarity score)
  • rmsd: RMSD of the template to the model (also a measure of similarity)
  • seq_ident: percent sequence identity
  • seq_coverage: percent sequence coverage
  • c_score: confidence score of the EC prediction
  • ec_number: predicted EC number
  • binding_residues: predicted residues to bind the ligand
Return type:list
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_ec_df(infile)[source]

Parse the EC.dat output file of COACH and return a dataframe of results

EC.dat contains the predicted EC number and active residues. The columns are: PDB_ID, TM-score, RMSD, Sequence identity, Coverage, Confidence score, EC number, and Active site residues

Parameters:infile (str) – Path to EC.dat
Returns:Pandas DataFrame summarizing EC number predictions
Return type:DataFrame
ssbio.protein.structure.homology.itasser.itasserprop.parse_coach_go(infile)[source]

Parse a GO output file from COACH and return a rank-ordered list of GO term predictions

The columns in all files are: GO terms, Confidence score, Name of GO terms. The files are:

  • GO_MF.dat - GO terms in ‘molecular function’
  • GO_BP.dat - GO terms in ‘biological process’
  • GO_CC.dat - GO terms in ‘cellular component’
Parameters:infile (str) – Path to any COACH GO prediction file
Returns:Organized dataframe of results, columns defined below
  • go_id: GO term ID
  • go_term: GO term text
  • c_score: confidence score of the GO prediction
Return type:Pandas DataFrame
ssbio.protein.structure.homology.itasser.itasserprop.parse_cscore(infile)[source]

Parse the cscore file to return a dictionary of scores.

Parameters:infile (str) – Path to cscore
Returns:Dictionary of scores
Return type:dict
ssbio.protein.structure.homology.itasser.itasserprop.parse_exp_dat(infile)[source]

Parse the solvent accessibility predictions in exp.dat

Parameters:infile (str) – Path to exp.dat
Returns:List of solvent accessibility predictions for all residues
Return type:list
ssbio.protein.structure.homology.itasser.itasserprop.parse_init_dat(infile)[source]

Parse the main init.dat file which contains the modeling results

The first line of init.dat is a header that looks like:

"120 easy  40   8"

The other lines look like this:

"     161   11.051   1  1guqA MUSTER"

Taking the first 10 of these lines gives the top 10 templates used in modeling.

Parameters:infile (str) – Path to init.dat
Returns:Dictionary of parsed information
Return type:dict
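A minimal sketch of the idea, using only the two example lines quoted above. The helper name top_templates and the output keys are hypothetical, and the meaning of the first three columns of a template row is not stated here, so they are kept as raw strings rather than interpreted.

```python
def top_templates(lines, n=10):
    """Collect up to n template rows from init.dat-style lines (hypothetical helper)."""
    header = lines[0].split()
    templates = []
    for line in lines[1:]:
        cols = line.split()
        if len(cols) != 5:
            continue  # not shaped like the five-field template row shown above
        templates.append({
            "template": cols[3],        # e.g. "1guqA"
            "program": cols[4],         # e.g. "MUSTER"
            "other_fields": cols[:3],   # left uninterpreted in this sketch
        })
        if len(templates) == n:
            break
    return {"header": header, "templates": templates}


sample = ["120 easy  40   8", "     161   11.051   1  1guqA MUSTER"]
parsed = top_templates(sample)
print(parsed["templates"][0]["template"], parsed["templates"][0]["program"])
```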
ssbio.protein.structure.homology.itasser.itasserprop.parse_seq_dat(infile)[source]

Parse the secondary structure predictions in seq.dat

Parameters:infile (str) – Path to seq.dat
Returns:List of secondary structure predictions for all residues
Return type:list
Transmembrane orientations
OPM
Description

OPM is a program to predict the location of transmembrane planes in protein structures, utilizing the atomic coordinates. ssbio provides a wrapper to submit PDB files to the web server, cache, and parse the results.

Instructions
  1. Use the function ssbio.protein.structure.properties.opm.run_ppm_server() to upload a PDB file to the PPM server.
FAQs
  • How can I install OPM?

    • OPM is only available as a web server. ssbio provides a wrapper for the web server and allows you to submit protein structures to it along with caching the output files.
  • How do I cite OPM?

    • Lomize MA, Pogozheva ID, Joo H, Mosberg HI & Lomize AL (2012) OPM database and PPM web server: resources for positioning of proteins in membranes. Nucleic Acids Res. 40: D370–6 Available at: http://dx.doi.org/10.1093/nar/gkr703
  • I’m having issues running OPM…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
ssbio.protein.structure.properties.opm.run_ppm_server(pdb_file, outfile, force_rerun=False)[source]

Run the PPM server from OPM to predict transmembrane residues.

Parameters:
  • pdb_file (str) – Path to PDB file
  • outfile (str) – Path to output HTML results file
  • force_rerun (bool) – Flag to rerun PPM if HTML results file already exists
Returns:

Dictionary of information from the PPM run, including a link to download the membrane protein file

Return type:

dict

Kinetic folding rate
FOLD-RATE
Description

This module provides a function to predict the kinetic folding rate (kf) given an amino acid sequence and its structural classification (alpha/beta/mixed).

Instructions
  1. Obtain your protein’s sequence
  2. Determine the main secondary structure composition of the protein (all-alpha, all-beta, mixed, or unknown)
  3. Input the sequence and secondary structure composition into the function ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate()
FAQs
  • What is the main secondary structure composition of my protein?

    • all-alpha = dominated by α-helices; α > 40% and β < 5%
    • all-beta = dominated by β-strands; β > 40% and α < 5%
    • mixed = contain both α-helices and β-strands; α > 15% and β > 10%
  • What is the kinetic folding rate?

    • Protein folding rate is a measure of slow/fast folding of proteins from the unfolded state to native three-dimensional structure.
  • What units is it in?

    • Number of proteins folded per second
  • How can I install FOLD-RATE?

    • FOLD-RATE is only available as a web server. ssbio provides a wrapper for the web server and allows you to submit protein sequences to it along with caching the output files.
  • How do I cite FOLD-RATE?

    • Gromiha MM, Thangakani AM & Selvaraj S (2006) FOLD-RATE: prediction of protein folding rates from amino acid sequence. Nucleic Acids Res. 34: W70–4 Available at: http://dx.doi.org/10.1093/nar/gkl043
  • How can this parameter be used on a genome-scale?

    • See: Chen K, Gao Y, Mih N, O’Brien EJ, Yang L & Palsson BO (2017) Thermosensitivity of growth is determined by chaperone-mediated proteome reallocation. Proceedings of the National Academy of Sciences 114: 11548–11553 Available at: http://www.pnas.org/content/114/43/11548.abstract
  • I’m having issues running FOLD-RATE…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
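The secondary structure thresholds listed in the FAQs above can be written as a small decision function. This is a sketch only: the function name structural_class is hypothetical, the inputs are assumed to be helix and strand content in percent, and anything that matches none of the three rules is labeled unknown.

```python
def structural_class(helix_pct, strand_pct):
    """Classify a protein from its helix/strand content in percent (hypothetical helper)."""
    if helix_pct > 40 and strand_pct < 5:
        return "all-alpha"
    if strand_pct > 40 and helix_pct < 5:
        return "all-beta"
    if helix_pct > 15 and strand_pct > 10:
        return "mixed"
    return "unknown"


for helix, strand in [(60, 2), (3, 55), (30, 25), (10, 5)]:
    print(helix, strand, structural_class(helix, strand))
```

The returned string matches the values accepted by the secstruct argument described in the API section.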
API
ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate(seq, secstruct)[source]

Submit sequence and structural class to FOLD-RATE calculator (http://www.iitm.ac.in/bioinfo/fold-rate/) to calculate kinetic folding rate.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • secstruct (str) – Structural class: all-alpha, all-beta, mixed, or unknown
Returns:

Kinetic folding rate k_f

Return type:

float

ssbio.protein.sequence.properties.kinetic_folding_rate.get_foldrate_at_temp(ref_rate, new_temp, ref_temp=37.0)[source]

Scale the predicted kinetic folding rate of a protein to temperature T, based on the relationship ln(k_f)∝1/T

Parameters:
  • ref_rate (float) – Kinetic folding rate calculated from the function get_foldrate()
  • new_temp (float) – Temperature in degrees C
  • ref_temp (float) – Reference temperature, default to 37 C
Returns:

Kinetic folding rate k_f at temperature T

Return type:

float
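One way to read the relationship above is as an Arrhenius-type line, ln(k_f) = A − B/T with T in kelvin. The sketch below rescales a log folding rate along such a line. It assumes the rate is expressed as ln(k_f); the function name scale_ln_rate and the slope value are placeholders for illustration, and whether ssbio uses this exact form or constant is an assumption.

```python
def scale_ln_rate(ln_ref_rate, new_temp_c, ref_temp_c=37.0, slope=22000.0):
    """Shift ln(k_f) from ref_temp_c to new_temp_c along ln(k_f) = A - slope / T.

    Temperatures are given in degrees C and converted to kelvin.
    The slope is a placeholder value, not a fitted constant.
    """
    ref_kelvin = ref_temp_c + 273.15
    new_kelvin = new_temp_c + 273.15
    intercept = ln_ref_rate + slope / ref_kelvin  # A, fixed by the reference point
    return intercept - slope / new_kelvin


same = scale_ln_rate(2.0, 37.0)
warmer = scale_ln_rate(2.0, 42.0)
cooler = scale_ln_rate(2.0, 30.0)
print(same, warmer, cooler)
```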


Protein structure calculations

Secondary structure
DSSP
DSSP calculations
Description

DSSP (Define Secondary Structure of Proteins) is the standard method used to assign secondary structure annotations to a protein structure. DSSP utilizes the atomic coordinates of a structure to assign the secondary structure codes, which are:

Code Description
H Alpha helix
B Beta bridge
E Strand
G Helix-3
I Helix-5
T Turn
S Bend

Furthermore, DSSP calculates geometric properties such as the phi and psi angles between residues and solvent accessibilities. ssbio provides wrappers around the Biopython DSSP module to execute and parse DSSP results, as well as converting the information into a Pandas DataFrame format with calculated relative solvent accessibilities (see ssbio.protein.structure.properties.dssp for details).

Installation instructions (Ubuntu)

Note

These instructions were created on an Ubuntu 17.04 system.

  1. Install the DSSP package

    sudo apt-get install dssp
    
  2. The program installs itself as mkdssp, not dssp, and Biopython looks to execute dssp, so we need to symlink the name dssp to mkdssp

    sudo ln -s /usr/bin/mkdssp /usr/bin/dssp
    
  3. Then you should be able to run dssp in your terminal

Program execution
In the shell

To run the program on its own in the shell…

dssp -i <path_to_pdb_file> -o <new_path_to_output_file>
With ssbio

To run the program using the ssbio Python wrapper, see: ssbio.protein.structure.properties.dssp.get_dssp_df_on_file()
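For readers who want to call the shell command above from Python without ssbio, the sketch below builds the same dssp -i ... -o ... invocation with the standard library. The helper names and file names are hypothetical, and the -i/-o flags are taken from the shell example above; other DSSP builds may use a different command line.

```python
import shutil
import subprocess


def dssp_command(pdb_file, out_file, executable="dssp"):
    """Build the argument list for the shell command shown above."""
    return [executable, "-i", pdb_file, "-o", out_file]


def run_dssp(pdb_file, out_file):
    """Run DSSP if it is on the PATH; raise a clear error otherwise."""
    cmd = dssp_command(pdb_file, out_file)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("dssp not found on PATH; see the installation steps above")
    subprocess.run(cmd, check=True)  # raises CalledProcessError if DSSP fails
    return out_file


# Only the command is built here; run_dssp() is not called in this example.
cmd = dssp_command("model.pdb", "model.dssp")
print(" ".join(cmd))
```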

FAQs
  • How do I cite DSSP?

    • Kabsch W & Sander C (1983) DSSP: definition of secondary structure of proteins given a set of 3D coordinates. Biopolymers 22: 2577–2637
  • I’m having issues running DSSP…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
ssbio.protein.structure.properties.dssp.all_dssp_props(filename, file_type)[source]

Returns a large dictionary of SASA, secondary structure composition, and surface/buried composition. Values are computed using DSSP.

Input: PDB or MMCIF filename

Output: Dictionary of values obtained from DSSP

ssbio.protein.structure.properties.dssp.calc_sasa(dssp_df)[source]

Calculation of SASA utilizing the DSSP program.

DSSP must be installed for biopython to properly call it. Install using apt-get on Ubuntu or from: http://swift.cmbi.ru.nl/gv/dssp/

Input: PDB or CIF structure file

Output: SASA (integer) of structure

ssbio.protein.structure.properties.dssp.calc_surface_buried(dssp_df)[source]

Calculates the percentage of residues that are on the surface or buried, and whether they are polar or nonpolar. Returns a dictionary of these percentages.

ssbio.protein.structure.properties.dssp.get_dssp_df_on_file(pdb_file, outfile=None, outdir=None, outext='_dssp.df', force_rerun=False)[source]

Run DSSP directly on a structure file with the Biopython method Bio.PDB.DSSP.dssp_dict_from_pdb_file

Avoids errors like: PDBException: Structure/DSSP mismatch at <Residue MSE het= resseq=19 icode= > by not matching information to the structure file (DSSP fills in the ID “X” for unknown residues)

Parameters:
  • pdb_file – Path to PDB file
  • outfile – Name of output file
  • outdir – Path to output directory
  • outext – Extension of output file
  • force_rerun – If DSSP should be rerun if the outfile exists
Returns:

DSSP results summarized

Return type:

DataFrame

ssbio.protein.structure.properties.dssp.get_ss_class(pdb_file, dssp_file, chain)[source]

Define the secondary structure class of a PDB file at the specific chain

Parameters:
  • pdb_file
  • dssp_file
  • chain

Returns:

ssbio.protein.structure.properties.dssp.secondary_structure_summary(dssp_df)[source]

Summarize the secondary structure content of the DSSP dataframe for each chain.

Parameters:dssp_df – Pandas DataFrame of parsed DSSP results
Returns:Chain to secondary structure summary dictionary
Return type:dict
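A minimal sketch of a per-chain summary, assuming the input has already been reduced to (chain ID, DSSP code) pairs. The helper name summarize_by_chain and the output layout are hypothetical; the real function works on a DataFrame and its exact output keys are not documented here.

```python
from collections import Counter, defaultdict


def summarize_by_chain(residues):
    """Map each chain ID to {DSSP code: percent of residues} (hypothetical helper)."""
    counts = defaultdict(Counter)
    for chain_id, code in residues:
        counts[chain_id][code] += 1
    summary = {}
    for chain_id, counter in counts.items():
        total = sum(counter.values())
        summary[chain_id] = {code: 100.0 * n / total for code, n in counter.items()}
    return summary


demo = summarize_by_chain(
    [("A", "H"), ("A", "H"), ("A", "E"), ("A", "T"), ("B", "E"), ("B", "E")]
)
print(demo)
```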
STRIDE
Secondary structure
Description

STRIDE (Structural identification) is a program used to assign secondary structure annotations to a protein structure. STRIDE has slightly more complex criteria to assign codes compared to DSSP. STRIDE utilizes the atomic coordinates of a structure to assign the structure codes, which are:

Code Description
H Alpha helix
G 3-10 helix
I PI-helix
E Extended conformation
B or b Isolated bridge
T Turn
C Coil (none of the above)
Installation instructions (Unix)

Note

These instructions were created on an Ubuntu 17.04 system.

  1. Download the source from the STRIDE download page

  2. Create a new folder named “stride” in a place where you store software and extract the source into it

    mkdir /path/to/software/stride
    cp /path/to/downloaded/stride.tar.gz /path/to/software/stride
    cd /path/to/software/stride
    tar -zxf stride.tar.gz
    
  3. Build the program from source and copy its binary:

    cd /path/to/software/stride
    make
    cp stride /usr/local/bin
    
Program execution
In the shell

To run the program on its own in the shell…

stride
With ssbio

To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()

FAQs
  • How do I cite STRIDE?

  • I’m having issues running STRIDE…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
Solvent accessibilities
FreeSASA
SASA
Description

FreeSASA is an open source library written in C for calculating solvent accessible surface areas of a protein. FreeSASA also contains Python bindings, and the plan is to include these bindings with ssbio in the future.

Installation instructions (Unix)

Note

These instructions were created on an Ubuntu 17.04 system with a Python installation through Anaconda3.

Note

FreeSASA Python bindings are slightly difficult to install with Python 3 - ssbio provides wrappers for the command line executable instead

  1. Download the latest tarball (see FreeSASA home page), expand it and run

    ./configure --disable-json --disable-xml
    make
    
  2. Install with

    sudo make install
    
Program execution
In the shell

To run the program on its own in the shell…

freesasa
With ssbio

To run the program using the ssbio Python wrapper, see: ssbio.protein.structure.properties.freesasa.run_freesasa()

FAQs
  • How do I cite FreeSASA?

  • I’m having issues running FreeSASA…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
ssbio.protein.structure.properties.freesasa.parse_rsa_data(rsa_outfile, ignore_hets=True)[source]

Process a NACCESS or freesasa RSA output file. Adapted from the Biopython NACCESS module.

Parameters:
  • rsa_outfile (str) – Path to RSA output file
  • ignore_hets (bool) – If HETATMs should be excluded from the final dictionary. This is extremely important when loading this information into a ChainProp’s SeqRecord, since this will throw off the sequence matching.
Returns:

Per-residue dictionary of RSA values

Return type:

dict

ssbio.protein.structure.properties.freesasa.run_freesasa(infile, outfile, include_hetatms=True, outdir=None, force_rerun=False)[source]

Run freesasa on a PDB file, output using the NACCESS RSA format.

Parameters:
  • infile (str) – Path to PDB file (only PDB file format is accepted)
  • outfile (str) – Path or filename of output file
  • include_hetatms (bool) – If heteroatoms should be included in the SASA calculations
  • outdir (str) – Path to output file if not specified in outfile
  • force_rerun (bool) – If freesasa should be rerun even if outfile exists
Returns:

Path to output SASA file

Return type:

str
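Many wrappers in this documentation take a force_rerun flag and skip the external program when the output file already exists. The sketch below shows that caching pattern in isolation; should_run is a hypothetical helper and the file name is made up, so this is not necessarily how ssbio implements the check.

```python
import os
import tempfile


def should_run(outfile, force_rerun=False):
    """Return True if the program must run: output missing, or a rerun is forced."""
    return force_rerun or not os.path.exists(outfile)


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "structure.rsa")
    first = should_run(out)                     # no output yet
    open(out, "w").close()                      # pretend the program wrote its output
    cached = should_run(out)                    # output exists, so skip
    forced = should_run(out, force_rerun=True)  # rerun requested explicitly

print(first, cached, forced)
```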

Residue depths
MSMS
Residue depths
Description

MSMS computes solvent excluded surfaces on a protein structure. Generally, MSMS is used to calculate residue depths (in Angstroms) from the surface of a protein, using a PDB file as an input. ssbio provides wrappers through Biopython to run MSMS as well as store the depths in an associated StructProp object.

Installation instructions (Unix)

Note

These instructions were created on an Ubuntu 17.04 system.

  1. Head to the Download page, and under the header “MSMS 2.6.X - Current Release” download the “Unix/Linux i86_64” version - if this doesn’t work though you’ll want to try the “Unix/Linux i86” version later.

  2. Download it, unarchive it to your library path:

    sudo mkdir /usr/local/lib/msms
    cd /usr/local/lib/msms
    tar zxvf /path/to/your/downloaded/file/msms_i86_64Linux2_2.6.1.tar.gz
    
  3. Symlink the binaries (or alternatively, add the two locations to your PATH):

    sudo ln -s /usr/local/lib/msms/msms.x86_64Linux2.2.6.1 /usr/local/bin/msms
    sudo ln -s /usr/local/lib/msms/pdb_to_xyzr* /usr/local/bin
    
  4. Fix a bug in the pdb_to_xyzr file (see: http://mailman.open-bio.org/pipermail/biopython/2015-November/015787.html):

    sudo vi /usr/local/lib/msms/pdb_to_xyzr
    

    at line 34, change:

    numfile = "./atmtypenumbers"
    

    to:

    numfile = "/usr/local/lib/msms/atmtypenumbers"
    
  5. Repeat step 4 for the file /usr/local/lib/msms/pdb_to_xyzrn

  6. Now try running msms in the terminal, it should say:

    $ msms
    MSMS 2.6.1 started on structure
    Copyright M.F. Sanner (1994)
    Compilation flags -O2 -DVERBOSE -DTIMING
    MSMS: No input stream specified
    
Program execution
In the shell

To run the program on its own in the shell…

<code>
With ssbio

To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()

FAQs
  • How do I cite MSMS?

  • How long does it take to run?

    • Depending on the size of the protein structure, the program can take up to a couple minutes to execute.
  • I’m having issues running MSMS…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
ssbio.protein.structure.properties.msms.get_msms_df(model, pdb_id, outfile=None, outdir=None, outext='_msms.df', force_rerun=False)[source]

Run MSMS (using Biopython) on a Biopython Structure Model.

Depths are in units of Angstroms (1 Å = 10^-10 m = 0.1 nm). Returns a dictionary of:

{
    chain_id:{
                resnum1_id: (res_depth, ca_depth),
                resnum2_id: (res_depth, ca_depth)
             }
}
Parameters:model – Biopython Structure Model
Returns:ResidueDepth property_dict, reformatted
Return type:Pandas DataFrame
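To make the reformatting step concrete, the sketch below flattens a nested depth dictionary of the shape shown above into one row per residue, which is the form a table or DataFrame would take. The helper name depths_to_rows and the depth values are made up for illustration.

```python
def depths_to_rows(depth_dict):
    """Flatten {chain: {resnum: (res_depth, ca_depth)}} into a list of row dicts."""
    rows = []
    for chain_id, residues in depth_dict.items():
        for resnum, (res_depth, ca_depth) in residues.items():
            rows.append({
                "chain": chain_id,
                "resnum": resnum,
                "res_depth": res_depth,
                "ca_depth": ca_depth,
            })
    return rows


rows = depths_to_rows({"A": {1: (2.5, 3.0), 2: (4.1, 4.8)}})
for row in rows:
    print(row)
```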
ssbio.protein.structure.properties.msms.get_msms_df_on_file(pdb_file, outfile=None, outdir=None, outext='_msms.df', force_rerun=False)[source]

Run MSMS (using Biopython) on a PDB file.

Saves a CSV file of:
  • chain: chain ID
  • resnum: residue number (PDB numbering)
  • icode: residue insertion code
  • res_depth: average depth of all atoms in a residue
  • ca_depth: depth of the alpha carbon atom

Depths are in units of Angstroms (1 Å = 10^-10 m = 0.1 nm).

Parameters:
  • pdb_file – Path to PDB file
  • outfile – Optional name of output file (without extension)
  • outdir – Optional output directory
  • outext – Optional extension (suffix) for the output file
  • force_rerun – Rerun MSMS even if results exist already
Returns:

ResidueDepth property_dict, reformatted

Return type:

Pandas DataFrame

Structural similarity
FATCAT
Description

FATCAT is a structural alignment tool that allows you to determine the similarity of a pair of protein structures.

Warning

Parsing FATCAT results is currently incomplete and will only return TM-scores as of now - but TM-scores only show up in development versions of jFATCAT

Installation instructions

Note

These instructions were created on an Ubuntu 17.04 system.

  1. Make sure Java is installed on your system and can be run with the command java
  2. Download the Java port of FATCAT from the jFATCAT download page, under the section “Older file downloads” with the filename protein-comparison-tool_<DATE>.tar.gz, with the most recent date.
  3. Extract it to a place where you store software
Program execution
In the shell

To run the program on its own in the shell…

/path/to/software/fatcat/runFATCAT.sh
With ssbio

To run the program using the ssbio Python wrapper, see: ssbio.protein.structure.properties.fatcat.run_fatcat(). Run it on two structures, pointing to the path of the runFATCAT.sh script.

FAQs
  • How do I cite FATCAT?

  • I’m having issues running FATCAT…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
ssbio.protein.structure.properties.fatcat.parse_fatcat(fatcat_xml)[source]

Parse a FATCAT XML result file.

Parameters:fatcat_xml (str) – Path to FATCAT XML result file
Returns:Parsed information from the output
Return type:dict

Todo

  • Only returning TM-score at the moment
ssbio.protein.structure.properties.fatcat.run_fatcat(structure_path_1, structure_path_2, fatcat_sh, outdir='', silent=False, print_cmd=False, force_rerun=False)[source]

Run FATCAT on two PDB files, and return the path of the XML result file.

Parameters:
  • structure_path_1 (str) – Path to PDB file
  • structure_path_2 (str) – Path to PDB file
  • fatcat_sh (str) – Path to “runFATCAT.sh” executable script
  • outdir (str) – Path to where FATCAT XML output files will be saved
  • silent (bool) – If stdout should be silenced from showing up in Python console output
  • print_cmd (bool) – If command to run FATCAT should be printed to stdout
  • force_rerun (bool) – If FATCAT should be run even if XML output files already exist
Returns:

Path to XML output file

Return type:

str

ssbio.protein.structure.properties.fatcat.run_fatcat_all_by_all(list_of_structure_paths, fatcat_sh, outdir='', silent=True, force_rerun=False)[source]

Run FATCAT on all pairs of structures given a list of structures.

Parameters:
  • list_of_structure_paths (list) – List of PDB file paths
  • fatcat_sh (str) – Path to “runFATCAT.sh” executable script
  • outdir (str) – Path to where FATCAT XML output files will be saved
  • silent (bool) – If stdout should be silenced from showing up in Python console output
  • force_rerun (bool) – If FATCAT should be run even if XML output files already exist
Returns:

TM-scores (similarity) between all structures

Return type:

Pandas DataFrame
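An all-by-all comparison amounts to enumerating every pair of structures. The sketch below does this with itertools.combinations; it assumes each unordered pair is compared once, which is an assumption about the wrapper's behavior, and the file names are placeholders.

```python
from itertools import combinations


def structure_pairs(paths):
    """Return every unordered pair of structure paths, each pair listed once."""
    return list(combinations(paths, 2))


pairs = structure_pairs(["a.pdb", "b.pdb", "c.pdb"])
for first, second in pairs:
    print(first, "vs", second)
```

For n structures this yields n(n-1)/2 pairs, so run time grows quadratically with the length of the list.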

Various structure properties
Structure cleaning, mutating

Protein sequence predictions

Secondary structure
SCRATCH
Secondary structure
Description

SCRATCH is a suite of tools to predict many types of structural properties directly from sequence. ssbio contains wrappers to execute and parse results from SSpro/SSpro8 - predictors of secondary structure, and ACCpro/ACCpro20 - predictors of solvent accessibility.

Installation instructions (Ubuntu)

Note

These instructions were created on an Ubuntu 17.04 system.

  1. Download the source and install it using the perl script:

    mkdir /path/to/my/software/scratch
    cd /path/to/my/software/scratch
    wget http://download.igb.uci.edu/SCRATCH-1D_1.1.tar.gz
    tar -zxf SCRATCH-1D_1.1.tar.gz
    cd SCRATCH-1D_1.1
    perl install.pl
    
  2. To run it from the command line directly, see “Program execution” below.

  3. ssbio also provides command line wrappers to run it and parse the results; see ssbio.protein.sequence.properties.scratch.SCRATCH for details.

Program execution
In the shell

To run the program on its own in the shell…

/path/to/my/software/scratch/SCRATCH-1D_1.1/bin/run_SCRATCH-1D_predictors.sh  input_fasta  output_prefix  [num_threads]
With ssbio

To run the program using the ssbio Python wrapper, see: ssbio.protein.sequence.properties.scratch.SCRATCH.run_scratch()

FAQs
  • How do I cite SCRATCH?

    • Cheng J, Randall AZ, Sweredoski MJ & Baldi P (2005) SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res. 33: W72–6 Available at: http://dx.doi.org/10.1093/nar/gki396
  • I’m having issues running SCRATCH…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
class ssbio.protein.sequence.properties.scratch.SCRATCH(project_name, seq_file=None, seq_str=None)[source]

Provide wrappers for running and parsing SCRATCH on a sequence file or sequence string.

To run from the command line:

./run_SCRATCH-1D_predictors.sh  input_fasta  output_prefix  [num_threads]

SCRATCH predicts:

  • Secondary structure

    • 3 classes (helix, strand, other) using SSpro
    • 8 classes (standard DSSP definitions) using SSpro8
  • Relative solvent accessibility (RSA, also known as relative accessible surface area)

    • @ 25% exposed RSA cutoff (<25% RSA means it is buried)
    • @ all cutoffs in 5% increments from 0 to 100
accpro20_results()[source]

Parse the ACCpro20 output file and return a dict of solvent accessibility predictions

accpro20_summary(cutoff)[source]

Parse the ACCpro output file and return a summary of percent exposed/buried residues based on a cutoff.

  • Below the cutoff = buried
  • Equal to or greater than the cutoff = exposed

The default cutoff used in ACCpro is 25%.

The output file is just a FASTA formatted file, so you can get residue level
information by parsing it like a normal sequence file.
Parameters:cutoff (float) – Cutoff for defining a buried or exposed residue.
Returns:Percentage of buried and exposed residues
Return type:dict
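The cutoff rule can be sketched as follows, assuming the per-residue relative solvent accessibilities have already been parsed into a list of numbers in percent. The helper name exposure_summary and the output keys are hypothetical.

```python
def exposure_summary(rsa_values, cutoff=25.0):
    """Percent of residues buried (< cutoff) and exposed (>= cutoff)."""
    total = len(rsa_values)
    if total == 0:
        return {"percent_buried": 0.0, "percent_exposed": 0.0}
    buried = sum(1 for value in rsa_values if value < cutoff)
    return {
        "percent_buried": 100.0 * buried / total,
        "percent_exposed": 100.0 * (total - buried) / total,
    }


# A residue exactly at the cutoff (25) counts as exposed
summary = exposure_summary([0, 10, 25, 60])
print(summary)
```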
accpro_results()[source]

Parse the ACCpro output file and return a dict of solvent accessibility predictions.

accpro_summary()[source]

Parse the ACCpro output file and return a summary of percent exposed/buried residues.

The output file is just a FASTA formatted file, so you can get residue level
information by parsing it like a normal sequence file.
Returns:Percentage of buried and exposed residues
Return type:dict
run_scratch(path_to_scratch, num_cores=1, outname=None, outdir=None, force_rerun=False)[source]

Run SCRATCH on the sequence_file that was loaded into the class.

Parameters:
  • path_to_scratch – Path to the SCRATCH executable, run_SCRATCH-1D_predictors.sh
  • outname – Prefix to name the output files
  • outdir – Directory to store the output files
  • force_rerun – Flag to force rerunning of SCRATCH even if the output files exist

Returns:

sspro8_results()[source]

Parse the SSpro8 output file and return a dict of secondary structure compositions.

sspro8_summary()[source]

Parse the SSpro8 output file and return a summary of secondary structure composition.

The output file is just a FASTA formatted file, so you can get residue level
information by parsing it like a normal sequence file.
Returns:
Percentage of:
  • H: alpha-helix
  • G: 3-10 helix
  • I: pi-helix (extremely rare)
  • E: extended strand
  • B: beta-bridge
  • T: turn
  • S: bend
  • C: the rest
Return type:dict
sspro_results()[source]

Parse the SSpro output file and return a dict of secondary structure compositions.

Returns:
Keys are sequence IDs, values are the lists of secondary structure predictions.
  • H: helix
  • E: strand
  • C: the rest
Return type:dict
sspro_summary()[source]

Parse the SSpro output file and return a summary of secondary structure composition.

The output file is just a FASTA formatted file, so you can get residue level
information by parsing it like a normal sequence file.
Returns:
Percentage of:
  • H: helix
  • E: strand
  • C: the rest
Return type:dict
ssbio.protein.sequence.properties.scratch.read_accpro20(infile)[source]

Read the accpro20 output (.acc20) and return the parsed FASTA records.

Keeps the spaces between the accessibility numbers.

Parameters:infile – Path to .acc20 file
Returns:Dictionary of accessibilities with keys as the ID
Return type:dict
Solvent accessibilities
Thermostability
Transmembrane domains
TMHMM
Description

TMHMM is a program to predict the location of transmembrane helices in proteins, directly from sequence. ssbio provides a wrapper to execute and parse the “long” output format of TMHMM.

Installation instructions (Unix)

Note

These instructions were created on an Ubuntu 17.04 system.

  1. Register for the software (academic license only) at the TMHMM download page

  2. Receive instructions to download the software at your email address

  3. Download the file tmhmm-2.0c.Linux.tar.gz

  4. Extract it to a place where you store software

  5. Install it according to the TMHMM installation instructions, repeated and annotated below…

    1. Insert the correct path for perl 5.x in the first line of the scripts bin/tmhmm and bin/tmhmmformat.pl (if not /usr/local/bin/perl). Use which perl and perl -v in the terminal to help find the correct path.
    2. Make sure you have an executable version of decodeanhmm in the bin directory.
    3. Include the directory containing tmhmm in your path (how do I add something to my Path?)
    4. Read the TMHMM2.0.guide.html
Program execution
In the shell

To run the program on its own, execute the following command with your protein sequences contained in a FASTA file:

tmhmm my_sequences.fasta
With ssbio

To run the program using the ssbio Python wrapper, see: ssbio.protein.path.to.wrapper()

FAQs
  • How do I cite TMHMM?

    • Krogh A, Larsson B, von Heijne G & Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305: 567–580 Available at: http://dx.doi.org/10.1006/jmbi.2000.4315
  • I’m having issues running TMHMM…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
ssbio.protein.sequence.properties.tmhmm.label_TM_tmhmm_residue_numbers_and_leaflets(tmhmm_seq)[source]

Determine the residue numbers of the TM-helix residues that cross the membrane and label them by leaflet.

Parameters:tmhmm_seq – g.protein.representative_sequence.seq_record.letter_annotations[‘TM-tmhmm’]
Returns:A dictionary of leaflet_variable : [residue list], where the variable is inside or outside. TM_boundary dict: a dictionary of TM helix number : [TM helix residue start, TM helix residue end]
Return type:leaflet_dict

Todo

untested method!
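The helix boundary part of this idea can be sketched with itertools.groupby on a per-residue label string. The label alphabet used here ("T" for transmembrane, "I" for inside, "O" for outside) and the helper name tm_boundaries are assumptions for illustration; check the actual TM-tmhmm annotation before relying on them.

```python
from itertools import groupby


def tm_boundaries(labels, tm_char="T"):
    """Map TM helix number -> [start, end] residue numbers (1-based, inclusive)."""
    boundaries = {}
    position = 1
    helix_number = 0
    for char, run in groupby(labels):
        length = len(list(run))
        if char == tm_char:
            helix_number += 1
            boundaries[helix_number] = [position, position + length - 1]
        position += length
    return boundaries


demo = tm_boundaries("IIITTTTOOOTTTII")
print(demo)
```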

Aggregation propensity
AMYLPRED2
Description

This module provides a function to predict the aggregation propensity of proteins, specifically the number of aggregation-prone segments on an unfolded protein sequence. AMYLPRED2 is a consensus of several different prediction methods. To obtain the best balance between sensitivity and specificity, we follow the authors’ guidelines: every stretch of 5 consecutive residues that at least 5 methods agree on contributes 1 to the aggregation propensity.

Instructions
  1. Create an account on the webserver at the AMYLPRED2 registration link.
  2. Create a new AMYLPRED object with your email and password initialized along with it.
  3. Run ssbio.protein.sequence.properties.aggregation_propensity.AMYLPRED.get_aggregation_propensity() on a protein sequence.
FAQs
  • What is aggregation propensity?

    • The number of aggregation-prone segments on an unfolded protein sequence.
  • How can I install AMYLPRED2?

    • AMYLPRED2 is only available as a web server. ssbio provides a wrapper for the web server and allows you to submit protein sequences to it along with caching the output files.
  • How do I cite AMYLPRED2?

    • Tsolis AC, Papandreou NC, Iconomidou VA & Hamodrakas SJ (2013) A consensus method for the prediction of ‘aggregation-prone’ peptides in globular proteins. PLoS One 8: e54175 Available at: http://dx.doi.org/10.1371/journal.pone.0054175
  • How can this parameter be used on a genome-scale?

    • See: Chen K, Gao Y, Mih N, O’Brien EJ, Yang L & Palsson BO (2017) Thermosensitivity of growth is determined by chaperone-mediated proteome reallocation. Proceedings of the National Academy of Sciences 114: 11548–11553 Available at: http://www.pnas.org/content/114/43/11548.abstract
  • I’m having issues running AMYLPRED2…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
class ssbio.protein.sequence.properties.aggregation_propensity.AMYLPRED(email, password)[source]

Class to submit sequences to AMYLPRED2.

Instructions:

  1. Create an account on the webserver at the AMYLPRED2 registration link.
  2. Create a new AMYLPRED object with your email and password initialized along with it.
  3. Run get_aggregation_propensity on a protein sequence.
email

str – Account email

password

str – Account password

Todo

  • Properly implement force_rerun and caching functions
get_aggregation_propensity(seq, outdir, cutoff_v=5, cutoff_n=5, run_amylmuts=False)[source]

Run the AMYLPRED2 web server for a protein sequence and get the consensus result for aggregation propensity.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • outdir (str) – Directory to where output files should be saved
  • cutoff_v (int) – The minimal number of methods that agree on a residue being an aggregation-prone residue
  • cutoff_n (int) – The minimal number of consecutive residues to be considered as a ‘stretch’ of aggregation-prone region
  • run_amylmuts (bool) – If AMYLMUTS method should be run, default False. AMYLMUTS is optional as it is the most time consuming and generates a slightly different result every submission.
Returns:

Aggregation propensity - the number of aggregation-prone segments on an unfolded protein sequence

Return type:

int
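The roles of cutoff_v and cutoff_n can be illustrated with a short counting routine. This sketch assumes the per-method results have already been reduced to one vote count per residue, and it counts each qualifying run of residues once regardless of its length; both points, along with the helper name, are assumptions rather than a description of ssbio's implementation.

```python
def count_aggregation_segments(votes, cutoff_v=5, cutoff_n=5):
    """Count runs of >= cutoff_n consecutive residues with >= cutoff_v method votes."""
    segments = 0
    run_length = 0
    for vote in votes:
        if vote >= cutoff_v:
            run_length += 1
        else:
            if run_length >= cutoff_n:
                segments += 1
            run_length = 0
    if run_length >= cutoff_n:  # a run that reaches the end of the sequence
        segments += 1
    return segments


# Runs of length 5, 2 and 6 above the vote cutoff: only the first and last count
votes = [0, 5, 6, 5, 7, 5, 1, 5, 5, 0, 6, 6, 6, 6, 6, 6]
propensity = count_aggregation_segments(votes)
print(propensity)
```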

parse_method_results(results_file, met)[source]

Parse the output of an AMYLPRED2 result file.

run_amylpred2(seq, outdir, run_amylmuts=False)[source]

Run all methods on the AMYLPRED2 web server for an amino acid sequence and gather results.

Result files are cached in /path/to/outdir/AMYLPRED2_results.

Parameters:
  • seq (str) – Amino acid sequence as a string
  • outdir (str) – Directory to where output files should be saved
  • run_amylmuts (bool) – If AMYLMUTS method should be run, default False
Returns:

Result for each method run

Return type:

dict


Protein sequence calculations

Various sequence properties
EMBOSS
Description

EMBOSS is the European Molecular Biology Open Software Suite. EMBOSS contains a wide array of general purpose bioinformatics programs. For the GEM-PRO pipeline, we mainly need the needle pairwise alignment tool (although this can be replaced with Biopython’s built-in pairwise alignment function), and the pepstats protein sequence statistics tool.

Installation instructions (Ubuntu)

Note

These instructions were created on an Ubuntu 17.04 system.

  1. Install the EMBOSS package which contains many programs

    sudo apt-get install emboss
    
  2. And then once that installs, try running the needle program:

    needle
    
Installation instructions (Mac OSX, other Unix)
  1. Just install after downloading the EMBOSS source code

    ./configure
    make
    sudo make install
    
Program execution
In the shell

To run the program on its own in the shell…

needle
With ssbio

To run the program using the ssbio Python wrapper, see:

FAQs
  • How do I cite EMBOSS?

  • I’m having issues running EMBOSS programs…

    • See the ssbio wiki for (hopefully) some solutions - or add yours in when you find the answer!
API
ssbio.protein.sequence.properties.residues.biopython_protein_analysis(inseq)[source]

Utilize Biopython’s ProteinAnalysis module to return general sequence properties of an amino acid string.

For full definitions see: http://biopython.org/DIST/docs/api/Bio.SeqUtils.ProtParam.ProteinAnalysis-class.html

Parameters:inseq – Amino acid sequence
Returns:Dictionary of sequence properties. Some definitions include:
  • instability_index: Any value above 40 means the protein is unstable (has a short half-life)
  • secondary_structure_fraction: Percentage of protein in helix, turn or sheet
Return type:dict

Todo

Finish definitions of dictionary

ssbio.protein.sequence.properties.residues.emboss_pepstats_on_fasta(infile, outfile='', outdir='', outext='.pepstats', force_rerun=False)[source]

Run EMBOSS pepstats on a FASTA file.

Parameters:
  • infile – Path to FASTA file
  • outfile – Name of output file without extension
  • outdir – Path to output directory
  • outext – Extension of results file, default is “.pepstats”
  • force_rerun – Flag to rerun pepstats
Returns:

Path to output file.

Return type:

str

ssbio.protein.sequence.properties.residues.emboss_pepstats_parser(infile)[source]

Get dictionary of pepstats results.

Parameters:infile – Path to pepstats outfile
Returns:Parsed information from pepstats
Return type:dict

Todo

Only currently parsing the bottom of the file for percentages of properties.

ssbio.protein.sequence.properties.residues.flexibility_index(aa_one)[source]

From Smith DK, Radivojac P, Obradovic Z, et al. Improved amino acid flexibility parameters. Protein Sci. 2003, 12:1060

Author: Ke Chen

Parameters:aa_one

Returns:

ssbio.protein.sequence.properties.residues.grantham_score(ref_aa, mut_aa)[source]

https://github.com/ashutoshkpandey/Annotation/blob/master/Grantham_score_calculator.py

Sequence alignment

Index: Tutorials

Welcome to the ssbio Binder! Here you can interactively launch notebook tutorials or even alter them to run on your own data.

Testing

These notebooks make sure the Binder environment has been installed correctly.

  • Software Installation Tester - This notebook simply tests if external programs have been installed correctly and can run in a Binder environment.
  • I-TASSER and TMHMM Install Guide - This notebook provides a guide to installing I-TASSER and TMHMM in Binder, since these require you to register with your email and cannot be installed beforehand.

The GEM-PRO Pipeline

The GEM-PRO pipeline is focused on annotating genome-scale models with protein structure information, and subsequently making it easier to work with proteins at this scale.

The Protein Class

The StructProp Class

The SeqProp Class

Other tutorials

Python API

Information on select functions, classes, or methods.

ssbio.pipeline.gempro

GEMPRO
class ssbio.pipeline.gempro.GEMPRO(gem_name, root_dir=None, pdb_file_type='mmtf', gem=None, gem_file_path=None, gem_file_type=None, genes_list=None, genes_and_sequences=None, genome_path=None, write_protein_fasta_files=True, description=None, custom_spont_id=None)[source]

Generic class to represent all information for a GEM-PRO project.

Initialize the GEM-PRO project with a genome-scale model, a list of genes, or a dict of genes and sequences. Specify the name of your project, along with the root directory where a folder with that name will be created.

Main methods provided are:

  1. Automated mapping of sequence IDs

    • With KEGG mapper
    • With UniProt mapper
    • Allowing manual gene ID –> protein sequence entry
    • Allowing manual gene ID –> UniProt ID
  2. Consolidating sequence IDs and setting a representative sequence

    • Currently these are set based on available PDB IDs
  3. Mapping of representative sequence –> structures

    • With UniProt –> ranking of PDB structures
    • BLAST representative sequence –> PDB database
  4. Preparation of files for homology modeling (currently for I-TASSER)

    • Mapping to existing models
    • Preparation for running I-TASSER
    • Parsing I-TASSER runs
  5. Running QC/QA on structures and setting a representative structure

    • Various cutoffs (mutations, insertions, deletions) can be set to filter structures
  6. Automation of protein sequence and structure property calculation

  7. Creation of Pandas DataFrame summaries directly from downloaded metadata

Parameters:
  • gem_name (str) – The name of your GEM or just your project in general. This will be the name of the main folder that is created in root_dir.
  • root_dir (str) – Path to where the folder named after gem_name will be created. If not provided, directories will not be created and output directories need to be specified for some steps.
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • gem (Model) – COBRApy Model object
  • gem_file_path (str) – Path to GEM file
  • gem_file_type (str) – GEM model type - sbml (or xml), mat, or json formats
  • genes_list (list) – List of gene IDs that you want to map
  • genes_and_sequences (dict) – Dictionary of gene IDs and their amino acid sequence strings
  • genome_path (str) – FASTA file of all protein sequences
  • write_protein_fasta_files (bool) – If individual protein FASTA files should be written out
  • description (str) – Description string of your project
  • custom_spont_id (str) – ID of spontaneous genes in a COBRA model which will be ignored for analysis
add_gene_ids(genes_list)[source]

Add gene IDs manually into the GEM-PRO project.

Parameters:genes_list (list) – List of gene IDs as strings.
base_dir

str – GEM-PRO project folder.

blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]

BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).

Parameters:
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
  • display_link (bool, optional) – Set to True if links to the HTML results should be displayed
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
custom_spont_id = None

str – ID of spontaneous genes in a COBRA model which will be ignored for analysis

data_dir

str – Directory where all data are stored.

df_homology_models

DataFrame – Get a dataframe of I-TASSER homology model results

df_kegg_metadata

DataFrame – Pandas DataFrame of KEGG metadata per protein.

df_pdb_blast

DataFrame – Get a dataframe of PDB BLAST results

df_pdb_metadata

DataFrame – Get a dataframe of PDB metadata (PDBs have to be downloaded first).

df_pdb_ranking

DataFrame – Get a dataframe of UniProt -> best structure in PDB results

df_proteins

DataFrame – Get a summary dataframe of all proteins in the project.

df_representative_sequences

DataFrame – Pandas DataFrame of representative sequence information per protein.

df_representative_structures

DataFrame – Get a dataframe of representative protein structure information.

df_uniprot_metadata

DataFrame – Pandas DataFrame of UniProt metadata per protein.

find_disulfide_bridges(representatives_only=True)[source]

Run Biopython’s disulfide bridge finder and store found bridges.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']

Parameters:representatives_only (bool) – If analysis should only be run on the representative structure
find_disulfide_bridges_parallelize(sc, representatives_only=True)[source]

Run Biopython’s disulfide bridge finder and store found bridges.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.annotations['SSBOND-biopython']

Parameters:representatives_only (bool) – If analysis should only be run on the representative structure
functional_genes

DictList – All genes with a representative protein structure.

genes = None

DictList – All protein-coding genes in this GEM-PRO project

genes_dir

str – Directory where all gene specific information is stored.

genes_with_a_representative_sequence

DictList – All genes with a representative sequence.

genes_with_a_representative_structure

DictList – All genes with a representative protein structure.

genes_with_experimental_structures

DictList – All genes that have at least one experimental structure.

genes_with_homology_models

DictList – All genes that have at least one homology model.

genes_with_structures

DictList – All genes with any mapped protein structures.

genome_path = None

str – Simple link to the filepath of the FASTA file containing all protein sequences

get_dssp_annotations(representatives_only=True, force_rerun=False)[source]

Run DSSP on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']

Parameters:
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_dssp_annotations_parallelize(sc, representatives_only=True, force_rerun=False)[source]

Run DSSP on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-dssp']

Parameters:
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_freesasa_annotations(include_hetatms=False, representatives_only=True, force_rerun=False)[source]

Run freesasa on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-freesasa']

Parameters:
  • include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_freesasa_annotations_parallelize(sc, include_hetatms=False, representatives_only=True, force_rerun=False)[source]

Run freesasa on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-freesasa']

Parameters:
  • include_hetatms (bool) – If HETATMs should be included in calculations. Defaults to False.
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_itasser_models(homology_raw_dir, custom_itasser_name_mapping=None, outdir=None, force_rerun=False)[source]

Copy generated I-TASSER models from a directory to the GEM-PRO directory.

Parameters:
  • homology_raw_dir (str) – Root directory of I-TASSER folders.
  • custom_itasser_name_mapping (dict) – Use this if your I-TASSER folder names differ from your model gene names. Input a dict of {model_gene: ITASSER_folder}.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
get_manual_homology_models(input_dict, outdir=None, clean=True, force_rerun=False)[source]

Copy homology models to the GEM-PRO project.

Requires an input of a dictionary formatted like so:

{
    model_gene: {
                    homology_model_id1: {
                                            'model_file': '/path/to/homology/model.pdb',
                                            'file_type': 'pdb',
                                            'additional_info': info_value
                                        },
                    homology_model_id2: {
                                            'model_file': '/path/to/homology/model.pdb',
                                            'file_type': 'pdb'
                                        }
                }
}
Parameters:
  • input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • clean (bool) – If homology files should be cleaned and saved as a new PDB file
  • force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
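
Before passing a dictionary to get_manual_homology_models, it can help to check its shape. The sketch below builds an input dictionary in the layout shown above and reports entries that lack expected keys. The helper validate_homology_input is hypothetical and not part of ssbio, and treating model_file and file_type as required is an assumption based on the fact that both appear in every entry of the example. Gene IDs, model IDs, and paths are placeholders.

```python
# Keys assumed to be required because both appear in each example entry
REQUIRED_KEYS = {"model_file", "file_type"}


def validate_homology_input(input_dict):
    """Return a list of problems found in a gene -> models dictionary."""
    problems = []
    for gene_id, models in input_dict.items():
        for model_id, info in models.items():
            missing = REQUIRED_KEYS - set(info)
            if missing:
                problems.append(
                    "{}/{}: missing {}".format(gene_id, model_id, sorted(missing))
                )
    return problems


# Placeholder gene ID, model IDs, and file paths
input_dict = {
    "gene1": {
        "model1": {
            "model_file": "/path/to/homology/model.pdb",
            "file_type": "pdb",
            "additional_info": "any extra value",
        },
        "model2": {
            "model_file": "/path/to/homology/model.pdb",
            "file_type": "pdb",
        },
    }
}
problems = validate_homology_input(input_dict)
print(problems)
```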
get_msms_annotations(representatives_only=True, force_rerun=False)[source]

Run MSMS on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-msms']

Parameters:
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_msms_annotations_parallelize(sc, representatives_only=True, force_rerun=False)[source]

Run MSMS on structures and store calculations.

Annotations are stored in the protein structure’s chain sequence at: <chain_prop>.seq_record.letter_annotations['*-msms']

Parameters:
  • representatives_only (bool) – If analysis should only be run on the representative structure
  • force_rerun (bool) – If calculations should be rerun even if an output file exists
get_scratch_predictions(path_to_scratch, results_dir, scratch_basename='scratch', num_cores=1, exposed_buried_cutoff=25, custom_gene_mapping=None)[source]

Run and parse SCRATCH results to predict secondary structure and solvent accessibility. Annotations are stored in the protein’s representative sequence at:

  • .annotations
  • .letter_annotations
Parameters:
  • path_to_scratch (str) – Path to SCRATCH executable
  • results_dir (str) – Path to SCRATCH results folder, which will have the files (scratch.ss, scratch.ss8, scratch.acc, scratch.acc20)
  • scratch_basename (str) – Basename of the SCRATCH results (‘scratch’ is default)
  • num_cores (int) – Number of cores to use to parallelize SCRATCH run
  • exposed_buried_cutoff (int) – Cutoff of exposed/buried for the acc20 predictions
  • custom_gene_mapping (dict) – Default parsing of SCRATCH output files is to look for the model gene IDs. If your output files contain IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
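
As a rough illustration of what exposed_buried_cutoff controls, the sketch below labels residues from per-residue relative solvent accessibility values. It is not the ssbio parser: the function name classify_accessibility, the input format (percent values per residue), and the choice that a value strictly above the cutoff counts as exposed are all assumptions.

```python
def classify_accessibility(rsa_values, exposed_buried_cutoff=25):
    """Label each residue as 'exposed' or 'buried' (illustrative only).

    rsa_values: per-residue relative solvent accessibility in percent
                (assumed input format).
    """
    return [
        "exposed" if rsa > exposed_buried_cutoff else "buried"
        for rsa in rsa_values
    ]


# Made-up accessibility values for a five-residue stretch
rsa = [0, 5, 25, 30, 95]
labels = classify_accessibility(rsa)
print(labels)
```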
get_sequence_properties(representatives_only=True)[source]

Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of all protein sequences. Results are stored in the protein’s respective SeqProp objects at .annotations

Parameters:representatives_only (bool) – If analysis should only be run on the representative sequences
get_tmhmm_predictions(tmhmm_results, custom_gene_mapping=None)[source]

Parse TMHMM results and store in the representative sequences.

This is a basic function to parse pre-run TMHMM results. Run TMHMM from the web service (http://www.cbs.dtu.dk/services/TMHMM/) by doing the following:

  1. Write all representative sequences in the GEM-PRO using the function write_representative_sequences_file
  2. Upload the file to http://www.cbs.dtu.dk/services/TMHMM/ and choose “Extensive, no graphics” as the output
  3. Copy and paste the results (ignoring the top header and above “HELP with output formats”) into a file and save it
  4. Run this function on that file
Parameters:
  • tmhmm_results (str) – Path to TMHMM results (long format)
  • custom_gene_mapping (dict) – Default parsing of TMHMM output is to look for the model gene IDs. If your output file contains IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
kegg_mapping_and_metadata(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:
  1. Download all metadata and sequence files in the sequences directory
  2. Create a KEGGProp object in the protein.sequences attribute
  3. Return a Pandas DataFrame of mapping results
Parameters:
  • kegg_organism_code (str) – The three letter KEGG code of your organism
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
kegg_mapping_and_metadata_parallelize(sc, kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to KEGG IDs using the KEGG service.

Steps:
  1. Download all metadata and sequence files in the sequences directory
  2. Create a KEGGProp object in the protein.sequences attribute
  3. Return a Pandas DataFrame of mapping results
Parameters:
  • sc (SparkContext) – Spark Context to parallelize this function
  • kegg_organism_code (str) – The three letter KEGG code of your organism
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
load_cobra_model(model)[source]

Load a COBRApy Model object into the GEM-PRO project.

Parameters:model (Model) – COBRApy Model object
manual_seq_mapping(gene_to_seq_dict, outdir=None, write_fasta_files=True, set_as_representative=True)[source]

Read a manual input dictionary of model gene IDs –> protein sequences. By default sets them as representative.

Parameters:
  • gene_to_seq_dict (dict) – Mapping of gene IDs to their protein sequence strings
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • write_fasta_files (bool) – If individual protein FASTA files should be written out
  • set_as_representative (bool) – If mapped sequences should be set as representative
manual_uniprot_mapping(gene_to_uniprot_dict, outdir=None, set_as_representative=True)[source]

Read a manual dictionary of model gene IDs –> UniProt IDs. By default sets them as representative.

This allows for mapping of the missing genes, or overriding of automatic mappings.

Input a dictionary of:

{
    <gene_id1>: <uniprot_id1>,
    <gene_id2>: <uniprot_id2>,
}
Parameters:
  • gene_to_uniprot_dict – Dictionary of mappings as shown above
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source]

Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.

The “Best structures” API (documented at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html) returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if coverage is the same, by resolution.

Parameters:
  • seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
  • outdir (str) – Output directory to cache JSON results of search
  • force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns:

A rank-ordered list of PDBProp objects that map to the UniProt ID

Return type:

list

missing_homology_models

list – List of genes with no mapping to any homology models.

missing_kegg_mapping

list – List of genes with no mapping to KEGG.

missing_pdb_structures

list – List of genes with no mapping to any experimental PDB structure.

missing_representative_sequence

list – List of genes with no mapping to a representative sequence.

missing_representative_structure

list – List of genes with no mapping to a representative structure.

missing_uniprot_mapping

list – List of genes with no mapping to UniProt.

model = None

Model – COBRApy model object

model_dir

str – Directory where original GEMs and GEM-related files are stored.

pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to each protein’s structures directory.

Parameters:
  • outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
  • pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
  • force_rerun (bool) – If files should be re-downloaded if they already exist
pdb_file_type = None

str – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB

prep_itasser_modeling(itasser_installation, itlib_folder, runtype, create_in_dir=None, execute_from_dir=None, all_genes=False, print_exec=False, **kwargs)[source]

Prepare to run I-TASSER homology modeling for genes without structures, or all genes.

Parameters:
  • itasser_installation (str) – Path to I-TASSER folder, e.g. ~/software/I-TASSER4.4
  • itlib_folder (str) – Path to ITLIB folder, e.g. ~/software/ITLIB
  • runtype – How you will be running I-TASSER - local, slurm, or torque
  • create_in_dir (str) – Local directory where folders will be created
  • execute_from_dir (str) – Optional path to execution directory - use this if you are copying the homology models to another location such as a supercomputer for running
  • all_genes (bool) – If all genes should be prepped, or only those without any mapped structures
  • print_exec (bool) – If the execution statement should be printed to run modelling

Todo

  • Document kwargs - extra options for I-TASSER, SLURM or Torque execution
  • Allow modeling of any sequence in sequences attribute, select by ID or provide SeqProp?
root_dir

str – Directory in which the GEM-PRO project folder (the base_dir attribute) is located.

set_representative_sequence(force_rerun=False)[source]

Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.

Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.

Parameters:force_rerun (bool) – Set to True to recheck stored sequences
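
The precedence described above can be written out as a small decision function. This is a sketch of the stated rule only, not the code ssbio runs: the function choose_representative and its input format (each source is either None or a dict with a pdbs list) are hypothetical.

```python
def choose_representative(manual=None, uniprot=None, kegg=None):
    """Pick a sequence source following the stated precedence (sketch).

    Each argument is None (no mapping) or a dict with a 'pdbs' list
    (hypothetical format). Returns the chosen source name, or None.
    """
    # Manually set sequences override everything else
    if manual is not None:
        return "manual"
    if uniprot is not None and kegg is not None:
        # KEGG wins only when it has PDBs and UniProt has none
        if kegg["pdbs"] and not uniprot["pdbs"]:
            return "kegg"
        return "uniprot"
    if uniprot is not None:
        return "uniprot"
    if kegg is not None:
        return "kegg"
    return None


# UniProt has no PDBs, KEGG has one (placeholder ID): KEGG is chosen
choice = choose_representative(uniprot={"pdbs": []}, kegg={"pdbs": ["XXXX"]})
print(choice)
```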
set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]

Set the representative structure for each protein from the structures in its structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

  • Homology model(s)
  • Ranked PDBs
  • BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:
  • seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
  • struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
  • pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
  • always_use_homology (bool) – If homology models should always be set as the representative structure
  • rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
  • seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
  • allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
  • allow_mutants (bool) – If mutations should be allowed or checked for
  • allow_deletions (bool) – If deletions should be allowed or checked for
  • allow_insertions (bool) – If insertions should be allowed or checked for
  • allow_unresolved (bool) – If unresolved residues should be allowed or checked for
  • skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
  • clean (bool) – If structures should be cleaned
  • force_rerun (bool) – If sequence to structure alignment should be rerun
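
The allow_missing_on_termini example above (0.1 on a 100 AA reference means residues 5 to 95 are checked) implies that half of the allowed fraction is trimmed from each terminus. A minimal sketch of that arithmetic follows. The function name checked_window and the rounding behavior for lengths that do not divide evenly are assumptions, not taken from ssbio.

```python
def checked_window(seq_len, allow_missing_on_termini=0.2):
    """Return (start, end) residue numbers checked for modifications.

    Half of the allowed fraction is ignored at each terminus, matching
    the documented example; rounding is an assumption of this sketch.
    """
    trim = int(round(seq_len * allow_missing_on_termini / 2))
    return trim, seq_len - trim


# Reproduces the documented example: 0.1 on a 100 AA reference sequence
window = checked_window(100, 0.1)
print(window)
```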

Todo

  • Remedy large structure representative setting
structures_dir

str – Directory where all structures are stored.

uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source]

Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.

Parameters:
  • model_gene_source (str) –

    The database source of your model gene IDs. See http://www.uniprot.org/help/api_idmapping for the full list. Common model gene sources are:

    • Ensembl Genomes - ENSEMBLGENOME_ID (e.g. E. coli b-numbers)
    • Entrez Gene (GeneID) - P_ENTREZGENEID
    • RefSeq Protein - P_REFSEQ_AC
  • custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
  • force_rerun (bool) – If you want to overwrite any existing mappings and files
write_representative_sequences_file(outname, outdir=None, set_ids_from_model=True)[source]

Write all the model’s sequences as a single FASTA file. By default, sets IDs to model gene IDs.

Parameters:
  • outname (str) – Name of the output FASTA file without the extension
  • outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
  • set_ids_from_model (bool) – If the gene ID source should be the model gene IDs, not the original sequence ID
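
For readers unfamiliar with the output, the sketch below writes a dictionary of sequences to a single FASTA file using only the standard library. It illustrates the file layout rather than the ssbio method itself: the helper write_fasta, the .fasta extension, and the 60-character line width are assumptions, and the gene IDs and sequences are made up.

```python
import os
import tempfile


def write_fasta(sequences, outname, outdir, ext=".fasta", width=60):
    """Write a dict of {sequence_id: sequence} to one FASTA file (sketch)."""
    outfile = os.path.join(outdir, outname + ext)
    with open(outfile, "w") as handle:
        for seq_id, seq in sequences.items():
            handle.write(">{}\n".format(seq_id))
            # Wrap sequence lines at a fixed width
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")
    return outfile


# Made-up gene IDs and short made-up sequences
sequences = {"gene1": "MKTAYIAKQR", "gene2": "MSDNE"}
outdir = tempfile.mkdtemp()
fasta_path = write_fasta(sequences, "representatives", outdir)
with open(fasta_path) as handle:
    fasta_text = handle.read()
print(fasta_text)
```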

ssbio.databases

PDBProp
class ssbio.databases.pdb.PDBProp(ident, description=None, chains=None, mapped_chains=None, structure_path=None, file_type=None)[source]

Store information about a protein structure from the Protein Data Bank.

Extends the StructProp class to allow initialization of the structure by its PDB ID, and then enabling downloads of the structure file as well as parsing its metadata.

Parameters:
  • ident (str) –
  • description (str) –
  • chains (str) –
  • mapped_chains (str) –
  • structure_path (str) –
  • file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
biological_assemblies = None

DictList – A list for storing Bioassembly objects related to this PDB ID

download_structure_file(outdir, file_type=None, load_header_metadata=True, force_rerun=False)[source]

Download a structure file from the PDB, specifying an output directory and a file type. Optionally download the mmCIF header file and parse data from it to store within this object.

Parameters:
  • outdir (str) – Path to output directory
  • file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
  • load_header_metadata (bool) – If header metadata should be loaded into this object, fastest with mmtf files
  • force_rerun (bool) – If structure file should be downloaded even if it already exists
ssbio.databases.pdb.best_structures(uniprot_id, outname=None, outdir=None, seq_ident_cutoff=0.0, force_rerun=False)[source]

Use the PDBe REST service to query for the best PDB structures for a UniProt ID.

More information: https://www.ebi.ac.uk/pdbe/api/doc/sifts.html
Link used to retrieve results: https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/:accession

The service returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if coverage is the same, by resolution.

Here is the ranking algorithm described by the PDB paper: https://nar.oxfordjournals.org/content/44/D1/D385.full

“Finally, a single quality indicator is also calculated for each entry by taking the harmonic average of all the percentile scores representing model and model-data-fit quality measures and then subtracting 10 times the numerical value of the resolution (in Angstrom) of the entry to ensure that resolution plays a role in characterising the quality of a structure. This single empirical ‘quality measure’ value is used by the PDBe query system to sort results and identify the ‘best’ structure in a given context. At present, entries determined by methods other than X-ray crystallography do not have similar data quality information available and are not considered as ‘best structures’.”
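
The quoted quality measure can be worked through numerically: take the harmonic mean of the percentile scores and subtract 10 times the resolution. The sketch below follows that description literally. The function name quality_measure and the example scores are hypothetical, and which percentile scores PDBe actually includes is not specified here.

```python
from statistics import harmonic_mean


def quality_measure(percentile_scores, resolution):
    """Harmonic mean of percentile scores minus 10 x resolution (Angstrom)."""
    return harmonic_mean(percentile_scores) - 10 * resolution


# Hypothetical percentile scores for one entry solved at 2.0 Angstrom
scores = [80.0, 60.0, 48.0]
quality = quality_measure(scores, resolution=2.0)
print(quality)
```

Because the resolution term is subtracted, two entries with identical percentile scores are separated by resolution alone: each additional Angstrom costs 10 points.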

Parameters:
  • uniprot_id (str) – UniProt Accession ID
  • outname (str) – Basename of the output file of JSON results
  • outdir (str) – Path to output directory of JSON results
  • seq_ident_cutoff (float) – Cutoff results based on percent coverage (in decimal form)
  • force_rerun (bool) – Obtain best structures mapping ignoring previously downloaded results
Returns:

Rank-ordered list of dictionaries representing chain-specific PDB entries. Keys are:
  • pdb_id: the PDB ID which maps to the UniProt ID
  • chain_id: the specific chain of the PDB which maps to the UniProt ID
  • coverage: the percent coverage of the entire UniProt sequence
  • resolution: the resolution of the structure
  • start: the structure residue number which maps to the start of the mapped sequence
  • end: the structure residue number which maps to the end of the mapped sequence
  • unp_start: the sequence residue number which maps to the structure start
  • unp_end: the sequence residue number which maps to the structure end
  • experimental_method: type of experiment used to determine structure
  • tax_id: taxonomic ID of the protein’s original organism

Return type:

list
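
The quality measure quoted above can be written out as a small calculation. The following is only a sketch of that formula: the function name, the entry labels, and the percentile values are hypothetical, and this is not the PDBe or ssbio implementation.

```python
from statistics import harmonic_mean


def quality_indicator(percentile_scores, resolution):
    """Harmonic mean of the percentile scores minus 10 times the resolution (Angstroms)."""
    return harmonic_mean(percentile_scores) - 10 * resolution


# Hypothetical entries: (label, percentile scores, resolution in Angstroms)
entries = [
    ("entry_a", [80, 60, 70], 2.5),
    ("entry_b", [75, 65, 70], 1.5),
]

# Higher quality indicator ranks first
ranked = sorted(entries, key=lambda e: quality_indicator(e[1], e[2]), reverse=True)
print([label for label, _, _ in ranked])
```

With similar percentile scores, the entry with the better (lower) resolution ranks first, which is the role the resolution term plays in the quoted description.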

ssbio.databases.pdb.blast_pdb(seq, outfile='', outdir='', evalue=0.0001, seq_ident_cutoff=0.0, link=False, force_rerun=False)[source]

Returns a list of BLAST hits of a sequence to available structures in the PDB.

Parameters:
  • seq (str) – Your sequence, in string format
  • outfile (str) – Name of output file
  • outdir (str, optional) – Path to output directory. Default is the current directory.
  • evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
  • seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
  • link (bool, optional) – Set to True if a link to the HTML results should be displayed
  • force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
Returns:

Rank ordered list of BLAST hits in dictionaries.

Return type:

list

ssbio.databases.pdb.blast_pdb_df(blast_results)[source]

Make a dataframe of BLAST results

ssbio.databases.pdb.download_mmcif_header(pdb_id, outdir='', force_rerun=False)[source]

Download a mmCIF header file from the RCSB PDB by ID.

Parameters:
  • pdb_id – PDB ID
  • outdir – Optional output directory, default is current working directory
  • force_rerun – If the file should be downloaded again even if it exists
Returns:

Path to outfile

Return type:

str

ssbio.databases.pdb.download_sifts_xml(pdb_id, outdir='', force_rerun=False)[source]

Download the SIFTS file for a PDB ID.

Parameters:
  • pdb_id (str) – PDB ID
  • outdir (str) – Output directory, current working directory if not specified.
  • force_rerun (bool) – If the file should be downloaded again even if it exists
Returns:

Path to downloaded file

Return type:

str

ssbio.databases.pdb.download_structure(pdb_id, file_type, outdir='', only_header=False, force_rerun=False)[source]

Download a structure from the RCSB PDB by ID. Specify the file type desired.

Parameters:
  • pdb_id – PDB ID
  • file_type – pdb, pdb.gz, mmcif, cif, cif.gz, xml.gz, mmtf, mmtf.gz
  • outdir – Optional output directory
  • only_header – If only the header file should be downloaded
  • force_rerun – If the file should be downloaded again even if it exists
Returns:

Path to outfile

Return type:

str

Deprecated since version 1.0: This will be removed in 2.0. Use Biopython’s PDBList.retrieve_pdb_file function instead

ssbio.databases.pdb.get_bioassembly_info(pdb_id, biomol_num, cache=False, outdir=None, force_rerun=False)[source]

Get metadata about a bioassembly from the RCSB PDB’s REST API.

See: https://www.rcsb.org/pdb/rest/bioassembly/bioassembly?structureId=1hv4&nr=1 The API returns an XML file containing the information on a biological assembly that looks like this:

<bioassembly structureId="1HV4" assemblyNr="1" method="PISA" desc="author_and_software_defined_assembly">
    <transformations operator="1" chainIds="A,B,C,D">
        <transformation index="1">
            <matrix m11="1.00000000" m12="0.00000000" m13="0.00000000" m21="0.00000000" m22="1.00000000" m23="0.00000000" m31="0.00000000" m32="0.00000000" m33="1.00000000"/>
            <shift v1="0.00000000" v2="0.00000000" v3="0.00000000"/>
        </transformation>
    </transformations>
</bioassembly>
Parameters:
  • pdb_id (str) – PDB ID
  • biomol_num (int) – Biological assembly number you are interested in
  • cache (bool) – If the XML file should be downloaded
  • outdir (str) – If cache, then specify the output directory
  • force_rerun (bool) – If cache, and if file exists, specify if API should be queried again
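
As an illustration of how the XML shown above could be read, the sketch below parses it with the standard library. The function name and the layout of the returned dictionary are assumptions made for this example and may differ from what get_bioassembly_info returns.

```python
import xml.etree.ElementTree as ET

xml_text = """<bioassembly structureId="1HV4" assemblyNr="1" method="PISA" desc="author_and_software_defined_assembly">
    <transformations operator="1" chainIds="A,B,C,D">
        <transformation index="1">
            <matrix m11="1.00000000" m12="0.00000000" m13="0.00000000" m21="0.00000000" m22="1.00000000" m23="0.00000000" m31="0.00000000" m32="0.00000000" m33="1.00000000"/>
            <shift v1="0.00000000" v2="0.00000000" v3="0.00000000"/>
        </transformation>
    </transformations>
</bioassembly>"""


def parse_bioassembly(text):
    """Collect chain IDs, rotation matrices and shifts from a bioassembly XML string."""
    root = ET.fromstring(text)
    info = {
        "pdb_id": root.get("structureId"),
        "assembly": int(root.get("assemblyNr")),
        "operators": [],
    }
    for group in root.findall("transformations"):
        chains = group.get("chainIds").split(",")
        for transformation in group.findall("transformation"):
            m = transformation.find("matrix").attrib
            s = transformation.find("shift").attrib
            matrix = [[float(m["m%d%d" % (i, j)]) for j in (1, 2, 3)] for i in (1, 2, 3)]
            shift = [float(s["v%d" % i]) for i in (1, 2, 3)]
            info["operators"].append({"chains": chains, "matrix": matrix, "shift": shift})
    return info


info = parse_bioassembly(xml_text)
print(info["operators"][0]["chains"])
```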
ssbio.databases.pdb.get_num_bioassemblies(pdb_id, cache=False, outdir=None, force_rerun=False)[source]

Check if there are bioassemblies using the PDB REST API, and if there are, get the number of bioassemblies available.

See: https://www.rcsb.org/pages/webservices/rest, section ‘List biological assemblies’

Not all PDB entries have biological assemblies available and some have multiple. Details that are necessary to recreate a biological assembly from the asymmetric unit can be accessed from the following requests.

  • Number of biological assemblies associated with a PDB entry
  • Access the transformation information needed to generate a biological assembly (nr=0 will return information for the asymmetric unit, nr=1 will return information for the first assembly, etc.)

A query of https://www.rcsb.org/pdb/rest/bioassembly/nrbioassemblies?structureId=1hv4 returns this:

<nrBioAssemblies structureId="1HV4" hasAssemblies="true" count="2"/>
Parameters:
  • pdb_id (str) – PDB ID
  • cache (bool) – If the XML file should be downloaded
  • outdir (str) – If cache, then specify the output directory
  • force_rerun (bool) – If cache, and if file exists, specify if API should be queried again
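
A minimal sketch of reading the response shown above. The helper name is hypothetical, and returning 0 when hasAssemblies is not "true" is an assumption of this example.

```python
import xml.etree.ElementTree as ET


def count_bioassemblies(xml_text):
    """Return the number of biological assemblies reported in the XML response."""
    root = ET.fromstring(xml_text)
    if root.get("hasAssemblies") != "true":
        return 0
    return int(root.get("count"))


response = '<nrBioAssemblies structureId="1HV4" hasAssemblies="true" count="2"/>'
print(count_bioassemblies(response))
```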
ssbio.databases.pdb.get_release_date(pdb_id)[source]

Quick way to get the release date of a PDB ID using the table of results from the REST service

Returns None if the release date is not available.

Returns:Release date of a PDB ID
Return type:str
ssbio.databases.pdb.get_resolution(pdb_id)[source]

Quick way to get the resolution of a PDB ID using the table of results from the REST service

Returns infinity if the resolution is not available.

Returns:resolution of a PDB ID in Angstroms
Return type:float

Todo

  • Unit test
ssbio.databases.pdb.map_uniprot_resnum_to_pdb(uniprot_resnum, chain_id, sifts_file)[source]

Map a UniProt residue number to its corresponding PDB residue number.

This function requires that the SIFTS file be downloaded, and also a chain ID (as different chains may have different mappings).

Parameters:
  • uniprot_resnum (int) – integer of the residue number you’d like to map
  • chain_id (str) – string of the PDB chain to map to
  • sifts_file (str) – Path to the SIFTS XML file
Returns:

tuple containing:

  • mapped_resnum (int): Mapped residue number
  • is_observed (bool): Indicates if the 3D structure actually shows the residue

Return type:

(tuple)

ssbio.databases.pdb.parse_mmcif_header(infile)[source]

Parse a couple important fields from the mmCIF file format with some manual curation of ligands.

If you want full access to the mmCIF file just use the MMCIF2Dict class in Biopython.

Parameters:infile – Path to mmCIF file
Returns:Dictionary of parsed header
Return type:dict
ssbio.databases.pdb.parse_mmtf_header(infile)[source]

Parse an MMTF file and return basic header-like information.

Parameters:infile (str) – Path to MMTF file
Returns:Dictionary of parsed header
Return type:dict

Todo

  • Can this be sped up by not parsing the 3D coordinate info somehow?
  • OR just store the sequences when this happens since it is already being parsed.
PISA
ssbio.databases.pisa.download_pisa_multimers_xml(pdb_ids, save_single_xml_files=True, outdir=None, force_rerun=False)[source]

Download the PISA XML file for multimers.

See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info

XML description of macromolecular assemblies:
http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contains XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.
Parameters:
  • pdb_ids (str, list) – PDB ID or list of IDs
  • save_single_xml_files (bool) – If single XML files should be saved per PDB ID. If False, if multiple PDB IDs are provided, then a single, combined XML output file is downloaded
  • outdir (str) – Directory to output PISA XML files
  • force_rerun (bool) – Redownload files if they already exist
Returns:

List of files downloaded

Return type:

list
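
The batching advice in the note above (20-50 codes per request, comma-separated with no spaces) can be sketched as follows. Only the URLs are built here and nothing is downloaded; the function name and the placeholder IDs are hypothetical.

```python
PISA_URL = "http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?"


def pisa_batch_urls(pdb_ids, batch_size=50):
    """Group PDB IDs into comma-separated batches and build one URL per batch."""
    ids = [pdb_id.strip() for pdb_id in pdb_ids]
    return [
        PISA_URL + ",".join(ids[i:i + batch_size])
        for i in range(0, len(ids), batch_size)
    ]


# Placeholder identifiers standing in for real PDB codes
placeholder_ids = ["id%03d" % i for i in range(120)]
urls = pisa_batch_urls(placeholder_ids, batch_size=50)
print(len(urls))
```

Requests built this way should still be sent one at a time, waiting for each download to finish before starting the next, as the note requires.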

ssbio.databases.pisa.parse_pisa_multimers_xml(pisa_multimers_xml, download_structures=False, outdir=None, force_rerun=False)[source]

Retrieve PISA information from an XML results file

See: http://www.ebi.ac.uk/pdbe/pisa/pi_download.html for more info

XML description of macromolecular assemblies:
http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/multimers.pisa?pdbcodelist where “pdbcodelist” is a comma-separated (strictly no spaces) list of PDB codes. The resulting file contains XML output of assembly data, equivalent to that displayed in PISA assembly pages, for each of the specified PDB entries. NOTE: If a mass-download is intended, please minimize the number of retrievals by specifying as many PDB codes in the URL as feasible (20-50 is a good range), and never send another URL request until the previous one has been completed (meaning that the multimers.pisa file has been downloaded). Excessive requests will silently die in the server queue.
Parameters:
  • pisa_multimers_xml (str) – Path to PISA XML output file
  • download_structures (bool) – If assembly files should be downloaded
  • outdir (str) – Directory to output assembly files
  • force_rerun (bool) – Redownload files if they already exist
Returns:

Dictionary of parsed PISA information

Return type:

dict

ssbio.databases.pisa.pdb_chain_stoichiometry_biomolone(pdbid)[source]

Get the stoichiometry of the chains in biological assembly 1 as a dictionary.

Steps taken are:

  1. Download PDB and parse header, make biomolecule if provided
  2. Count how many times each chain appears in biomolecule #1
  3. Convert chain ID to UniProt ID
  4. Return final dictionary

Parameters:pdbid (str) – 4 character PDB ID
Returns:{(ChainID,UniProtID): # occurrences}
Return type:dict
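
The counting and ID conversion steps can be illustrated with a short sketch. The chain list and the chain-to-UniProt mapping below are placeholders, and the helper name is hypothetical; in practice both inputs would come from the parsed PDB header and a chain mapping.

```python
from collections import Counter


def chain_stoichiometry(assembly_chains, chain_to_uniprot):
    """Count chain occurrences and key the counts by (chain ID, UniProt ID)."""
    counts = Counter(assembly_chains)
    return {
        (chain, chain_to_uniprot.get(chain)): n
        for chain, n in counts.items()
    }


# Placeholder data: chains A and B each appear twice in biological assembly 1
chains_in_assembly = ["A", "B", "A", "B"]
mapping = {"A": "UNIPROT_1", "B": "UNIPROT_2"}
stoichiometry = chain_stoichiometry(chains_in_assembly, mapping)
print(stoichiometry)
```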
SWISSMODEL
class ssbio.databases.swissmodel.SWISSMODEL(metadata_dir)[source]

Methods to parse through a SWISS-MODEL metadata set.

Download a particular organism’s metadata from SWISS-MODEL here: https://swissmodel.expasy.org/repository

Parameters:metadata_dir (str) – Path to the extracted SWISS-MODEL_Repository folder
all_models = None

dict – Dictionary of lists, UniProt ID as the keys

download_models(uniprot_acc, outdir='', force_rerun=False)[source]

Download all models available for a UniProt accession number.

Parameters:
  • uniprot_acc (str) – UniProt ACC/ID
  • outdir (str) – Path to output directory, uses working directory if not set
  • force_rerun (bool) – Force a redownload of the models if they already exist
Returns:

Paths to the downloaded models

Return type:

list

get_model_filepath(infodict)[source]

Get the path to the homology model using information from the index dictionary for a single model.

Example: use self.get_models(UNIPROT_ID) to get all the models, which returns a list of dictionaries.
Use one of those dictionaries as input to this function to get the filepath to the model itself.
Parameters:infodict (dict) – Information about a model from get_models
Returns:Path to homology model
Return type:str
get_models(uniprot_acc)[source]

Return all available models for a UniProt accession number.

Parameters:uniprot_acc (str) – UniProt ACC/ID
Returns:All available models in SWISS-MODEL for this UniProt entry
Return type:dict
metadata_dir = None

str – Path to the extracted SWISS-MODEL_Repository folder

metadata_index_json

str – Path to the INDEX_JSON file.

organize_models(outdir, force_rerun=False)[source]

Organize and rename SWISS-MODEL models to a single folder with a name containing template information.

Parameters:
  • outdir (str) – New directory to copy renamed models to
  • force_rerun (bool) – If models should be copied again even if they already exist
Returns:

Dictionary of lists, UniProt IDs as the keys and new file paths as the values

Return type:

dict

parse_metadata()[source]

Parse the INDEX_JSON file and reorganize it as a dictionary of lists.

uniprots_modeled

list – Return all UniProt accession numbers with at least one model

ssbio.databases.swissmodel.get_oligomeric_state(swiss_model_path)[source]

Parse the oligomeric prediction in a SWISS-MODEL repository file

As of 2018-02-26, works on all E. coli models. Untested on other pre-made organism models.

Parameters:swiss_model_path (str) – Path to SWISS-MODEL PDB file
Returns:Information parsed about the oligomeric state
Return type:dict
ssbio.databases.swissmodel.translate_ostat(ostat)[source]

Translate the OSTAT field to an integer.

As of 2018-02-26, works on all E. coli models. Untested on other pre-made organism models.

Parameters:ostat (str) – Predicted oligomeric state of the PDB file
Returns:Translated string to integer
Return type:int
UniProtProp
class ssbio.databases.uniprot.UniProtProp(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, xml_path=None, gff_path=None)[source]

Generic class to store information on a UniProt entry, extended from a SeqProp object.

The main utilities of this class are to:

  1. Download and/or parse UniProt text or xml files
  2. Store extra parsed information in attributes
uniprot

str – Main UniProt accession code

alt_uniprots

list – Alternate accession codes that point to the main one

file_type

str – Metadata file type

reviewed

bool – If this entry is a “reviewed” entry. If None, then status is unknown.

ec_number

str – EC number

pfam

list – PFAM IDs

entry_version

str – Date of last update of the UniProt entry

seq_version

str – Date of last update of the UniProt sequence

download_metadata_file(outdir, force_rerun=False)[source]

Download and load the UniProt XML file

download_seq_file(outdir, force_rerun=False)[source]

Download and load the UniProt FASTA file

features

list – Get the features from the feature file, metadata file, or in memory

ranking_score()[source]

Provide a score for this UniProt ID based on reviewed (True=1, False=0) + number of PDBs

Returns:Scoring for this ID
Return type:int
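
The scoring rule described above amounts to a one-line calculation. This is a sketch with a hypothetical standalone function and placeholder structure IDs; treating an unknown (None) review status as 0 is an assumption of this example.

```python
def ranking_score(reviewed, pdbs):
    """Score = 1 if the entry is reviewed (else 0) plus the number of mapped PDBs."""
    return int(bool(reviewed)) + len(pdbs)


# Placeholder structure IDs
print(ranking_score(True, ["pdb_1", "pdb_2"]))
print(ranking_score(None, []))
```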
seq

Seq – Get the Seq object from the sequence file, metadata file, or in memory

ssbio.databases.uniprot.blast_uniprot(seq_str, seq_ident=1, evalue=0.0001, reviewed_only=True, organism=None)[source]

BLAST the UniProt db to find what IDs match the sequence input

Parameters:
  • seq_str – Sequence string
  • seq_ident – Percent identity to match
  • evalue – E-value of BLAST hit

Returns:

ssbio.databases.uniprot.download_uniprot_file(uniprot_id, filetype, outdir='', force_rerun=False)[source]

Download a UniProt file for a UniProt ID/ACC

Parameters:
  • uniprot_id – Valid UniProt ID
  • filetype – txt, fasta, xml, rdf, or gff
  • outdir – Directory to download the file
Returns:

Absolute path to file

Return type:

str

ssbio.databases.uniprot.get_fasta(uniprot_id)[source]

Get the protein sequence for a UniProt ID as a string.

Parameters:uniprot_id – Valid UniProt ID
Returns:String of the protein (amino acid) sequence
Return type:str
ssbio.databases.uniprot.is_valid_uniprot_id(instring)[source]

Check if a string is a valid UniProt ID.

See regex from: http://www.uniprot.org/help/accession_numbers

Parameters:instring – any string identifier

Returns: True if the string is a valid UniProt ID
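
A minimal sketch of such a check, assuming the accession pattern published on the UniProt help page linked above. Whether ssbio applies exactly this pattern is not confirmed here, and the helper name is hypothetical.

```python
import re

# Accession pattern as given on the UniProt accession numbers help page (assumed)
UNIPROT_ACC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_acc(instring):
    """Return True if the whole string matches the UniProt accession pattern."""
    return UNIPROT_ACC.fullmatch(instring) is not None


for candidate in ("P12345", "A0A022YWF9", "eco:b0001"):
    print(candidate, looks_like_uniprot_acc(candidate))
```

A pattern match only shows that the string has the right shape; it does not confirm that the accession exists in UniProt.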

ssbio.databases.uniprot.old_parse_uniprot_txt_file(infile)[source]

From the boscoh/uniprot GitHub repository. Parses the text of metadata retrieved from uniprot.org.

Only a few fields have been parsed, but this provides a template for the other fields.

A single description is generated from joining alternative descriptions.

Returns a dictionary with the main UNIPROT ACC as keys.

ssbio.databases.uniprot.parse_uniprot_txt_file(infile)[source]

Parse a raw UniProt metadata file and return a dictionary.

Parameters:infile – Path to metadata file
Returns:Metadata dictionary
Return type:dict
ssbio.databases.uniprot.parse_uniprot_xml_metadata(sr)[source]

Load relevant attributes and dbxrefs from a parsed UniProt XML file in a SeqRecord.

Returns:All parsed information
Return type:dict
ssbio.databases.uniprot.uniprot_ec(uniprot_id)[source]

Retrieve the EC number annotation for a UniProt ID.

Parameters:uniprot_id – Valid UniProt ID

Returns:

ssbio.databases.uniprot.uniprot_reviewed_checker(uniprot_id)[source]

Check if a single UniProt ID is reviewed or not.

Parameters:uniprot_id
Returns:If the entry is reviewed
Return type:bool
ssbio.databases.uniprot.uniprot_reviewed_checker_batch(uniprot_ids)[source]

Batch check if uniprot IDs are reviewed or not

Parameters:uniprot_ids – UniProt ID or list of UniProt IDs
Returns:A dictionary of {UniProtID: Boolean}
Return type:dict
ssbio.databases.uniprot.uniprot_sites(uniprot_id)[source]

Retrieve a list of UniProt sites parsed from the feature file

Sites are defined here: http://www.uniprot.org/help/site and here: http://www.uniprot.org/help/function_section

Parameters:uniprot_id – Valid UniProt ID

Returns:

KEGGProp
class ssbio.databases.kegg.KEGGProp(seq, id, name='<unknown name>', description='<unknown description>', fasta_path=None, txt_path=None, gff_path=None)[source]
ssbio.databases.kegg.download_kegg_aa_seq(gene_id, outdir=None, force_rerun=False)[source]

Download a FASTA sequence of a protein from the KEGG database and return the path.

Parameters:
  • gene_id – the gene identifier
  • outdir – optional path to output directory
Returns:

Path to FASTA file

ssbio.databases.kegg.download_kegg_gene_metadata(gene_id, outdir=None, force_rerun=False)[source]

Download the KEGG flatfile for a KEGG ID and return the path.

Parameters:
  • gene_id – KEGG gene ID (with organism code), e.g. “eco:1244”
  • outdir – optional output directory of metadata
Returns:

Path to metadata file

ssbio.databases.kegg.map_kegg_all_genes(organism_code, target_db)[source]

Map all of an organism’s gene IDs to the target database.

This is faster than supplying a specific list of genes to map, plus there seems to be a limit on the number you can map with a manual REST query anyway.

Parameters:
  • organism_code – the three letter KEGG code of your organism
  • target_db – ncbi-proteinid | ncbi-geneid | uniprot
Returns:

Dictionary of ID mapping
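
The sketch below shows how a whole-organism mapping could be turned into a dictionary, assuming the response is plain text with one tab-separated pair of prefixed IDs per line (the shape returned by KEGG's REST conv operation). The sample text uses placeholder identifiers rather than real query output, and the parser name is hypothetical.

```python
def parse_conv_response(text):
    """Map source gene IDs to target database IDs, dropping the database prefixes."""
    mapping = {}
    for line in text.strip().splitlines():
        source, target = line.split("\t")
        mapping[source.split(":", 1)[1]] = target.split(":", 1)[1]
    return mapping


# Placeholder response text: "<org>:<gene>\t<target_db>:<id>" per line
sample = "eco:gene_1\tncbi-geneid:111\neco:gene_2\tncbi-geneid:222\n"
print(parse_conv_response(sample))
```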

ssbio.databases.kegg.parse_kegg_gene_metadata(infile)[source]

Parse the KEGG flatfile and return a dictionary of metadata.

Dictionary keys are:
  • refseq
  • uniprot
  • pdbs
  • taxonomy
Parameters:infile – Path to KEGG flatfile
Returns:Dictionary of metadata
Return type:dict

ssbio.protein.structure.utils

CleanPDB
ssbio.protein.structure.utils.cleanpdb.clean_pdb(pdb_file, out_suffix='_clean', outdir=None, force_rerun=False, remove_atom_alt=True, keep_atom_alt_id='A', remove_atom_hydrogen=True, add_atom_occ=True, remove_res_hetero=True, keep_chemicals=None, keep_res_only=None, add_chain_id_if_empty='X', keep_chains=None)[source]

Clean a PDB file.

Parameters:
  • pdb_file (str) – Path to input PDB file
  • out_suffix (str) – Suffix to append to original filename
  • outdir (str) – Path to output directory
  • force_rerun (bool) – If structure should be re-cleaned if a clean file exists already
  • remove_atom_alt (bool) – Remove alternate positions
  • keep_atom_alt_id (str) – If removing alternate positions, which alternate ID to keep
  • remove_atom_hydrogen (bool) – Remove hydrogen atoms
  • add_atom_occ (bool) – Add atom occupancy fields if not present
  • remove_res_hetero (bool) – Remove all HETATMs
  • keep_chemicals (str, list) – If removing HETATMs, keep specified chemical names
  • keep_res_only (str, list) – Keep ONLY specified resnames, deletes everything else!
  • add_chain_id_if_empty (str) – Add a chain ID if not present
  • keep_chains (str, list) – Keep only these chains
Returns:

Path to cleaned PDB file

Return type:

str
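
To make the filtering options more concrete, here is a rough sketch that applies three of them (alternate positions, hydrogens, HETATMs) to fixed-width PDB lines. It is not the ssbio implementation: both function names are hypothetical, the demo lines use dummy coordinates, and hydrogens are detected from the element columns (77-78) only.

```python
def pdb_line(record, serial, name, altloc, resname, chain, resseq, element):
    """Build a fixed-width PDB coordinate line with dummy coordinates."""
    return (
        "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}    "
        "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2}"
    ).format(record, serial, name, altloc, resname, chain, resseq,
             0.0, 0.0, 0.0, 1.0, 0.0, element)


def clean_pdb_lines(lines, keep_alt="A", remove_hydrogen=True, remove_hetero=True):
    """Drop HETATMs, unwanted alternate positions and hydrogens from PDB lines."""
    kept = []
    for line in lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            kept.append(line)  # non-coordinate records pass through unchanged
            continue
        if remove_hetero and record == "HETATM":
            continue
        if line[16] not in (" ", keep_alt):  # column 17: alternate location ID
            continue
        if remove_hydrogen and line[76:78].strip() == "H":
            continue
        kept.append(line)
    return kept


demo = [
    pdb_line("ATOM", 1, " CA ", " ", "ALA", "A", 1, "C"),
    pdb_line("ATOM", 2, " CB ", "A", "ALA", "A", 1, "C"),
    pdb_line("ATOM", 3, " CB ", "B", "ALA", "A", 1, "C"),
    pdb_line("ATOM", 4, " H  ", " ", "ALA", "A", 1, "H"),
    pdb_line("HETATM", 5, " O  ", " ", "HOH", "A", 2, "O"),
]
cleaned = clean_pdb_lines(demo)
print(len(cleaned))
```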

MutatePDB
DOCK
class ssbio.protein.structure.utils.dock.DOCK(structure_id, pdb_file, amb_file, flex1_file, flex2_file, root_dir=None)[source]

Class to prepare a structure file for docking with DOCK6.

Attributes:

auto_flexdock(binding_residues, radius, ligand_path=None, force_rerun=False)[source]

Run DOCK6 on a PDB file, given its binding residues and a radius around them.

Provide a path to a ligand to dock a ligand to it. If no ligand is provided, DOCK6 preparations will be run on that structure file.

Parameters:
  • binding_residues (str) – Comma separated string of residues (eg: ‘144,170,199’)
  • radius (int, float) – Radius around binding residues to dock to
  • ligand_path (str) – Path to ligand (mol2 format) to dock to protein
  • force_rerun (bool) – If method should be rerun even if output files exist
binding_site_mol2(residues, force_rerun=False)[source]

Create mol2 of only binding site residues from the receptor

This function will take in a .pdb file (preferably the _receptor_noH.pdb file) and a string of residues (eg: ‘144,170,199’) and delete all other residues in the .pdb file. It then saves the coordinates of the selected residues as a .mol2 file. This is necessary for Chimera to select spheres within the radius of the binding site.

Parameters:
  • residues (str) – Comma separated string of residues (eg: ‘144,170,199’)
  • force_rerun (bool) – If method should be rerun even if output file exists
dms_maker(force_rerun=False)[source]

Create surface representation (dms file) of receptor

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
do_dock6_flexible(ligand_path, force_rerun=False)[source]

Dock a ligand to the protein.

Parameters:
  • ligand_path (str) – Path to ligand (mol2 format) to dock to protein
  • force_rerun (bool) – If method should be rerun even if output file exists
dock_dir

str – DOCK folder

dockprep(force_rerun=False)[source]

Prepare a PDB file for docking by first converting it to mol2 format.

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
grid(force_rerun=False)[source]

Create the scoring grid within the dummy box.

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
protein_only_and_noH(keep_ligands=None, force_rerun=False)[source]

Isolate the receptor by stripping everything except protein and specified ligands.

Parameters:
  • keep_ligands (str, list) – Ligand(s) to keep in PDB file
  • force_rerun (bool) – If method should be rerun even if output file exists
root_dir

str – Directory where DOCK project folder is located

showbox(force_rerun=False)[source]

Create the dummy PDB box around the selected spheres.

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
sphere_selector_using_residues(radius, force_rerun=False)[source]

Select spheres based on binding site residues

Parameters:
  • radius (int, float) – Radius around binding residues to dock to
  • force_rerun (bool) – If method should be rerun even if output file exists
sphgen(force_rerun=False)[source]

Create sphere representation (sph file) of receptor from the surface representation

Parameters:force_rerun (bool) – If method should be rerun even if output file exists
ssbio.protein.structure.utils.dock.parse_results_mol2(mol2_outpath)[source]

Parse a DOCK6 mol2 output file, return a Pandas DataFrame of the results.

Parameters:mol2_outpath (str) – Path to mol2 output file
Returns:Pandas DataFrame of the results
Return type:DataFrame

ssbio.protein.structure.properties

Structure Residues
ssbio.protein.structure.properties.residues.distance_to_site(residue_of_interest, residues, model)[source]

Calculate the distance between an amino acid and a group of amino acids.

Parameters:
  • residue_of_interest – Residue number you are interested in (e.g. a mutation)
  • residues – List of residue numbers
Returns:

Distance (in Angstroms) to the group of residues

Return type:

float

ssbio.protein.structure.properties.residues.get_structure_seqrecords(model)[source]

Get a list of SeqRecords for the sequences in a structure model.

Special cases include:
  • Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
  • HETATMs. Currently written as an “X”, or unknown amino acid.
Parameters:model – Biopython Model object of a Structure
Returns:List of SeqRecords
Return type:list
ssbio.protein.structure.properties.residues.get_structure_seqs(pdb_file, file_type)[source]

Get a dictionary of a PDB file’s sequences.

Special cases include:
  • Insertion codes. In the case of residue numbers like “15A”, “15B”, both residues are written out. Example: 9LPR
  • HETATMs. Currently written as an “X”, or unknown amino acid.
Parameters:pdb_file – Path to PDB file
Returns:Dictionary of: {chain_id: sequence}
Return type:dict
ssbio.protein.structure.properties.residues.hse_output(pdb_file, file_type)[source]

The solvent exposure of an amino acid residue is important for analyzing, understanding and predicting aspects of protein structure and function [73]. A residue’s solvent exposure can be classified into four categories: exposed, partly exposed, buried and deeply buried. Hamelryck et al. [73] established a new 2D measure that provides a different view of solvent exposure: half-sphere exposure (HSE). By conceptually dividing the sphere of a residue into two halves, HSE-up and HSE-down, HSE provides a more detailed description of an amino acid residue’s spatial neighborhood. HSE is calculated from a PDB file by the hsexpo module implemented in the BioPython package [74].

http://onlinelibrary.wiley.com/doi/10.1002/prot.20379/abstract

Parameters:pdb_file

Returns:

ssbio.protein.structure.properties.residues.match_structure_sequence(orig_seq, new_seq, match='X', fill_with='X', ignore_excess=False)[source]

Correct a sequence to match inserted X’s in a structure sequence

This is useful for mapping a sequence obtained from structural tools like MSMS or DSSP
to the sequence obtained by the get_structure_seqs method.

Examples

>>> structure_seq = 'XXXABCDEF'
>>> prop_list = [4, 5, 6, 7, 8, 9]
>>> match_structure_sequence(structure_seq, prop_list)
['X', 'X', 'X', 4, 5, 6, 7, 8, 9]
>>> match_structure_sequence(structure_seq, prop_list, fill_with=float('Inf'))
[inf, inf, inf, 4, 5, 6, 7, 8, 9]
>>> structure_seq = '---ABCDEF---'
>>> prop_list = ('H','H','H','C','C','C')
>>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-')
('-', '-', '-', 'H', 'H', 'H', 'C', 'C', 'C', '-', '-', '-')
>>> structure_seq = 'ABCDEF---'
>>> prop_list = 'HHHCCC'
>>> match_structure_sequence(structure_seq, prop_list, match='-', fill_with='-')
'HHHCCC---'
>>> structure_seq = 'AXBXCXDXEXF'
>>> prop_list = ['H', 'H', 'H', 'C', 'C', 'C']
>>> match_structure_sequence(structure_seq, prop_list, match='X', fill_with='X')
['H', 'X', 'H', 'X', 'H', 'X', 'C', 'X', 'C', 'X', 'C']
Parameters:
  • orig_seq (str, Seq, SeqRecord) – Sequence to match to
  • new_seq (str, tuple, list) – Sequence to fill in
  • match (str) – What to match
  • fill_with – What to fill in when matches are found
  • ignore_excess (bool) – If excess sequence on the tail end of new_seq should be ignored
Returns:

new_seq which will match the length of orig_seq

Return type:

str, tuple, list

ssbio.protein.structure.properties.residues.resname_in_proximity(resname, model, chains, resnums, threshold=5)[source]

Search within the proximity of a defined list of residue numbers and their chains for any specified residue name.

Parameters:
  • resname (str) – Residue name to search for in proximity of specified chains + resnums
  • model – Biopython Model object
  • chains (str, list) – Chain ID or IDs to check
  • resnums (int, list) – Residue numbers within the chain to check
  • threshold (float) – Cutoff in Angstroms for returning True if a RESNAME is near
Returns:

True if a RESNAME is within the threshold cutoff

Return type:

bool

ssbio.protein.structure.properties.residues.search_ss_bonds(model, threshold=3.0)[source]

Searches S-S bonds based on distances between atoms in the structure (first model only). Average distance is 2.05A. Threshold is 3A default. Returns iterator with tuples of residues.

ADAPTED FROM JOAO RODRIGUES’ BIOPYTHON GSOC PROJECT (http://biopython.org/wiki/GSOC2010_Joao)
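
The distance criterion can be sketched without Biopython by working on plain coordinates. The function name, residue labels, and coordinates below are hypothetical; a real search would take the SG atoms of cysteine residues from the first model of the structure.

```python
from itertools import combinations
from math import dist


def find_ss_bonds(sg_atoms, threshold=3.0):
    """Yield pairs of residues whose SG atoms lie closer than the threshold (Angstroms).

    sg_atoms is a list of (residue_label, (x, y, z)) tuples.
    """
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sg_atoms, 2):
        if dist(xyz_a, xyz_b) < threshold:
            yield res_a, res_b


# Placeholder SG coordinates: the first two are 2.05 Angstroms apart
sg_atoms = [
    ("CYS10", (0.0, 0.0, 0.0)),
    ("CYS50", (2.05, 0.0, 0.0)),
    ("CYS90", (10.0, 0.0, 0.0)),
]
bonds = list(find_ss_bonds(sg_atoms))
print(bonds)
```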

ssbio.protein.structure.properties.residues.site_centroid(residues, model)[source]

Get the XYZ coordinate of the center of a list of residues.

Parameters:
  • residues – List of residue numbers
  • model – Biopython Model object
Returns:

(X, Y, Z) coordinate of centroid

Return type:

tuple
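
A centroid is the per-axis mean of a set of coordinates. The sketch below uses plain tuples with placeholder values, and the function name is hypothetical; which atoms of each residue are averaged is left open here, since that detail is not stated above.

```python
from math import dist


def centroid(coords):
    """Return the (X, Y, Z) mean of a list of 3D coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


# Placeholder coordinates for the residues of a site
site_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)]
center = centroid(site_coords)
print(center)

# Distance from another (placeholder) position to the site centroid
print(dist((1.0, 1.0, 4.0), center))
```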

ssbio.protein.sequence.utils

Sequence Alignment
ssbio.protein.sequence.utils.alignment.get_alignment_df(a_aln_seq, b_aln_seq, a_seq_id=None, b_seq_id=None)[source]

Summarize two alignment strings in a dataframe.

Parameters:
  • a_aln_seq (str) – Aligned sequence string
  • b_aln_seq (str) – Aligned sequence string
  • a_seq_id (str) – Optional ID of a_aln_seq
  • b_seq_id (str) – Optional ID of b_aln_seq
Returns:

a per-residue level annotation of the alignment

Return type:

DataFrame
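
A per-residue annotation of two aligned strings could look like the sketch below, which returns a list of dictionaries rather than a DataFrame to stay dependency-free. The column names follow the examples elsewhere in this module's documentation; the function name, the "mutation" label, and the omission of any other categories are assumptions of this example.

```python
def annotate_alignment(a_aln, b_aln, gap="-"):
    """Classify each alignment column as match, mutation, insertion or deletion."""
    if len(a_aln) != len(b_aln):
        raise ValueError("Aligned sequences must have the same length")
    rows = []
    a_pos = b_pos = 0
    for a_aa, b_aa in zip(a_aln, b_aln):
        if a_aa != gap:
            a_pos += 1
        if b_aa != gap:
            b_pos += 1
        if a_aa == gap:
            kind = "insertion"  # residue present only in sequence b
        elif b_aa == gap:
            kind = "deletion"   # residue present only in sequence a
        elif a_aa == b_aa:
            kind = "match"
        else:
            kind = "mutation"
        rows.append({
            "id_a_aa": None if a_aa == gap else a_aa,
            "id_a_pos": None if a_aa == gap else a_pos,
            "id_b_aa": None if b_aa == gap else b_aa,
            "id_b_pos": None if b_aa == gap else b_pos,
            "type": kind,
        })
    return rows


rows = annotate_alignment("MGIT-K", "M-ITAR")
print([row["type"] for row in rows])
```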

ssbio.protein.sequence.utils.alignment.get_alignment_df_from_file(alignment_file, a_seq_id=None, b_seq_id=None)[source]

Get a Pandas DataFrame of the Needle alignment results. Contains all positions of the sequences.

Parameters:
  • alignment_file
  • a_seq_id – Optional specification of the ID of the reference sequence
  • b_seq_id – Optional specification of the ID of the aligned sequence
Returns:

all positions in the alignment

Return type:

Pandas DataFrame

ssbio.protein.sequence.utils.alignment.get_deletions(aln_df)[source]

Get a list of tuples indicating the first and last residues of a deletion region, as well as the length of the deletion.

Examples

# Deletion of residues 1 to 4, length 4
>>> test = {'id_a': {0: 'a', 1: 'a', 2: 'a', 3: 'a'}, 'id_a_aa': {0: 'M', 1: 'G', 2: 'I', 3: 'T'}, 'id_a_pos': {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}, 'id_b': {0: 'b', 1: 'b', 2: 'b', 3: 'b'}, 'id_b_aa': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan}, 'id_b_pos': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan}, 'type': {0: 'deletion', 1: 'deletion', 2: 'deletion', 3: 'deletion'}}
>>> my_alignment = pd.DataFrame.from_dict(test)
>>> get_deletions(my_alignment)
[((1.0, 4.0), 4)]

Parameters:aln_df (DataFrame) – Alignment DataFrame
Returns:A list of tuples with the format ((deletion_start_resnum, deletion_end_resnum), deletion_length)
Return type:list
ssbio.protein.sequence.utils.alignment.get_insertions(aln_df)[source]

Get a list of tuples indicating the first and last residues of an insertion region, as well as the length of the insertion.

If the first tuple is:
  • (-1, 1), the insertion is at the beginning of the original protein
  • (X, Inf), where X is the length of the original protein, the insertion is at the end of the protein

Examples

# Insertion at beginning, length 3
>>> test = {'id_a': {0: 'a', 1: 'a', 2: 'a', 3: 'a'}, 'id_a_aa': {0: np.nan, 1: np.nan, 2: np.nan, 3: 'M'}, 'id_a_pos': {0: np.nan, 1: np.nan, 2: np.nan, 3: 1.0}, 'id_b': {0: 'b', 1: 'b', 2: 'b', 3: 'b'}, 'id_b_aa': {0: 'M', 1: 'M', 2: 'L', 3: 'M'}, 'id_b_pos': {0: 1, 1: 2, 2: 3, 3: 4}, 'type': {0: 'insertion', 1: 'insertion', 2: 'insertion', 3: 'match'}}
>>> my_alignment = pd.DataFrame.from_dict(test)
>>> get_insertions(my_alignment)
[((-1, 1.0), 3)]

Parameters: aln_df (DataFrame) – Alignment DataFrame
Returns: A list of tuples with the format ((insertion_start_resnum, insertion_end_resnum), insertion_length)
Return type: list
ssbio.protein.sequence.utils.alignment.get_mutations(aln_df)[source]

Get a list of residue numbers (in the original sequence’s numbering) that are mutated

Parameters:
  • aln_df (DataFrame) – Alignment DataFrame
  • just_resnums (bool) – If True, return only the residue numbers instead of a list of tuples of (original_residue, resnum, mutated_residue)
Returns:

Residue mutations

Return type:

list
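
As an illustration of the two return formats described above, here is a minimal sketch that works on two aligned strings rather than on the alignment DataFrame. The helper name find_mutations and its gap argument are hypothetical and are not part of ssbio; residue numbers are counted in the original (first) sequence and gap columns are skipped.

```python
def find_mutations(a_aln, b_aln, gap='-', just_resnums=False):
    """List substitutions between two aligned strings.

    Residue numbers follow the first (original) sequence, starting at 1.
    Columns with a gap in either sequence are not counted as mutations.
    """
    mutations = []
    resnum = 0
    for a_char, b_char in zip(a_aln, b_aln):
        if a_char == gap:
            continue  # insertion relative to the original: no original residue here
        resnum += 1
        if b_char != gap and a_char != b_char:
            mutations.append((a_char, resnum, b_char))
    if just_resnums:
        return [num for _, num, _ in mutations]
    return mutations


print(find_mutations('MGIT', 'MAIT'))
print(find_mutations('MGIT', 'MAIT', just_resnums=True))
```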

ssbio.protein.sequence.utils.alignment.get_percent_identity(a_aln_seq, b_aln_seq)[source]

Get the percent identity between two alignment strings
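
The calculation can be sketched as follows. This is a minimal example, not the ssbio implementation: the helper name percent_identity is hypothetical, and both the denominator (the full alignment length) and the 0 to 100 scaling are assumptions, since the description above does not specify them.

```python
def percent_identity(a_aln_seq, b_aln_seq, gap='-'):
    """Percent of alignment columns where both sequences carry the same residue.

    Assumption: the denominator is the full alignment length, and columns
    where both strings hold a gap never count as identical.
    """
    if len(a_aln_seq) != len(b_aln_seq):
        raise ValueError('Aligned sequences must have the same length')
    if not a_aln_seq:
        return 0.0
    matches = sum(1 for a, b in zip(a_aln_seq, b_aln_seq) if a == b and a != gap)
    return 100.0 * matches / len(a_aln_seq)


print(percent_identity('--ABCDEF', 'XXABCDEF'))  # 6 identical columns out of 8
```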

ssbio.protein.sequence.utils.alignment.get_unresolved(aln_df)[source]

Get a list of residue numbers (in the original sequence’s numbering) that are unresolved

Parameters: aln_df (DataFrame) – Alignment DataFrame
Returns: Residue numbers that are unresolved
Return type: list
ssbio.protein.sequence.utils.alignment.map_resnum_a_to_resnum_b(a_resnum, a_aln, b_aln)[source]

Map a residue number in a sequence to the corresponding residue number in an aligned sequence.

Examples

>>> map_resnum_a_to_resnum_b(5, '--ABCDEF', 'XXABCDEF')
7

Parameters:
  • a_resnum (int) – Residue number in the first aligned sequence
  • a_aln (str, Seq, SeqRecord) – Aligned sequence string
  • b_aln (str, Seq, SeqRecord) – Aligned sequence string
Returns:

Residue number in the second aligned sequence

Return type:

int
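
The mapping can be sketched by walking the alignment columns and counting non-gap characters in each sequence. The helper below is a hypothetical illustration under that assumption, not the ssbio source; returning None when the residue aligns to a gap or does not exist is also an assumption.

```python
def map_resnum(a_resnum, a_aln, b_aln, gap='-'):
    """Map a 1-based residue number in a_aln to the residue number in b_aln."""
    a_count = 0  # residues of sequence a seen so far
    b_count = 0  # residues of sequence b seen so far
    for a_char, b_char in zip(a_aln, b_aln):
        if a_char != gap:
            a_count += 1
        if b_char != gap:
            b_count += 1
        if a_char != gap and a_count == a_resnum:
            # Found the column holding the requested residue of a
            return b_count if b_char != gap else None
    return None  # a_resnum is beyond the end of sequence a


print(map_resnum(5, '--ABCDEF', 'XXABCDEF'))  # prints 7
```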

ssbio.protein.sequence.utils.alignment.needle_statistics(infile)[source]

Read a needle alignment file and return statistics of the alignment.

Parameters: infile (str) – Alignment file name
Returns: alignment_properties, a dictionary containing the number of gaps, identity, etc.
Return type: dict
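
To show what such statistics look like, here is a minimal sketch that parses the summary comment lines of a needle report. It assumes the usual EMBOSS srspair header layout ("# Identity: n/m (p%)" and similar lines); the function name parse_needle_header and the dictionary keys are hypothetical and may differ from what needle_statistics returns.

```python
import re

# Matches lines such as "# Identity:      65/149 (43.6%)"
STAT_LINE = re.compile(r'^#\s*(Identity|Similarity|Gaps):\s*(\d+)/(\d+)\s*\(\s*([\d.]+)%\)')
# Matches lines such as "# Score: 292.5"
SCORE_LINE = re.compile(r'^#\s*Score:\s*([-\d.]+)')


def parse_needle_header(text):
    """Collect counts, percentages, and score from needle header lines."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            key = match.group(1).lower()
            stats[key] = int(match.group(2))
            stats['length'] = int(match.group(3))
            stats['percent_' + key] = float(match.group(4))
            continue
        match = SCORE_LINE.match(line)
        if match:
            stats['score'] = float(match.group(1))
    return stats


sample = """
# Length: 8
# Identity:       6/8 (75.0%)
# Similarity:     6/8 (75.0%)
# Gaps:           2/8 (25.0%)
# Score: 28.0
"""
print(parse_needle_header(sample))
```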
ssbio.protein.sequence.utils.alignment.pairwise_sequence_alignment(a_seq, b_seq, engine, a_seq_id=None, b_seq_id=None, gapopen=10, gapextend=0.5, outfile=None, outdir=None, force_rerun=False)[source]

Run a global pairwise sequence alignment between two sequence strings.

Parameters:
  • a_seq (str, Seq, SeqRecord, SeqProp) – Reference sequence
  • b_seq (str, Seq, SeqRecord, SeqProp) – Sequence to be aligned to reference
  • engine (str) – biopython or needle - which pairwise alignment program to use
  • a_seq_id (str) – Reference sequence ID. If not set, is “a_seq”
  • b_seq_id (str) – Sequence to be aligned ID. If not set, is “b_seq”
  • gapopen (int) – Only for needle - Gap open penalty is the score taken away when a gap is created
  • gapextend (float) – Only for needle - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outfile (str) – Only for needle - name of output file. If not set, is {id_a}_{id_b}_align.txt
  • outdir (str) – Only for needle - Path to output directory. Default is the current directory.
  • force_rerun (bool) – Only for needle - Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:

Biopython object to represent an alignment

Return type:

MultipleSeqAlignment

ssbio.protein.sequence.utils.alignment.run_needle_alignment(seq_a, seq_b, gapopen=10, gapextend=0.5, outdir=None, outfile=None, force_rerun=False)[source]

Run the needle alignment program for two strings and return the raw alignment result.

More info:
  • EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle
  • Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84
  • Using strings as input: https://www.biostars.org/p/91124/

Parameters:
  • id_a – ID of reference sequence
  • seq_a (str, Seq, SeqRecord) – Reference sequence
  • id_b – ID of sequence to be aligned
  • seq_b (str, Seq, SeqRecord) – String representation of sequence to be aligned
  • gapopen – Gap open penalty is the score taken away when a gap is created
  • gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outdir (str, optional) – Path to output directory. Default is the current directory.
  • outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt
  • force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:

Raw alignment result of the needle alignment in srspair format.

Return type:

str

ssbio.protein.sequence.utils.alignment.run_needle_alignment_on_files(id_a, faa_a, id_b, faa_b, gapopen=10, gapextend=0.5, outdir='', outfile='', force_rerun=False)[source]

Run the needle alignment program for two fasta files and return the raw alignment result.

More info:
  • EMBOSS needle: http://www.bioinformatics.nl/cgi-bin/emboss/help/needle
  • Biopython wrapper: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc84

Parameters:
  • id_a – ID of reference sequence
  • faa_a – File path to reference sequence
  • id_b – ID of sequence to be aligned
  • faa_b – File path to sequence to be aligned
  • gapopen – Gap open penalty is the score taken away when a gap is created
  • gapextend – Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
  • outdir (str, optional) – Path to output directory. Default is the current directory.
  • outfile (str, optional) – Name of output file. If not set, is {id_a}_{id_b}_align.txt
  • force_rerun (bool) – Default False, set to True if you want to rerun the alignment if outfile exists.
Returns:

Raw alignment result of the needle alignment in srspair format.

Return type:

str

Sequence BLAST
ssbio.protein.sequence.utils.blast.calculate_bbh(blast_results_1, blast_results_2, r_name=None, g_name=None, outdir='')[source]

Calculate the best bidirectional BLAST hits (BBH) and save a dataframe of results.

Parameters:
  • blast_results_1 (str) – BLAST results for reference vs. other genome
  • blast_results_2 (str) – BLAST results for other vs. reference genome
  • r_name – Name of reference genome
  • g_name – Name of other genome
  • outdir – Directory where BLAST results are stored.
Returns:

Path to Pandas DataFrame of the BBH results.
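
The idea behind a best bidirectional hit can be sketched without BLAST itself: a pair of genes is kept only when each gene is the other's top hit. The example below is a simplified illustration, not the ssbio implementation. The helper names are hypothetical, hits are assumed to be (query, subject, bitscore) tuples, and ranking by bitscore alone is an assumption.

```python
def best_hits(hits):
    """Return {query: subject} keeping the highest-bitscore hit per query."""
    best = {}
    for query, subject, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(ref_vs_other, other_vs_ref):
    """Keep reference genes whose best hit points back to them."""
    forward = best_hits(ref_vs_other)
    reverse = best_hits(other_vs_ref)
    return {ref: other for ref, other in forward.items() if reverse.get(other) == ref}


ref_vs_other = [('r1', 'o1', 200), ('r1', 'o2', 50), ('r2', 'o2', 80)]
other_vs_ref = [('o1', 'r1', 190), ('o2', 'r1', 60), ('o2', 'r2', 40)]
# r2's best hit is o2, but o2's best hit is r1, so only r1 <-> o1 survives
print(bidirectional_best_hits(ref_vs_other, other_vs_ref))
```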

ssbio.protein.sequence.utils.blast.create_orthology_matrix(r_name, genome_to_bbh_files, pid_cutoff=None, bitscore_cutoff=None, evalue_cutoff=None, filter_condition='OR', outname='', outdir='', force_rerun=False)[source]

Create an orthology matrix using best bidirectional BLAST hits (BBH) outputs.

Parameters:
  • r_name (str) – Name of the reference genome
  • genome_to_bbh_files (dict) – Mapping of genome names to the BBH csv output from the calculate_bbh() method
  • pid_cutoff (float) – Minimum percent identity between BLAST hits to filter for in the range [0, 100]
  • bitscore_cutoff (float) – Minimum bitscore allowed between BLAST hits
  • evalue_cutoff (float) – Maximum E-value allowed between BLAST hits
  • filter_condition (str) – How to combine the cutoff filters, ‘OR’ or ‘AND’. ‘OR’ is less stringent and gives more results, since a hit is kept if it passes any one cutoff (for example, >80% PID or >30 bitscore or <0.0001 evalue).
  • outname – Name of output file of orthology matrix
  • outdir – Path to output directory
  • force_rerun (bool) – Force recreation of the orthology matrix even if the outfile exists
Returns:

Path to orthologous genes matrix.

Return type:

str
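
The difference between the two filter_condition settings can be sketched as below. This is a hedged illustration rather than the ssbio code: passes_cutoffs is a hypothetical helper, the hit is assumed to be a dict with 'pid', 'bitscore', and 'evalue' keys, and the use of inclusive comparisons is an assumption.

```python
def passes_cutoffs(hit, pid_cutoff=None, bitscore_cutoff=None,
                   evalue_cutoff=None, filter_condition='OR'):
    """Decide whether a hit survives the cutoffs that were actually set."""
    checks = []
    if pid_cutoff is not None:
        checks.append(hit['pid'] >= pid_cutoff)
    if bitscore_cutoff is not None:
        checks.append(hit['bitscore'] >= bitscore_cutoff)
    if evalue_cutoff is not None:
        checks.append(hit['evalue'] <= evalue_cutoff)

    if not checks:
        return True  # no cutoffs given, so nothing is filtered out
    if filter_condition == 'OR':
        return any(checks)
    if filter_condition == 'AND':
        return all(checks)
    raise ValueError("filter_condition must be 'OR' or 'AND'")


hit = {'pid': 85.0, 'bitscore': 25.0, 'evalue': 1e-3}
print(passes_cutoffs(hit, 80, 30, 1e-4, 'OR'))   # only the PID cutoff is met
print(passes_cutoffs(hit, 80, 30, 1e-4, 'AND'))  # bitscore and E-value fail
```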

ssbio.protein.sequence.utils.blast.print_run_bidirectional_blast(reference, other_genome, dbtype, outdir)[source]

Write Torque submission files for running bidirectional BLAST on a server and print the execution command.

Parameters:
  • reference (str) – Path to “reference” genome, aka your “base strain”
  • other_genome (str) – Path to other genome which will be BLASTed to the reference
  • dbtype (str) – “nucl” or “prot” - what format your genome files are in
  • outdir (str) – Path to folder where Torque scripts should be placed
ssbio.protein.sequence.utils.blast.run_bidirectional_blast(reference, other_genome, dbtype, outdir='')[source]

BLAST a genome against another, and vice versa.

This function requires BLAST to be installed. On Debian or Ubuntu, install it by running: sudo apt install ncbi-blast+

Parameters:
  • reference (str) – path to “reference” genome, aka your “base strain”
  • other_genome (str) – path to other genome which will be BLASTed to the reference
  • dbtype (str) – “nucl” or “prot” - what format your genome files are in
  • outdir (str) – path to folder where BLAST outputs should be placed
Returns:

Paths to BLAST output files. (reference_vs_othergenome.out, othergenome_vs_reference.out)
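
For orientation, the BLAST+ command lines behind a bidirectional run can be sketched as follows. This example only builds the argument lists and does not execute anything. The function name, the output file naming, and the choice of tabular output (-outfmt 6) are assumptions for illustration; the options ssbio actually passes may differ.

```python
import os


def bidirectional_blast_commands(reference, other_genome, dbtype, outdir=''):
    """Build makeblastdb and BLAST argument lists for both directions."""
    programs = {'nucl': 'blastn', 'prot': 'blastp'}
    if dbtype not in programs:
        raise ValueError('dbtype must be "nucl" or "prot"')
    blast = programs[dbtype]

    def stem(path):
        # File name without directory or extension, used to name outputs
        return os.path.splitext(os.path.basename(path))[0]

    ref_out = os.path.join(outdir, '{}_vs_{}.out'.format(stem(reference), stem(other_genome)))
    other_out = os.path.join(outdir, '{}_vs_{}.out'.format(stem(other_genome), stem(reference)))

    return [
        # Without -out, makeblastdb names the database after the input file
        ['makeblastdb', '-in', reference, '-dbtype', dbtype],
        ['makeblastdb', '-in', other_genome, '-dbtype', dbtype],
        [blast, '-query', reference, '-db', other_genome, '-out', ref_out, '-outfmt', '6'],
        [blast, '-query', other_genome, '-db', reference, '-out', other_out, '-outfmt', '6'],
    ]


for command in bidirectional_blast_commands('ref.faa', 'other.faa', 'prot'):
    print(' '.join(command))
```

Each list could be passed to subprocess.run once BLAST+ is installed and on the PATH.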

ssbio.protein.sequence.utils.blast.run_makeblastdb(infile, dbtype, outdir='')[source]

Make the BLAST database for a genome file.

Parameters:
  • infile (str) – path to genome FASTA file
  • dbtype (str) – “nucl” or “prot” - what format your genome files are in
  • outdir (str) – path to directory to output database files (default is original folder)
Returns:

Paths to BLAST databases.

ssbio.protein.sequence.properties

Thermostability
This module provides functions to predict thermostability parameters (specifically the free energy of unfolding dG)
of an amino acid sequence.

These methods are adapted from:

Oobatake, M., & Ooi, T. (1993). ‘Hydration and heat stability effects on protein unfolding’,
Progress in biophysics and molecular biology, 59/3: 237–84.
Dill, K. A., Ghosh, K., & Schmit, J. D. (2011). ‘Physical limits of cells and proteomes’,
Proceedings of the National Academy of Sciences of the United States of America, 108/44: 17876–82. DOI: 10.1073/pnas.1114477108

For an example of usage of these parameters in a genome-scale model:

Chen, K., Gao, Y., Mih, N., O’Brien, E., Yang, L., Palsson, B.O. (2017).
‘Thermo-sensitivity of growth is determined by chaperone-mediated proteome re-allocation.’, Submitted to PNAS.
ssbio.protein.sequence.properties.thermostability.calculate_dill_dG(seq_len, temp)[source]

Get free energy of unfolding (dG) using Dill method in units J/mol.

Parameters:
  • seq_len (int) – Length of amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

Free energy of unfolding dG (J/mol)

Return type:

float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dG(seq, temp)[source]

Get free energy of unfolding (dG) using Oobatake method in units cal/mol.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

Free energy of unfolding dG (cal/mol)

Return type:

float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dH(seq, temp)[source]

Get dH using Oobatake method in units cal/mol.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

dH in units cal/mol

Return type:

float

ssbio.protein.sequence.properties.thermostability.calculate_oobatake_dS(seq, temp)[source]

Get dS using Oobatake method in units cal/(mol·K).

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

dS in units cal/(mol·K)

Return type:

float

ssbio.protein.sequence.properties.thermostability.get_dG_at_T(seq, temp)[source]

Predict dG at temperature T, using best predictions from Dill or Oobatake methods.

Parameters:
  • seq (str, Seq, SeqRecord) – Amino acid sequence
  • temp (float) – Temperature in degrees C
Returns:

tuple containing:

  • dG (float): Free energy of unfolding dG (cal/mol)
  • keq (float): Equilibrium constant Keq
  • method (str): Method used to calculate dG

Return type:

(tuple)
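
The relation between dG and Keq, and the unit conversion between the J/mol and cal/mol values mentioned above, can be sketched with standard thermodynamics. This is an illustrative example, not the ssbio source: the helper names are hypothetical, the thermochemical calorie (4.184 J) is assumed, and Keq is taken as the unfolded-to-folded ratio, so that Keq = exp(-dG / (R·T)).

```python
import math

R_CAL = 1.987      # gas constant in cal/(mol*K)
J_PER_CAL = 4.184  # thermochemical calorie, an assumption for this sketch


def joules_to_cal(value_j):
    """Convert an energy in J/mol to cal/mol."""
    return value_j / J_PER_CAL


def keq_from_dG(dG_cal, temp_c):
    """Equilibrium constant for unfolding from dG (cal/mol) at temp_c (degrees C)."""
    temp_k = temp_c + 273.15
    return math.exp(-dG_cal / (R_CAL * temp_k))


# A positive dG of unfolding gives Keq < 1, meaning the folded state is favored
print(keq_from_dG(5000.0, 37.0))
print(joules_to_cal(4184.0))
```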

Indices and tables