Welcome to GOcats’ documentation!
GOcats
GOcats is an Open Biomedical Ontology (OBO) parser and categorizing utility, currently specialized for the Gene Ontology (GO), which can help scientists interpret large-scale experimental results by organizing redundant and highly specific annotations into customizable, biologically relevant concept categories. Concept subgraphs are defined by lists of keywords created by the user.
Currently, the GOcats package can be used to:
- Create subgraphs of GO which each represent a user-specified concept.
- Map specific, or fine-grained, GO terms in a Gene Annotation File (GAF) to an arbitrary number of concept categories.
- Remap ancestor Gene Ontology term relationships and the gene annotations with a set of user-defined relationships.
- Explore the Gene Ontology graph within a Python interpreter.
Citation
Please cite the following papers when using GOcats:
Hinderer EW, Moseley HNB. GOcats: A tool for categorizing Gene Ontology into subgraphs of user-defined concepts. PLoS One. 2020;15(6):1-29.
Hinderer EW, Flight RM, Dubey R, Macleod JN, Moseley HNB. Advances in Gene Ontology utilization improve statistical power of annotation enrichment. PLoS One. 2019;14(8):1-20.
Installation
GOcats runs under Python 3.4+ and is available through pip (python3-pip). Install it via pip, or clone the git repository and install the dependencies listed below.
Install on Linux
Pip installation
Dependencies should be installed automatically with this method, which is the strongly recommended way to install GOcats.
pip3 install gocats
GitHub Package installation
Make sure you have git installed:
cd ~/
git clone https://github.com/MoseleyBioinformaticsLab/GOcats.git
Dependencies
GOcats requires the following Python libraries:
- docopt for creating the gocats command-line interface.
- JSONPickle for saving Python objects in a JSON serializable form and outputting to a file.
To install dependencies manually:
pip3 install docopt
pip3 install jsonpickle
Install on Windows
GOcats can also be installed on Windows through pip.
Quickstart
For instructions on how to format your keyword list and advanced argument usage, consult the tutorial, guide, and API documentation at readthedocs.
Subgraphs can be created from the command line.
python3 -m gocats create_subgraphs /path_to_ontology_file ~/GOcats/gocats/exampledata/examplecategories.csv ~/Output --supergraph_namespace=cellular_component --subgraph_namespace=cellular_component --output_termlist
Mapping files can be found in the output directory:
- GC_content_mapping.json_pickle # A python dictionary with category-defining GO terms as keys and a list of all subgraph contents as values.
- GC_id_mapping.json_pickle # A python dictionary with every GO term of the specified namespace as keys and a list of category root terms as values.
GAF mappings can also be made from the command line:
python3 -m gocats categorize_dataset YOUR_GAF.goa YOUR_OUTPUT_DIRECTORY/GC_id_mapping.json_pickle YOUR_OUTPUT_DIRECTORY MAPPED_DATASET_NAME.goa
Gene-to-GO-term remappings that take has_part relationships into account can be created from the command line:
python3 -m gocats remap_goterms /path_to_ontology_file.obo /path_to_gaf.goa ancestors_output.json namespace_output.json --allowed_relationships=is_a,part_of,has_part --identifier_column=1
The gene-to-GO-term mappings are written in JSON format to ancestors_output.json, and the new GO-term-to-namespace mappings to namespace_output.json.
License
Made available under the terms of The Clear BSD License. See full license in LICENSE.
The Clear BSD License
Copyright (c) 2017, Eugene W. Hinderer III, Hunter N.B. Moseley All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted (subject to the limitations in the disclaimer below) provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY’S PATENT RIGHTS ARE GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Authors
- Eugene W. Hinderer III - ehinderer
- Hunter N.B. Moseley - hunter-moseley
The GOcats API Reference
The following are located in /GOcats/gocats.
The Gene Ontology Categories Suite (GOcats)
This module provides methods for the creation of directed acyclic concept subgraphs of Gene Ontology, along with methods for evaluating those subgraphs.
gocats.gocats.build_graph(args)
Not yet implemented. Try build_graph_interpreter to create a GO graph object to explore within a Python interpreter.
gocats.gocats.build_graph_interpreter(database_file, supergraph_namespace=None, allowed_relationships=None, relationship_directionality='gocats')
Creates a graph object of GO, which can be traversed and queried within a Python interpreter.
Parameters:
- database_file (file_handle) – Ontology database file.
- supergraph_namespace (str) – Optional - Filter graph to a sub-ontology namespace.
- allowed_relationships (list) – Optional - Filter graph to use only those relationships listed.
- relationship_directionality – Optional - Any string other than ‘gocats’ will retain all original GO relationship directionalities. Defaults to reversing the has_part direction.
Returns: A Graph object of the ontology provided.
Return type: class
gocats.gocats.categorize_dataset(dataset_file, term_mapping, output_directory, mapped_dataset_filename, dataset_type='GAF', entity_col=0, go_col=1, retain_unmapped_annotations=False)
Reads in a Gene Annotation File (GAF) and maps the annotations contained therein to the categories organized by GOcats or other methods. Outputs a mapped GAF and a list of unmapped genes in the specified output directory.
Parameters:
- dataset_file – A file containing gene annotations.
- term_mapping – A dictionary mapping category-defining ontology terms to their subgraph children terms. May be produced by GOcats or another method.
- output_directory – The directory where the output file will be stored.
- mapped_dataset_filename – The desired name of the mapped GAF.
- dataset_type – Enter file type for dataset [GAF|TSV|CSV]. Defaults to “GAF”.
- entity_col – If CSV or TSV file type, indicate which column lists the entity IDs. Defaults to 0.
- go_col – If CSV or TSV file type, indicate which column lists the GO IDs. Defaults to 1.
- retain_unmapped_annotations – If specified, annotations that are not mapped to a concept are copied into the mapped dataset output file with their original annotations.
Returns: None
gocats.gocats.create_subgraphs(database_file, keyword_file, output_directory, supergraph_namespace=None, subgraph_namespace=None, supergraph_relationships=['is_a', 'part_of', 'has_part'], subgraph_relationships=['is_a', 'part_of', 'has_part'], map_supersets=False, output_termlist=False, go_basic_scoping=False, network_table_name=None, test=False)
Creates a graph object of an ontology, processed into gocats.dag.OboGraph or into an object that inherits from gocats.dag.OboGraph, and then extracts subgraphs which represent concepts that are defined by a list of provided keywords. Each subgraph is processed into gocats.subdag.SubGraph.
Parameters:
- database_file – Ontology database file.
- keyword_file – A CSV file with two columns: column 1 naming categories, and column 2 listing search strings (no quotation marks, separated by semicolons).
- output_directory – The directory where results are stored.
- supergraph_namespace – a supergraph sub-ontology to filter e.g. cellular_component, optional
- subgraph_namespace – a subgraph sub-ontology to filter e.g. cellular_component, optional
- supergraph_relationships – a list of relationships to limit in the supergraph e.g. [‘is_a’, ‘part_of’], optional
- subgraph_relationships – a list of relationships to limit in subgraphs e.g. [‘is_a’, ‘part_of’], optional
- map_supersets – whether to allow subgraphs to subsume other subgraphs, logical, optional
- output_termlist – whether to create a translation of ontology terms to their names to improve interpretability of dev test results, logical, optional
- go_basic_scoping – whether to create a GO graph similar to go-basic with only scoping-type relationships (is_a and part_of), logical, optional
- network_table_name – an optional name for the network table produced from the subgraphs (defaults to NetworkTable.csv)
Returns: None
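The keyword_file layout described above (column 1 naming a category, column 2 holding semicolon-separated search strings) can be illustrated with a small reader. This is a minimal sketch and not the parser GOcats itself uses: the function name read_keyword_file and the sample rows are hypothetical.

```python
import csv
import io


def read_keyword_file(handle):
    """Return {category name: [search strings]} from a two-column CSV handle.

    Hypothetical helper for illustration; GOcats' own reader may differ.
    """
    categories = {}
    for row in csv.reader(handle):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed rows
        # Column 2 holds search strings separated by semicolons.
        keywords = [k.strip() for k in row[1].split(";") if k.strip()]
        categories[row[0].strip()] = keywords
    return categories


# Illustrative file contents (assumed, not taken from examplecategories.csv).
SAMPLE_KEYWORDS = (
    "mitochondria,mitochondria;mitochondrial;mitochondrion\n"
    "nucleus,nucleus;nuclei;nuclear\n"
)
categories = read_keyword_file(io.StringIO(SAMPLE_KEYWORDS))
```

Each key of the resulting dictionary corresponds to one concept subgraph that would be requested, and each value to the search strings that seed it.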
gocats.gocats.find_category_subsets(subgraph_collection)
Finds subgraphs which are subsets of other subgraphs to remove redundancy, when specified.
Parameters: subgraph_collection – A dictionary of subgraph objects (keys: subgraph name, values: subgraph object).
Returns: A dictionary relating which subgraph objects are subsets of other subgraphs (keys: subset subgraph, values: superset subgraphs).
Return type: dict
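The subset relation described above can be sketched with plain Python sets. This is only an illustration under stated assumptions: each subgraph is reduced to a set of term IDs, the helper name find_subsets is hypothetical, and a proper-subset test is used so that two identical subgraphs do not flag each other.

```python
def find_subsets(contents):
    """Map each subgraph name to the names of subgraphs that strictly contain it.

    contents: {subgraph name: set of term IDs} (a simplification of the
    subgraph objects that find_category_subsets receives).
    """
    subsets = {}
    for name, terms in contents.items():
        supersets = sorted(
            other
            for other, other_terms in contents.items()
            if other != name and terms < other_terms  # proper subset only
        )
        if supersets:
            subsets[name] = supersets
    return subsets


# Placeholder term IDs for illustration.
example_contents = {
    "nucleus": {"T1", "T2", "T3"},
    "nucleolus": {"T2"},
    "membrane": {"T4"},
}
example_subsets = find_subsets(example_contents)
```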
gocats.gocats.json_format_graph(graph_object, graph_identifier)
Creates a dictionary representing the edges in the graph and formats it in such a way that it can be encoded into JSON for comparing the graph objects between versions of GOcats.
gocats.gocats.remap_goterms(go_database, goa_gaf, ancestor_filename, namespace_filename, allowed_relationships, identifier_column)
Reads in a Gene Ontology relationship file and a Gene Annotation File (GAF), and follows the GOcats rules for allowed term-to-term relationships. Generates as output a new GAF and a new term-to-ontology-namespace mapping.
Parameters:
- go_database – the gene ontology dataset
- goa_gaf – the gene annotation file
- ancestor_filename – the output file containing new gene to ontology mappings
- namespace_filename – the output file containing the term to ontology mappings
- allowed_relationships – what term to term relationships will be considered (is_a,part_of,has_part)
- identifier_column – which column is being used for the gene identifiers (1)
Returns: None
Directed Acyclic Graph (DAG)
Contains necessary objects for creating a Directed Acyclic Graph (DAG) object to represent Open Biomedical Ontologies (OBO).
class gocats.dag.OboGraph(namespace_filter=None, allowed_relationships=None)
A pythonic graph of a generic Open Biomedical Ontology (OBO) directed acyclic graph (DAG).
__init__(namespace_filter=None, allowed_relationships=None)
OboGraph initializer. Leave namespace_filter and allowed_relationships as None to create the entire ontology graph. Otherwise, provide filters to limit what information is pulled into the graph.
orphans
Property defining a set of nodes in the graph which have no parents. When the graph is modified, calls _update_graph() to repopulate the sets of orphan and leaf nodes.
Returns: Set of ‘orphan’ gocats.dag.AbstractNode objects.
Return type: set
leaves
Property defining a set of nodes in the graph which have no children. When the graph is modified, calls _update_graph() to repopulate the sets of orphan and leaf nodes.
Returns: Set of ‘leaf’ gocats.dag.AbstractNode objects.
Return type: set
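The orphan and leaf definitions above can be illustrated on a bare edge list. This is a sketch under assumptions, not the OboGraph implementation: nodes are plain strings, edges are (child, parent) pairs, and the function name orphans_and_leaves is hypothetical.

```python
def orphans_and_leaves(nodes, edges):
    """Return (orphans, leaves) for a graph given as (child, parent) edges.

    Orphans are nodes with no parents; leaves are nodes with no children.
    """
    has_parent = {child for child, _parent in edges}
    has_child = {parent for _child, parent in edges}
    all_nodes = set(nodes)
    return all_nodes - has_parent, all_nodes - has_child


# A small illustrative DAG: B and C are children of A, D is a child of B.
example_nodes = ["A", "B", "C", "D"]
example_edges = [("B", "A"), ("C", "A"), ("D", "B")]
example_orphans, example_leaves = orphans_and_leaves(example_nodes, example_edges)
```

In this toy graph the single orphan is the root, which is what one would expect for a well-formed ontology namespace.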
valid_node(node)
Defines condition of a valid node. Node is valid if it is not obsolete and is contained within the given ontology namespace constraint.
Parameters: node – A gocats.dag.AbstractNode object
Returns: True if node is valid, False otherwise
Return type: True or False
valid_edge(edge)
Defines condition of a valid edge. Edge is valid if it is within the list of allowed edges and connects two nodes that are both contained in the graph in question.
Parameters: edge – A gocats.dag.AbstractEdge object
Returns: True if edge is valid, False otherwise
Return type: True or False
add_node(node)
Adds a node object to the graph and adds an object pointer to the vocabulary index, so that every word in the node name and definition references the node. Sets modification state to True.
Parameters: node – A gocats.dag.AbstractNode object.
Returns: None
Return type: None
remove_node(node)
Removes a node from the graph and deletes node references from all entries in the vocabulary index. Sets modification state to True.
Parameters: node – A gocats.dag.AbstractNode object.
Returns: None
Return type: None
add_edge(edge)
Adds an edge object to the graph, and counts the edge relationship type. Sets modification state to True.
Parameters: edge – A gocats.dag.AbstractEdge object.
Returns: None
Return type: None
remove_edge(edge)
Removes an edge object from the graph, and removes references to that edge from the node objects involved. Sets modification state to True.
Parameters: edge – A gocats.dag.AbstractEdge object.
Returns: None
Return type: None
add_relationship(relationship)
Adds a gocats.dag.AbstractRelationship object to the graph’s relationship index, referenced by that relationship’s ID. Sets modification state to True.
Parameters: relationship – A gocats.dag.AbstractRelationship object.
Returns: None
Return type: None
instantiate_valid_edges()
Adds all edge references to their respective nodes and vice versa if both nodes of the edge are in the graph. This is carried out by AbstractEdge.connect_nodes(). Also adds a gocats.dag.AbstractRelationship object reference to each edge. If both nodes are not in the graph, the edge is deleted from the graph. Sets modification state to True.
Returns: None
Return type: None
node_depth(sample_node)
Returns an integer representing how many nodes are between the given node and the root node of the graph (depth level).
Parameters: sample_node – A gocats.dag.AbstractNode object.
Returns: Depth level.
Return type: int
filter_nodes(search_string_list)
Returns a list of node objects that contain vocabulary matching the keywords provided in the search string list. Nodes are selected by searching through the vocabulary index.
Parameters: search_string_list – A list of search strings provided in the keyword_file provided to gocats.gocats.create_subgraphs().
Returns: A list of gocats.dag.AbstractNode objects.
Return type: list
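The vocabulary index that add_node populates and filter_nodes searches can be pictured as a word-to-nodes dictionary. The class below is a minimal sketch of that idea with assumed names (TinyIndex and its tokenizing rule are hypothetical); it only handles single-word search strings, and the real index may tokenize differently.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy vocabulary index: each word points at the node IDs that use it."""

    def __init__(self):
        self.vocab = defaultdict(set)

    def add_node(self, node_id, name, definition):
        # Index every lowercase word found in the name and the definition.
        for word in re.findall(r"[a-z0-9_]+", f"{name} {definition}".lower()):
            self.vocab[word].add(node_id)

    def filter_nodes(self, search_string_list):
        # Collect the nodes referenced by any of the search strings.
        hits = set()
        for search_string in search_string_list:
            hits |= self.vocab.get(search_string.lower(), set())
        return sorted(hits)


# Placeholder IDs and text for illustration.
index = TinyIndex()
index.add_node("N1", "mitochondrial membrane", "The membrane surrounding a mitochondrion.")
index.add_node("N2", "nucleus", "A membrane-bounded organelle.")
```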
filter_edges(filtered_nodes)
Returns a list of edges in the graph that connect the nodes provided in the filtered nodes list.
Parameters: filtered_nodes – List of filtered nodes provided by filter_nodes().
Returns: A list of gocats.dag.AbstractEdge objects.
Return type: list
nodes_between(start_node, end_node)
Returns a set of nodes that occur along all paths between the start node and the end node. If no paths exist, an empty set is returned.
Parameters:
- start_node – gocats.dag.AbstractNode object to start the paths.
- end_node – gocats.dag.AbstractNode object to end the paths.
Returns: A set of gocats.dag.AbstractNode objects if there is at least one path between the parameters, an empty set otherwise.
Return type: set
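One way to picture the behavior described for nodes_between is a depth-first search that marks every node from which the end node is reachable. The sketch below assumes paths run from child to parent, that both endpoints are included in the result, and that the graph is given as a plain {node: set of parents} dictionary; none of these details are confirmed by the reference above, and the function is not the OboGraph method.

```python
def nodes_between(parents, start, end):
    """Return every node on any child-to-parent path from start to end.

    parents: {node: set of parent nodes}. Returns an empty set if no path exists.
    """
    reaches_end = {}

    def visit(node):
        if node in reaches_end:
            return reaches_end[node]
        if node == end:
            reaches_end[node] = True
            return True
        reaches_end[node] = False  # provisional value; the graph is acyclic
        found = False
        for parent in parents.get(node, ()):
            # Visit every parent (no short-circuit) so all paths are explored.
            if visit(parent):
                found = True
        reaches_end[node] = found
        return found

    visit(start)
    return {node for node, ok in reaches_end.items() if ok}


# Illustrative DAG: E -> D -> {B, C} -> A, plus an unrelated branch X -> A.
example_parents = {
    "A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}, "E": {"D"}, "X": {"A"},
}
```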
__weakref__
list of weak references to the object (if defined)
class gocats.dag.AbstractNode
A node containing all basic properties of an OBO node. The parsing object, gocats.ontologyparser.OboParser, currently has direct access to data members (id, name, definition, namespace, edges, and obsolete) so that information from the database file can be added to the object.
descendants
Property defining a set of nodes in the graph that are recursively reverse of a node with a scoping-type relationship. When the node is modified, calls gocats.dag.AbstractNode._update_node() to repopulate the sets of descendants and ancestors. This represents a “lazy” evaluation of node descendants.
Returns: Set of gocats.dag.AbstractNode objects
Return type: set
ancestors
Property defining a set of nodes in the graph that are recursively forward of a node with a scoping-type relationship. When the node is modified, calls gocats.dag.AbstractNode._update_node() to repopulate the sets of descendants and ancestors. This represents a “lazy” evaluation of node ancestors.
Returns: Set of gocats.dag.AbstractNode objects
Return type: set
_update_node()
Repopulates ancestor and descendant sets for a node. Sets modification state to True.
Returns: None
Return type: None
add_edge(edge, allowed_relationships)
Adds a given gocats.dag.AbstractEdge to each gocats.dag.AbstractNode object that the edge connects. If there is a filter for the types of relationships allowed, edges with non-allowed relationship types are not processed. Sets modification state to True.
Returns: None
Return type: None
remove_edge(edge)
Removes a given gocats.dag.AbstractEdge from the gocats.dag.AbstractNode object. Also removes parent or child node references that the edge referenced. Sets modification state to True.
Returns: None
Return type: None
_update_descendants()
Used for the lazy evaluation of graph descendants of the current gocats.dag.AbstractNode object. Creates internal set variable, descendant_set. Iterates through node children until the bottom of the graph is reached. The descendant_set is a set of all nodes across all paths encountered from the current node.
Returns: None
Return type: None
_update_ancestors()
Used for the lazy evaluation of graph ancestors of the current gocats.dag.AbstractNode object. Creates internal set variable, ancestors_set. Iterates through node parents until the top of the graph is reached. The ancestors_set is a set of all nodes across all paths encountered from the current node.
Returns: None
Return type: None
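The lazy evaluation described for ancestors and descendants can be sketched as a cached property guarded by a modification flag. Everything here is an assumption made for illustration (the class LazyNode, the add_parent method, the recomputations counter); it shows only the ancestor side, and unlike a full graph it does not invalidate the caches of descendants when an ancestor changes.

```python
class LazyNode:
    """Toy node whose ancestor set is recomputed only after a modification."""

    def __init__(self, name):
        self.name = name
        self.parents = set()
        self._modified = True      # forces a computation on first access
        self._ancestors = set()
        self.recomputations = 0    # bookkeeping for this example only

    def add_parent(self, parent):
        self.parents.add(parent)
        self._modified = True

    @property
    def ancestors(self):
        if self._modified:
            self._update_ancestors()
            self._modified = False
        return self._ancestors

    def _update_ancestors(self):
        # Walk parent links until the top of the graph is reached.
        self.recomputations += 1
        found = set()
        stack = list(self.parents)
        while stack:
            node = stack.pop()
            if node not in found:
                found.add(node)
                stack.extend(node.parents)
        self._ancestors = found


root, middle, leaf = LazyNode("root"), LazyNode("middle"), LazyNode("leaf")
middle.add_parent(root)
leaf.add_parent(middle)
```

Reading leaf.ancestors twice in a row triggers a single traversal; adding another parent sets the flag again, so the next read recomputes the set.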
__weakref__
list of weak references to the object (if defined)
class gocats.dag.AbstractEdge(node1_id, node2_id, relationship_id, node_pair=None)
An OBO edge which links two ontology term nodes and contains a relationship type describing how the two nodes are related.
__init__(node1_id, node2_id, relationship_id, node_pair=None)
AbstractEdge initializer. Node pair refers to a tuple of gocats.dag.AbstractNode objects that are connected by the edge. Defaults to None and is later populated.
Parameters:
- node1_id (str) – The ID of the first term referenced from the ontology file’s relationship line.
- node2_id (str) – The ID of the second term referenced from the ontology file’s relationship line.
- relationship_id (str) – The ID of the relationship in the ontology file’s relationship line.
- node_pair (tuple) – Defaults to None. Provide a tuple containing two gocats.dag.AbstractNode objects if they are already created and able to be referenced.
json_edge
Property which returns a tuple where position 0 is a unique string representation of the edge, made by combining the ID of the reverse node and the ID of the forward node, and where position 1 is a list of two node IDs: the reverse and forward node.
Returns: tuple of a unique AbstractEdge ID and a list of that edge object’s reverse and forward node IDs, respectively. Returns an empty str at a position for which there are no forward or reverse nodes in the graph.
Return type: tuple
parent_id
Property defining the ID of the node forward of the current gocats.dag.AbstractEdge object.
Returns: str ID of the forward node in the node_pair associated with the edge if the edge’s relationship is assigned, None otherwise.
Return type: str or None
child_id
Property defining the ID of the node reverse of the current gocats.dag.AbstractEdge object.
Returns: str ID of the reverse node in the node_pair associated with the edge if the edge’s relationship is assigned, None otherwise.
Return type: str or None
forward_node
Property defining the gocats.dag.AbstractNode object forward of the current gocats.dag.AbstractEdge object.
Returns: gocats.dag.AbstractNode object of the forward node in the node_pair associated with the edge if the edge’s relationship is assigned, the node_pair is assigned, and the type of relationship is instantiated by gocats.dag.DirectionalRelationship; None otherwise.
Return type: gocats.dag.AbstractNode or None
reverse_node
Property defining the gocats.dag.AbstractNode object reverse of the current gocats.dag.AbstractEdge object.
Returns: gocats.dag.AbstractNode object of the reverse node in the node_pair associated with the edge if the edge’s relationship is assigned, the node_pair is assigned, and the type of relationship is instantiated by gocats.dag.DirectionalRelationship; None otherwise.
Return type: gocats.dag.AbstractNode or None
parent_node
Property defining the gocats.dag.AbstractNode object forward of the current gocats.dag.AbstractEdge object. This designation will be unique to scoping-type relationships, although this is not yet specified.
Returns: gocats.dag.AbstractNode object of the forward node in the node_pair associated with the edge if the edge’s relationship is assigned, the node_pair is assigned, and the type of relationship is instantiated by gocats.dag.DirectionalRelationship; None otherwise.
Return type: gocats.dag.AbstractNode or None
child_node
Property defining the gocats.dag.AbstractNode object reverse of the current gocats.dag.AbstractEdge object. This designation will be unique to scoping-type relationships, although this is not yet specified.
Returns: gocats.dag.AbstractNode object of the reverse node in the node_pair associated with the edge if the edge’s relationship is assigned, the node_pair is assigned, and the type of relationship is instantiated by gocats.dag.DirectionalRelationship; None otherwise.
Return type: gocats.dag.AbstractNode or None
connect_nodes(node_pair, allowed_relationships)
Adds the current edge object to the gocats.dag.AbstractNode objects that are connected by the edge. Populates the node_pair with gocats.dag.AbstractNode objects.
Returns: None
Return type: None
__weakref__
list of weak references to the object (if defined)
class gocats.dag.AbstractRelationship
A relationship as defined by a [typedef] stanza in an OBO ontology and augmented by GOcats to better interpret semantic correspondence.
__weakref__
list of weak references to the object (if defined)
class gocats.dag.DirectionalRelationship
A singly-directional relationship edge connecting two nodes in the graph. The two nodes are designated ‘forward’ and ‘reverse.’ The ‘forward’ node semantically succeeds the ‘reverse’ node in a way that depends on the context of the type of relationship describing the edge to which it is applied.
forward(pair)
Returns the forward node in a node pair that semantically succeeds the other and is independent of the directionality of the edge. Default position is the second position [1].
Parameters: pair (tuple) – A pair of gocats.dag.AbstractNode objects.
Returns: The forward gocats.dag.AbstractNode object as determined by the pre-defined semantic directionality of the relationship.
reverse(pair)
Returns the reverse node in a node pair that semantically precedes the other and is independent of the directionality of the edge. Default position is the first position [0].
Parameters: pair (tuple) – A pair of gocats.dag.AbstractNode objects.
Returns: The reverse gocats.dag.AbstractNode object as determined by the pre-defined semantic directionality of the relationship.
Gene Ontology Directed Acyclic Graph (GODAG)
Defines a Gene Ontology-specific graph which may have special properties when compared to other OBO formatted ontologies.
class gocats.godag.GoGraph(namespace_filter=None, allowed_relationships=None)
A Gene-Ontology-specific graph. GO-specific idiosyncrasies go here.
__init__(namespace_filter=None, allowed_relationships=None)
GoGraph initializer. Inherits and specializes properties from gocats.dag.OboGraph.
class gocats.godag.GoGraphNode
Extends AbstractNode to include GO relevant information.
__init__()
GoGraphNode initializer. Inherits all properties from gocats.dag.AbstractNode.
Directed Acyclic Subgraph (SubDAG)
A subgraph object of an OBOGraph object.
class gocats.subdag.SubGraph(super_graph, namespace_filter=None, allowed_relationships=None)
A subgraph of a provided supergraph with node contents.
__init__(super_graph, namespace_filter=None, allowed_relationships=None)
SubGraph initializer. Creates a subgraph object of gocats.dag.OboGraph. Leave namespace_filter and allowed_relationships as None to create the entire ontology graph. Otherwise, provide filters to limit what information is pulled into the subgraph.
Parameters:
- super_graph (obj) – A supergraph object i.e. gocats.godag.GoGraph.
- namespace_filter (str) – Specify the namespace of a sub-ontology namespace, if one is available for the ontology.
- allowed_relationships (list) – Specify a list of relationships to utilize in the graph, other relationships will be ignored.
root_id_mapping
Property describing a mapping dict that relates every ontology term ID of subgraphs in gocats.dag.OboGraph to a list of root gocats.subdag.CategoryNode IDs.
Returns: dict of gocats.subdag.SubGraphNode IDs mapped to a list of root gocats.subdag.CategoryNode IDs.
Return type: dict
root_node_mapping
Property describing a mapping dict that relates every ontology gocats.subdag.SubGraphNode object of subgraphs in gocats.subdag.SubGraph to a list of root gocats.subdag.CategoryNode objects.
Returns: dict of gocats.subdag.SubGraphNode objects mapped to a list of root gocats.subdag.CategoryNode objects.
Return type: dict
content_mapping
Property describing a mapping dict that relates every root gocats.subdag.CategoryNode ID of subgraphs in a gocats.subdag.SubGraph to a list of their subgraph nodes’ IDs.
Returns: dict of gocats.dag.AbstractNode IDs mapped to a list of gocats.dag.AbstractNode IDs.
Return type: dict
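The two ID mappings are inverses of one another in shape: content_mapping goes from a root ID to the IDs it contains, while root_id_mapping goes from a term ID to the roots that contain it. The sketch below shows that relationship with plain dictionaries. The function name and the sample IDs are illustrative assumptions, and GOcats builds these mappings from subgraph objects rather than by inverting a dictionary.

```python
def invert_content_mapping(content_mapping):
    """Turn {root ID: [member IDs]} into {member ID: [root IDs]}."""
    id_mapping = {}
    for root_id, member_ids in content_mapping.items():
        for term_id in member_ids:
            id_mapping.setdefault(term_id, []).append(root_id)
    return id_mapping


# Illustrative identifiers only; a term may belong to more than one category.
example_content = {
    "GO:0005634": ["GO:0005730", "GO:0031965"],
    "GO:0016020": ["GO:0031965", "GO:0005743"],
}
example_id_mapping = invert_content_mapping(example_content)
```

Terms that fall in no subgraph simply do not appear in the inverted dictionary of this sketch.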
subnode(super_node)
Defines a gocats.subdag.SubGraph node object. Calls add_node() to convert a supergraph node into a gocats.subdag.SubGraphNode and add this node to the subgraph.
Parameters: super_node – A node object from the supergraph i.e. gocats.godag.GoGraphNode.
Returns: A gocats.subdag.SubGraphNode object.
Return type: class
add_node(super_node)
Converts a supergraph node into a gocats.subdag.SubGraphNode and adds this node to the subgraph. Sets modification state to True.
Parameters: super_node (obj) – A node object from the supergraph i.e. gocats.godag.GoGraphNode.
Returns: None
Return type: None
connect_subnodes()
Analogous to gocats.dag.instantiate_valid_edges() and gocats.dag.AbstractEdge.connect_nodes(). Updates child and parent node sets for each gocats.subdag.SubGraphNode in the gocats.subdag.SubGraph. Adds edge object references to nodes and node object references to edges. Counts instances of relationship IDs and sets modification state to True.
Returns: None
Return type: None
greedily_extend_subgraph()
Extends a seeded subgraph to include all supergraph descendants of the nodes. Searches through the supergraph to add new SubGraphNode objects.
Returns: None
Return type: None
conservatively_extend_subgraph()
Not currently in use. Needs to be updated to handle CategoryNode.
Extends a seeded subgraph to include only nodes in the supergraph that occur along paths between nodes in the subgraph. Searches through the supergraph to add new node objects.
Returns: None
Return type: None
remove_orphan_paths()
Not currently in use. Needs to be updated to handle CategoryNode.
Removes nodes and their descendants from the subgraph which do not root to the category-representative node.
Returns: None
Return type: None
static find_representative_nodes(subgraph, search_string_list)
Compiles a list of candidate gocats.subdag.SubGraphNode objects from the gocats.subdag.SubGraph object based on a list of search strings matching strings in the names of the nodes (using regular expressions). Returns a list containing a single candidate node with the highest number of descendants when possible, returns the sole node if the subgraph only contains one node, returns a list of all seeded nodes when choosing candidates is impossible, or aborts if the subgraph is empty.
Parameters:
- subgraph – A gocats.subdag.SubGraph object.
- search_string_list – A list of search term str entries.
Returns: A list of one or more candidate gocats.subdag.SubGraphNode objects chosen as the subgraph’s representative ontology term(s).
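The selection rules listed above can be walked through on plain dictionaries. This is a hedged sketch, not the SubGraph method: the function name pick_representatives, the whole-word regular expression, raising ValueError for an empty subgraph, and treating "no name matches" as the case where choosing is impossible are all assumptions.

```python
import re


def pick_representatives(names, descendant_counts, search_string_list):
    """Choose representative node IDs from {node ID: name} by name match.

    descendant_counts: {node ID: number of descendants} (missing IDs count as 0).
    """
    if not names:
        raise ValueError("subgraph is empty")
    if len(names) == 1:
        return list(names)  # the sole node represents the subgraph
    patterns = [
        re.compile(r"\b" + re.escape(s) + r"\b", re.IGNORECASE)
        for s in search_string_list
    ]
    candidates = [
        node_id
        for node_id, name in names.items()
        if any(p.search(name) for p in patterns)
    ]
    if not candidates:
        return sorted(names)  # no basis for choosing: keep every seeded node
    best = max(descendant_counts.get(c, 0) for c in candidates)
    # Ties on descendant count are all returned (an assumption of this sketch).
    return sorted(c for c in candidates if descendant_counts.get(c, 0) == best)


# Placeholder IDs and counts for illustration.
example_names = {"N1": "nucleus", "N2": "nuclear membrane", "N3": "nucleolus"}
example_counts = {"N1": 10, "N2": 2, "N3": 1}
```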
static from_filtered_graph(super_graph, subgraph_name, keyword_list, namespace_filter=None, allowed_relationships=None, extension='greedy')
Static method for extracting a subgraph from the supergraph by selecting nodes that contain vocabulary in the supplied keyword list. Leave namespace_filter and allowed_relationships as None to create the entire ontology graph. Otherwise, provide filters to limit what information is pulled into the subgraph. The graph extension variable defaults to ‘greedy’, which calls greedily_extend_subgraph() to add nodes to the subgraph after instantiation. Conversely, ‘conservative’ may be used to call conservatively_extend_subgraph() for this function.
Parameters:
- super_graph (obj) – A supergraph object i.e. gocats.godag.GoGraph.
- subgraph_name (str) – The name of the subgraph being created; will be used as the id of the gocats.subdag.CategoryNode.
- keyword_list – A list of str entries used to query the supergraph for concepts to be extracted into subgraphs.
- namespace_filter (str) – Specify the namespace of a sub-ontology namespace, if one is available for the ontology.
- allowed_relationships (list) – Specify a list of relationships to utilize in the graph, other relationships will be ignored.
- extension (str) – Specify ‘greedy’ or ‘conservative’ to determine how subgraphs will be extended after creation (defaults to greedy).
Returns: A gocats.subdag.SubGraph object.
class
gocats.subdag.
SubGraphNode
(super_node=None, allowed_relationships=None)[source]¶ An instance of a node within a subgraph of an OBO ontology (supergraph)
-
__init__
(super_node=None, allowed_relationships=None)[source]¶ SubGraphNode initializer. Inherits from
gocats.dag.AbstractNode
and contains a reference to the supergraph node it represents e.g.gocats.godag.GoGraphNode
.Parameters: - super_node – A node from the supergraph.
- allowed_relationships – Not currently used Used to specify a list of allowable relationships evaluated between nodes.
-
super_edges
¶ property
describing the set of edges referenced in the supergraph node, filtered to only those- edges with nodes in the subgraph node.
Returns: A set of gocats.subgraph.SubGraphNode
edges that were copied from the supergraph node.Return type: set
-
id
¶ property
describing the ID of the supernodeReturns: The ID of a supernode e.g. gocats.godag.GoGraphNode
Return type: str
-
name
¶ property describing the name of the supernode.
Returns: The name of a supernode e.g. gocats.godag.GoGraphNode
Return type: str
-
definition
¶ property describing the definition of the supernode.
Returns: The definition of a supernode e.g. gocats.godag.GoGraphNode
Return type: str
-
namespace
¶ property describing the namespace of the supernode.
Returns: A namespace of a supernode e.g. gocats.godag.GoGraphNode
Return type: str
-
Ontology Parser¶
A parser which reads ontologies in the OBO format and calls appropriate graph objects to store information in a graph representation. Separate parsing classes within this module operate on distinct ontologies in the OBO Foundry to handle any subtle differences among ontologies.
-
class
gocats.ontologyparser.
OboParser
[source]¶ A scaffolding for parsing OBO formatted ontologies. Contains regular expressions for the basic stanzas and information pertinent for creating a graph object of an ontology.
-
__init__
()[source]¶ OboParser initializer. Contains Regular Expressions for identifying crucial information from OBO formatted ontologies.
-
__weakref__
¶ list of weak references to the object (if defined)
-
-
class
gocats.ontologyparser.
GoParser
(database_file, go_graph, relationship_directionality='gocats')[source]¶ An ontology parser specific to Gene Ontology
-
__init__
(database_file, go_graph, relationship_directionality='gocats')[source]¶ GoParser initializer. Parses a Gene Ontology database file and adds properties found therein to a godag.GoGraph object. Importantly, includes descriptions of the semantic directionality of all GO relationships.
Parameters: - database_file (file_handle) – Specify the location of a Gene Ontology .obo file.
- go_graph – A gocats.godag.GoGraph object.
Returns: None Return type: None
-
parse
()[source]¶ Parses the ontology database file and accesses the ontology graph object to add information found in the database. Once all information is added, this function calls the graph’s instantiate_valid_edges function to connect all nodes in the graph by their edges.
Returns: None Return type: None
-
Tools¶
Functions for handling some file input and output and reformatting tasks in GOcats.
-
gocats.tools.
json_save
(obj, filename)[source]¶ Takes a Python object, converts it into a JSON serializable object (if it is not already), and saves it to a file that is specified.
Parameters: - obj – A Python obj.
- filename (file_handle) – A path to output the resulting JSON file.
-
gocats.tools.
jsonpickle_save
(obj, filename)[source]¶ Takes a Python object, converts it into a JsonPickle string, and writes it out to a file.
Parameters: - obj – A Python obj.
- filename (file_handle) – A path to output the resulting JsonPickle file.
-
gocats.tools.
jsonpickle_load
(filename)[source]¶ Takes a JsonPickle file and loads in the JsonPickle object into a Python object.
Parameters: filename (file_handle) – A path to a JsonPickle file.
-
gocats.tools.
list_to_file
(filename, data)[source]¶ Makes a text document from a list of data, with each line of the document being one item from the list, and outputs the document into a file.
Parameters: - filename (file_handle) – A path to the output file.
- data – A Python list.
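As a rough illustration of the behavior described for list_to_file, the sketch below writes one list item per line. It is a standalone stand-in written for this reference (the name write_list_to_file is hypothetical), not the GOcats implementation, and it assumes items are written using their default string form.

```python
import os
import tempfile


def write_list_to_file(filename, data):
    """Write each item of `data` on its own line of `filename` (hypothetical helper)."""
    with open(filename, "w") as out_file:
        for item in data:
            out_file.write("{}\n".format(item))


# Usage: write three GO IDs to a temporary file and read them back.
go_ids = ["GO:0005739", "GO:0005634", "GO:0005764"]
tmp_path = os.path.join(tempfile.mkdtemp(), "go_ids.txt")
write_list_to_file(tmp_path, go_ids)
with open(tmp_path) as in_file:
    lines_read = in_file.read().splitlines()
```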
User Guide¶
Description¶
GOcats is an Open Biomedical Ontology (OBO) parser and categorizing utility, currently specialized for the Gene Ontology (GO), which can help scientists interpret large-scale experimental results by organizing redundant and highly specific annotations into customizable, biologically relevant concept categories. Concept subgraphs are defined by lists of keywords created by the user.
- Currently, the GOcats package can be used to:
- Create subgraphs of GO which each represent a user-specified concept.
- Map specific, or fine-grained, GO terms in a Gene Annotation File (GAF) to an arbitrary number of concept categories.
- Reorganize GO terms based on allowed term-term relationships, and re-create the gene to all direct and ancestor GO terms.
- Explore the Gene Ontology graph within a Python interpreter.
Installation¶
GOcats runs under Python 3.4+ and is available through python3-pip. Install via pip or clone the git repo and install the following dependencies and you are ready to go!
Install on Linux¶
Pip installation (method 1)¶
Dependencies should automatically be installed using this method. It is strongly recommended that you install with this method.
pip3 install gocats
GitHub Package installation (method 2)¶
Make sure you have git installed:
cd ~/
git clone https://github.com/MoseleyBioinformaticsLab/GOcats.git
Dependencies¶
GOcats requires the following Python libraries:
- docopt for creating the gocats command-line interface.
- JSONPickle for saving Python objects in a JSON serializable form and outputting to a file.
To install dependencies manually:
pip3 install docopt
pip3 install jsonpickle
Install on Windows¶
Windows version not yet available. Sorry about that.
Basic usage¶
To see command line arguments and options, navigate to the project directory and run the --help option:
cd ~/GOcats
python3 -m gocats --help
gocats
can be used in the following ways:
To extract subgraphs of Gene Ontology that represent user-defined concepts and create mappings between high level concepts and their subgraph content terms.
1. Create a CSV file, where column 1 is the name of the concept category (this can be anything) and column 2 is a list of keywords/phrases delineating that concept (separated by semicolons). See The GOcats Tutorial for more information.
2. Download a Gene Ontology database obo file.
3. To create mappings, run the GOcats command, gocats.gocats.create_subgraphs(). If you installed by cloning the repository from GitHub, first navigate to the GOcats project directory or add the directory to the PYTHONPATH.
python3 -m gocats create_subgraphs <ontology_database_file> <keyword_file> <output_directory>
4. Mappings can be found in your specified <output_directory>:
- GC_content_mapping.json_pickle # A python dictionary with category-defining GO terms as keys and a list of all subgraph contents as values.
- GC_id_mapping.json_pickle # A python dictionary with every GO term of the specified namespace as keys and a list of category root terms as values.
To map gene annotations in a Gene Annotation File (GAF) to a set of user-defined categories.
1. Create mapping files as defined in the previous section.
2. Run gocats.gocats.categorize_dataset() to map terms to their categories:
# NOTE: Use the GC_id_mapping.json_pickle file.
python3 -m gocats categorize_dataset <GAF_file> <term_mapping_file> <output_directory> <mapped_gaf_filename>
3. The output GAF will have the specified <mapped_gaf_filename> in the <output_directory>.
To reorganize parent-child Gene Ontology term relationships and the gene annotations with a set of user-defined relationships. This has been shown to increase statistical power in GO enrichment calculations (see Hinderer et al., 2019, under Citation).
1. Download a Gene Ontology database obo file.
2. Download a Gene Ontology gene annotation format gaf file.
3. Run gocats.gocats.remap_goterms() to generate new gene to annotation relationships:
python3 -m gocats remap_goterms <go_database> <goa_gaf> <ancestor_filename> <namespace_filename> [--allowed_relationships=<relationships> --identifier_column=<column>]
4. --allowed_relationships should be a comma-separated string: is_a,part_of,has_part
5. The output <ancestor_filename> will be in JSON format, with genes as the keys and their annotated GO terms as the values.
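The command line takes the relationships as one comma-separated string, while the Python examples in this documentation pass a list. A small helper for converting between the two might look like the sketch below; parse_relationships is a hypothetical name and is not part of GOcats.

```python
def parse_relationships(option_value):
    """Split a comma-separated option string into a list of relationship names."""
    return [name.strip() for name in option_value.split(",") if name.strip()]


relationships = parse_relationships("is_a,part_of,has_part")
print(relationships)
```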
Within the Python interpreter to explore the Gene Ontology graph (advanced usage, see The GOcats Tutorial for more information).
1. If you’ve installed GOcats via pip, importing should work as expected. Otherwise, navigate to the Git project directory, open a Python 3.4+ interpreter, and import GOcats:
>>> from gocats import gocats as gc
2. Create the graph object using gocats.gocats.build_graph_interpreter():
>>> # May filter to GO sub-ontology or to a set of relationships.
>>> my_graph = gc.build_graph_interpreter("path_to_database_file")
You may now access all properties of the Gene Ontology graph object. Here are a couple of examples:
>>> # See the descendants of a term node, GO:0006306
>>> descendant_set = my_graph.id_index['GO:0006306'].descendants
>>> [node.name for node in descendant_set]
>>> # Access all graph leaf nodes
>>> leaf_nodes = my_graph.leaves
>>> [node.name for node in leaf_nodes]
The GOcats Tutorial¶
- Currently, GOcats can be used to:
- Create subgraphs of the Gene Ontology (GO) which each represent a user-specified concept.
- Map specific, or fine-grained, GO terms in a Gene Annotation File (GAF) to an arbitrary number of concept categories.
- Remap ancestor Gene Ontology term relationships and the gene annotations with a set of user defined relationships.
- Explore the Gene Ontology graph within a Python interpreter.
In this document, each use case will be explained in-depth.
Using GOcats to create subgraphs representing user-specified concepts¶
Before starting, it is important to decide what concepts you as the user wish to extract from the Gene Ontology. You may have an investigation that is focused on concepts like “DNA repair” or “autophagy,” or you may simply be interested in enumerating many arbitrary categories and seeing how ontology terms are shared between concepts. As an example to use in this tutorial, let’s consider a goal of extracting subgraphs that represent some typical subcellular locations of a eukaryotic cell.
Create a keyword file¶
The phrase “keyword file” might be slightly misleading because GOcats does not only handle keywords, but also short phrases that may be used to define a concept. Therefore, both may be used in combination in the keyword CSV file.
The CSV file is formatted as so:
- Each row represents a separate concept.
- Column 1 is the name of the concept (this is for reference and will not be used to parse GO).
- Column 2 is a list of keywords or short phrases used to describe the concept in question.
- Each item in column 2 is separated by a semicolon (;) with no whitespace around the semicolon.
- Here is an example of what the file contents should look like (do not include the header row in the actual file):
============== ==========================================
Concept        Keywords/phrases
============== ==========================================
mitochondria   mitochondria;mitochondrial;mitochondrion
nucleus        nucleus;nuclei;nuclear
lysosome       lysosome;lysosomal;lysosomes
vesicle        vesicle;vesicles
er             endoplasmic;sarcoplasmic;reticulum
golgi          golgi;golgi apparatus
extracellular  extracellular;secreted
cytosol        cytosol;cytosolic
cytoplasm      cytoplasm;cytoplasmic
cell membrane  plasma;plasma membrane
cytoskeleton   cytoskeleton;cytoskeletal
============== ==========================================
We’ll imagine this file is located in the home directory and is called cell_locations.csv.
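A keyword file in this format can also be written from Python. The following is a minimal sketch using the standard csv module; it writes a subset of the example concepts to a temporary location (rather than the home directory) and reads the rows back. The variable names are illustrative, and a comma-delimited file is assumed.

```python
import csv
import os
import tempfile

# Concept name -> keywords/phrases (a subset of the example table).
concepts = {
    "mitochondria": ["mitochondria", "mitochondrial", "mitochondrion"],
    "nucleus": ["nucleus", "nuclei", "nuclear"],
    "golgi": ["golgi", "golgi apparatus"],
}

# A temporary path is used here; in the tutorial the file is ~/cell_locations.csv.
csv_path = os.path.join(tempfile.mkdtemp(), "cell_locations.csv")

with open(csv_path, "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for concept, keywords in concepts.items():
        # Join with ';' and no surrounding whitespace; no header row is written.
        writer.writerow([concept, ";".join(keywords)])

# Read the file back into a concept -> keyword list dictionary.
with open(csv_path, newline="") as csv_file:
    keywords_by_concept = {row[0]: row[1].split(";") for row in csv.reader(csv_file)}
```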
Download the Gene Ontology .obo file¶
The go.obo
file is available here: http://www.geneontology.org/page/download-ontology.
Be sure to download the .obo-formatted version.
All releases of GO in this format as of Jan 2015 have been verified to be compatible with GOcats.
We’ll assume this database file is located in the home directory and is called go.obo.
Extract subgraphs and create concept mappings¶
This is where GOcats does the heavy lifting.
We’ll assume GOcats was already installed via pip or the repository was already cloned into the home directory (refer to User Guide for instructions on how to install GOcats).
We can now use Python to run the gocats.gocats.create_subgraphs()
function.
We can also specify that we only want to parse the cellular_component sub-ontology of GO (the supergraph_namespace
), since we are only interested in concepts of this type.
Although it is redundant, we can also play it safe and limit subgraph creation to only consider terms listed in cellular_component as well (the subgraph_namespace
).
Run the following if you have installed via pip (if running from the Git repository, navigate to the GOcats directory or add this directory to your PYTHONPATH beforehand).
python3 -m gocats create_subgraphs ~/go.obo ~/cell_locations.csv ~/cell_locations_output --supergraph_namespace=cellular_component --subgraph_namespace=cellular_component
The results will be output to ~/cell_locations_output
.
Let’s look at the output files¶
In the output directory (i.e. ~/cell_locations_output
) we can see several files. The following table describes what can be found in each:
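============================== ==================================================================================================
File Name                      Description
============================== ==================================================================================================
GC_content_mapping.json        JSON version of Python dictionary (keys: concept root nodes, values: list of subgraph term nodes).
GC_content_mapping.json_pickle Same as above, but a JSONPickle version of the dictionary.
GC_id_mapping.json             JSON version of Python dictionary (keys: subgraph term nodes, values: list of concept roots).
GC_id_mapping.json_pickle      Same as above, but a JSONPickle version of the dictionary.
id_translation.json_pickle     A JSONPickle version of a Python dictionary mapping GO IDs to the name of the term.
NetworkTable.csv               A csv version of id_translation for visualizing in Cytoscape (best results with --map_supersets).
subgraph_report.txt            A summary of the subgraphs extracted for mapping. See below for more details.
============================== ==================================================================================================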
We can look in subgraph_report.txt to get an overview of what our subgraphs contain, how they were constructed, and how they compare to the overall GO graph.
subgraph_report.txt
The first few lines give an overview of the subgraphs and supergraph (which is the full GO graph, unless a supergraph_namespace filter was used). In our example case, the supergraph is the cellular_component ontology of GO.
In each divided section, the first line indicates the subgraph name (the one provided from column 1 in the keyword file). The following describes the meaning of the values in each section:
- Subgraph relationships: the prevalence of relationship types in the subgraph.
- Seeded size: how many GO terms were initially filtered from GO with the keyword list.
- Representative node: the name of the GO term chosen as the root for that concept’s subgraph.
- Nodes added: the number of GO terms added when extending the seeded subgraph to descendants not captured by the initial search.
- Non-subgraph hits (orphans): GO terms that were captured by the keyword search, but do not belong to the subgraph.
- Total nodes: the total number of GO terms in the subgraph.
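To give a feel for what "seeded size" means, the toy sketch below filters a handful of made-up term records with a keyword list. The term data, the function name seed_terms, and the whole-word matching rule are assumptions made for illustration; GOcats' own search may differ in its details.

```python
import re

# Toy stand-ins for GO terms: ID -> (name, definition). Illustrative data only.
toy_terms = {
    "T:1": ("mitochondrion", "An organelle found in eukaryotic cells."),
    "T:2": ("mitochondrial matrix", "The gel-like material inside the mitochondrion."),
    "T:3": ("nucleus", "A membrane-bounded organelle containing chromosomes."),
    "T:4": ("nuclear pore", "A protein-lined channel in the nuclear envelope."),
}


def seed_terms(terms, keywords):
    """Return IDs of terms whose name or definition contains a keyword as a whole word or phrase."""
    patterns = [re.compile(r"\b{}\b".format(re.escape(kw)), re.IGNORECASE) for kw in keywords]
    seeded = set()
    for term_id, (name, definition) in terms.items():
        text = "{} {}".format(name, definition)
        if any(pattern.search(text) for pattern in patterns):
            seeded.add(term_id)
    return seeded


mito_seed = seed_terms(toy_terms, ["mitochondria", "mitochondrial", "mitochondrion"])
nucleus_seed = seed_terms(toy_terms, ["nucleus", "nuclei", "nuclear"])
print(len(mito_seed), len(nucleus_seed))  # the "seeded size" of each toy concept
```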
Loading mapping files programmatically (optional)¶
While GOcats can use the mapping files described in the previous section to map terms in a GAF, it may also be useful to load them into your own scripts. Since the mappings are saved in JSON and JSONPickle formats, it is relatively simple to load them in programmatically:
>>> # Loading a JSON file
>>> import json
>>> with open('path_to_json_file', 'r') as json_file:
...     json_str = json_file.read()
...     json_obj = json.loads(json_str)
...
>>> my_mapping = json_obj
>>> # Loading a JSONPickle file
>>> import jsonpickle
>>> with open('path_to_jsonpickle_file', 'r') as jsonpickle_file:
...     jsonpickle_str = jsonpickle_file.read()
...     jsonpickle_obj = jsonpickle.decode(jsonpickle_str, keys=True)
...
>>> my_mapping = jsonpickle_obj
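Once loaded, the mappings are ordinary dictionaries. The sketch below uses small made-up dictionaries with the shapes described in the output file table (content mapping: category root to subgraph terms; ID mapping: term to category roots) to show two typical lookups. The IDs are placeholders, not real GO terms.

```python
# Placeholder mappings with the same shape as the GOcats output files.
content_mapping = {
    "ROOT:A": ["TERM:1", "TERM:2", "TERM:3"],
    "ROOT:B": ["TERM:3", "TERM:4"],
}
id_mapping = {
    "TERM:1": ["ROOT:A"],
    "TERM:2": ["ROOT:A"],
    "TERM:3": ["ROOT:A", "ROOT:B"],
    "TERM:4": ["ROOT:B"],
}

# Number of terms in each category.
category_sizes = {root: len(terms) for root, terms in content_mapping.items()}

# Terms that belong to more than one category.
shared_terms = sorted(term for term, roots in id_mapping.items() if len(roots) > 1)

print(category_sizes)
print(shared_terms)
```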
Using GOcats to map specific gene annotations in a GAF to custom categories¶
With mapping files produced from the previous steps, it is possible to create a GAF with annotations mapped to the categories, or concepts, that we define.
Let’s consider our current cell_locations example and imagine that we have some gene set containing annotations in a GAF called dataset_GAF.goa
in the home directory.
To map these annotations, use the gocats.gocats.categorize_dataset()
function.
Again, this should work from any location if you’ve installed via pip, otherwise navigate to the GOcats directory or add this directory to your PYTHONPATH and run the following:
# Note that you need to use the GC_id_mapping.json_pickle file for this step
python3 -m gocats categorize_dataset ~/dataset_GAF.goa ~/cell_locations_output/GC_id_mapping.json_pickle ~/mapped_dataset mapped_GAF.goa
Here, we named the output directory ~/mapped_dataset
and we named the mapped GAF mapped_GAF.goa
.
The mapped GAF and a list of unmapped genes will be stored in the output directory.
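Conceptually, categorizing a GAF means replacing the GO ID in each annotation row with the category root(s) it maps to. The sketch below illustrates that idea on two hand-written, tab-separated rows; it is not the GOcats implementation. It assumes the GO ID sits in the fifth GAF column and the gene symbol in the third, and it uses placeholder IDs and a hypothetical function name.

```python
GO_ID_COLUMN = 4   # zero-based index of the fifth column (assumed GO ID position)
SYMBOL_COLUMN = 2  # zero-based index of the third column (assumed gene symbol position)


def categorize_rows(rows, id_mapping):
    """Return (mapped_rows, unmapped_symbols) for a list of GAF-like rows."""
    mapped_rows = []
    unmapped = set()
    for row in rows:
        roots = id_mapping.get(row[GO_ID_COLUMN], [])
        if not roots:
            # Record the symbol of any row whose GO ID has no category.
            unmapped.add(row[SYMBOL_COLUMN])
            continue
        for root in roots:
            # Emit one output row per category root.
            new_row = list(row)
            new_row[GO_ID_COLUMN] = root
            mapped_rows.append(new_row)
    return mapped_rows, unmapped


gaf_lines = [
    "DB\tID1\tGENE1\t\tTERM:3",
    "DB\tID2\tGENE2\t\tTERM:9",
]
rows = [line.split("\t") for line in gaf_lines]
id_mapping = {"TERM:3": ["ROOT:A", "ROOT:B"]}

mapped, unmapped_genes = categorize_rows(rows, id_mapping)
```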
Using GOcats to remap ancestor Gene Ontology term relationships and the gene annotations with a set of user defined relationships¶
In addition to the is_a and part_of relationships normally used to propagate gene annotations to ancestor GO terms, GOcats can properly handle has_part relationships. We have previously shown that doing this can improve the statistical power of GO term enrichment (see Hinderer et al., 2019, under Citation). For this use case, we need a Gene Ontology obo file as well as a gene annotation format (GAF) file.
python3 -m gocats remap_goterms ~/go.obo ~/goa_human.gaf ~/ancestors.json ~/namespace.json --allowed_relationships=is_a,part_of,has_part --identifier_column=1
The output in ancestors.json will be a JSON object in which each gene identifier is a key whose value is the list of GO terms annotated to that gene. namespace.json provides the new namespace for each GO term.
On the command line, --allowed_relationships takes a comma-separated string of relationships to use, whereas the Python API takes a list.
GAF files often carry two identifiers for each gene product: the database identifier (e.g. UniProt for human) and the gene symbol.
--identifier_column allows the user to select the database identifier (1) or the gene symbol (2) as the identifier in the output.
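For downstream enrichment work it is often convenient to invert the gene-to-terms output into a term-to-genes table. The sketch below does this on a small inline JSON string with made-up gene and term identifiers; it assumes the file is keyed by gene, as described above.

```python
import json
from collections import defaultdict

# Stand-in for the contents of ancestors.json (identifiers are made up).
ancestors_json = '{"GENE1": ["TERM:1", "TERM:2"], "GENE2": ["TERM:2"]}'
gene_to_terms = json.loads(ancestors_json)

# Invert to term -> set of genes.
term_to_genes = defaultdict(set)
for gene, terms in gene_to_terms.items():
    for term in terms:
        term_to_genes[term].add(gene)

for term in sorted(term_to_genes):
    print(term, sorted(term_to_genes[term]))
```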
Exploring Gene Ontology graph in a Python interpreter or in your own Python project¶
If you’ve installed GOcats via pip, importing should work as expected. Otherwise, navigate to the Git project directory, open a Python 3.4+ interpreter, and import GOcats:
>>> import gocats
Next, create the graph object using gocats.gocats.build_graph_interpreter()
. Since we have been looking at the
cellular_component sub-ontology in this example, we can specify that we only want to look at that part of the graph with
the supergraph_namespace option. Additionally we can filter the relationship types using the allowed_relationships
option (only is_a, has_part, and part_of exist in cellular_component, so this is just for demonstration):
>>> import os
>>> # Expand "~" explicitly; Python does not expand it when opening files.
>>> go_obo = os.path.expanduser("~/go.obo")
>>> # May filter to GO sub-ontology or to a set of relationships.
>>> my_graph = gocats.gocats.build_graph_interpreter(go_obo, supergraph_namespace="cellular_component", allowed_relationships=["is_a", "has_part", "part_of"])
>>> full_graph = gocats.gocats.build_graph_interpreter(go_obo)
The filtered graph (my_graph
) and the full GO graph (full_graph
) can now be explored.
The graph object contains an id_index which allows one to access node objects by GO IDs like so:
>>> my_node = my_graph.id_index['GO:0004567']
It also contains a node_list
and an edge_list
.
Edges and nodes in the graph are objects themselves.
>>> print(my_node.name)
Here is a list of some important graph, node, and edge data members and properties:
- Graph
  - node_list: list of node objects in the graph.
  - edge_list: list of edge objects in the graph.
  - id_index: dictionary of node IDs that point to their respective node objects.
  - vocab_index: dictionary listing every word used in the gene ontology, pointing to node objects those words can be found in.
  - relationship_index: dictionary of relationships in the supergraph, pointing to their respective relationship objects.
  - root_nodes: a set of root nodes of the supergraph.
  - orphans: a set of nodes which have no parents.
  - leaves: a set of nodes which have no children.
- Node
  - id
  - name
  - definition
  - namespace
  - edges: a set of edges that connect the node.
  - parent_node_set
  - child_node_set
  - descendants: a set of recursive graph children.
  - ancestors: a set of recursive graph parents.
- Edge
  - node_pair_id: tuple of IDs of the nodes connected by the edge.
  - node_pair: a tuple of the node objects connected by the edge.
  - relationship_id: the ID of the relationship type (i.e. the name of the relationship).
  - relationship: the relationship object used to describe the edge.
  - parent_id
  - parent_node
  - child_id
  - child_node
  - forward_node: see The GOcats API Reference
  - reverse_node: see The GOcats API Reference
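The relationship between child_node_set, descendants, and leaves can be illustrated with a toy graph. The ToyNode class below is a simplified stand-in written for this tutorial, not a GOcats class; it only mirrors some of the attribute names listed above.

```python
class ToyNode:
    """Minimal node with parent/child sets (stand-in, not a GOcats class)."""

    def __init__(self, node_id):
        self.id = node_id
        self.parent_node_set = set()
        self.child_node_set = set()

    @property
    def descendants(self):
        """All nodes reachable by repeatedly following child links."""
        found = set()
        stack = list(self.child_node_set)
        while stack:
            node = stack.pop()
            if node not in found:
                found.add(node)
                stack.extend(node.child_node_set)
        return found


def link(parent, child):
    """Connect two toy nodes in both directions."""
    parent.child_node_set.add(child)
    child.parent_node_set.add(parent)


# Build a tiny graph: root -> a -> b, and root -> c.
root, a, b, c = (ToyNode(name) for name in ("root", "a", "b", "c"))
link(root, a)
link(a, b)
link(root, c)

nodes = [root, a, b, c]
leaves = {node for node in nodes if not node.child_node_set}
descendant_ids = sorted(node.id for node in root.descendants)
leaf_ids = sorted(node.id for node in leaves)
```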
Plotting subgraphs in Cytoscape for visualization¶
Coming soon!