Welcome to BioTK’s documentation!¶
Overview¶
BioTK is a Python toolkit, containing a library and scripts, for various bioinformatics tasks:
- Differential expression analysis on microarray and RNA-seq data
- Functional and downstream analysis of DE gene lists
- Ontology handling
- Text mining, ranging from shallow NLP to relation extraction
- Efficient storage and querying of sets of genomic intervals (similar to BEDTools, GenomicRanges, etc.)
Tutorial¶
Expression analysis¶
BioTK aims to provide an alternative to the standard R/Bioconductor environment for routine differential expression analyses. To that end, BioTK can perform all the standard steps in a differential expression analysis pipeline:
- Loading raw or preprocessed data
- Preprocessing and normalizing the data
- Finding differentially expressed probes/genes between conditions
- Analyses of DE gene lists:
  - Performing enrichment analyses against ontologies
  - Visualizing expression or DE results as heatmaps or networks
There are also features for downstream analyses, as well as methods to take large collections of expression data (from GEO, in-house data, or a combination thereof) and use these collections for large-scale meta-analysis.
Todo
- put a simple example of a complete-ish analysis here
- possibly explain important data structures?
Differential expression¶
Currently, the available differential expression algorithms are:
- t-test
- ANOVA
- SAM
In the future, we plan to provide either a port or a simplified Python interface to the R package limma, which is one of the most popular tools for finding DE genes.
T-test¶
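As an illustration of what a per-gene t-test computes, the sketch below calculates Welch's t statistic for one gene across two groups of samples, using only the standard library. The function name `welch_t` and the example values are hypothetical and are not part of BioTK's API; BioTK's own implementation may differ (for example, it may assume equal variances).

```python
import math
import statistics


def welch_t(group_a, group_b):
    """
    Welch's t statistic for two independent samples.

    Hypothetical helper for illustration only; not a BioTK function.
    """
    mean_a = statistics.mean(group_a)
    mean_b = statistics.mean(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


# Made-up log-expression values for one gene in two conditions.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
t_statistic = welch_t(treated, control)
```

A large absolute value of `t_statistic` suggests the gene differs between conditions; turning it into a p-value additionally requires the t distribution, which is omitted here.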
ANOVA¶
SAM¶
Enrichment analysis¶
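One common way to test a DE gene list for enrichment against an ontology term is a one-sided hypergeometric (over-representation) test. The sketch below assumes that approach; this documentation does not state which test BioTK uses, and the function name and numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """
    Probability of seeing at least `overlap` annotated genes by chance.

    Models drawing `selected` genes without replacement from `population`
    genes, of which `annotated` belong to the ontology term.
    Hypothetical helper for illustration only; not a BioTK function.
    """
    upper = min(annotated, selected)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / comb(population, selected)


# Toy numbers: 10 genes measured, 4 annotated to the term,
# 3 genes called DE, 2 of which carry the annotation.
p_value = hypergeom_enrichment_p(population=10, annotated=4, selected=3, overlap=2)
```

In practice the test is repeated for every term, so the resulting p-values need a multiple-testing correction.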
Meta-analysis¶
BioTK can store large amounts of expression data from multiple experiments and even multiple organisms and efficiently perform meta-analyses on this data. Please see Transcript expression meta-analysis.
Transcript expression meta-analysis¶
BioTK provides a way to store up to millions of expression vectors, along with associated probe mappings and phenotype/sample data, in HDF5 format on your hard drive for efficient querying and meta-analysis. HDF5 has many benefits, including query efficiency and the ability to store all your expression data in a single file that can be easily backed up or transferred to different computers.
If you like, you can easily store and query all available GEO samples for your model organism(s) of choice.
Importing data¶
Performing a meta-analysis¶
Finding coexpressed genes¶
Inferring categories for genes or samples¶
Development information¶
General development information¶
Contributors¶
License¶
BioTK is licensed under the GNU Affero General Public License version 3 (or any later version thereof, at your option).
How to contribute¶
If you’d like to contribute to BioTK (yay!), please follow the guidelines below. In summary:
- Write the code
- Write documentation
- Write tests (that pass)
- Submit a pull request on GitHub
Although we use PEP 8 for variable names and the NumPy docstring format (with a few exceptions), don’t obsess over following every detail of these style guides, especially if you are new to them. Just do what you can, and we can fix minor errors later.
Writing code¶
Naming conventions¶
Code mostly follows the standard Python naming conventions from PEP 8 (lower_case_with_underscore for variables and functions, CamelCase for classes, a leading underscore for private variables or methods, etc.).
There is one exception: modules and packages. Because of the huge preponderance of acronyms in biomedicine, modules are CamelCase with acronyms all uppercase if they describe a file format, external program, or well-known algorithm. Otherwise, they are lower-case.
Examples:
- BioTK.io.BEDGraph
- BioTK.io.Aspera
- BioTK.text.parse
- BioTK.text.AhoCorasick
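The conventions for variables, functions, classes, and private members can be sketched as follows. All names here are hypothetical and chosen for illustration; none are taken from BioTK.

```python
class IntervalIndex:
    """
    CamelCase class name. Hypothetical; not a BioTK class.
    """

    def __init__(self):
        # Leading underscore marks a private attribute.
        self._intervals = []

    def add_interval(self, start, end):
        # lower_case_with_underscore for methods and variables.
        new_interval = (start, end)
        self._intervals.append(new_interval)

    def count_overlaps(self, start, end):
        # Half-open intervals [s, e) overlap when each starts before the other ends.
        return sum(1 for s, e in self._intervals if s < end and start < e)


index = IntervalIndex()
index.add_interval(10, 20)
index.add_interval(15, 30)
overlap_count = index.count_overlaps(18, 25)
```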
Documentation¶
If you want to contribute some code, the most important kind of documentation to provide is docstrings. Sphinx documentation and in-code comments are nice to have, but not crucial.
Sphinx documentation¶
High level information and tutorials are written in the doc/ directory in reStructuredText format, and built into HTML and other formats using Sphinx. These docs are automatically mirrored to http://BioTK.readthedocs.org/ .
A useful reStructuredText primer can be found at http://docutils.sourceforge.net/docs/user/rst/quickref.html .
Docstrings¶
Modules, and public classes and functions inside them, need docstrings. Keep it high level, explaining what the module/function and the parameters are doing, not how they are doing it. Provide citations to the algorithm’s paper if appropriate. Generally speaking, the more “public” a function/class is, the more documentation it needs. If it should be rarely or never directly called by a user, it may only need one line. Conversely, a large class with many methods may need quite extensive documentation.
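As a sketch of the level of detail intended for a small public function, the docstring below describes what the function and its parameter do without discussing the implementation. It uses the NumPy section layout, with the summary on the line after the opening quotes as BioTK requires. The function `gc_content` and the wording of its sections are illustrative and not taken from BioTK.

```python
def gc_content(sequence):
    """
    The fraction of bases in a DNA sequence that are G or C.

    Hypothetical example function; not part of BioTK.

    Parameters
    ----------
    sequence : str
        A DNA sequence. Case is ignored.

    Returns
    -------
    float
        A value between 0 and 1, or 0.0 for an empty sequence.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


fraction = gc_content("ggccAT")
```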
The docstrings are written in NumPy format, with one exception: in BioTK, the opening triple quotes are always followed by a newline, so the summary starts on the second line. Thus, instead of:
    def add(a, b):
        """The sum of two numbers.
        (rest of docstring ...)
        """
we use:
    def add(a, b):
        """
        The sum of two numbers.
        (rest of docstring ...)
        """
The numpy docstring format is described here:
And many good examples are here:
Internal code comments¶
Functions and class methods should be short enough, and the code should be clear enough, that code comments should mostly be unnecessary. Use good judgment: if a particularly tricky method is being used, it may need some explanation, but in general keep comments high level.
You can mark “wishlist” items with a TODO comment, and items that are actually broken or need urgent attention with a FIXME comment (the latter should be used sparingly).
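The two markers might be used as follows. The function and its shortcomings are hypothetical, chosen only to show the comment style.

```python
def parse_bed_line(line):
    """
    Parse the first three fields of a BED line into (chrom, start, end).

    Hypothetical helper for illustration only; not a BioTK function.
    """
    # TODO: also return the optional name, score, and strand columns.
    # FIXME: lines with fewer than three fields raise an unhelpful error.
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[:3]
    return chrom, int(start), int(end)


record = parse_bed_line("chr1\t100\t200\tgeneA\n")
```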
Unit tests¶
Unit tests are written using the py.test framework and are placed in the test/ directory, whose structure mirrors that of the BioTK package.
If possible, avoid tests that take a long time to run or require network access.
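A fast, offline test in this style might look like the sketch below. The file path and the `tokenize` function are hypothetical stand-ins; a real test would import the code under test from BioTK instead of defining it locally.

```python
# Hypothetical file: test/text/test_tokenize.py
# py.test collects functions whose names start with "test_".


def tokenize(sentence):
    # Stand-in for the code under test; a real test would import it from BioTK.
    return sentence.lower().split()


def test_tokenize_lowercases_and_splits():
    assert tokenize("TP53 binds MDM2") == ["tp53", "binds", "mdm2"]


def test_tokenize_empty_string():
    assert tokenize("") == []
```

Plain `assert` statements are enough; py.test reports the compared values when an assertion fails.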