MolVS: Molecule Validation and Standardization¶
MolVS is a molecule validation and standardization tool, written in Python using the RDKit chemistry framework.
Building a collection of chemical structures from different sources can be difficult due to differing representations, drawing conventions and mistakes. MolVS can standardize chemical structures to improve data quality, help with de-duplication and identify relationships between molecules.
There are sensible defaults that make it easy to get started:
>>> from molvs import standardize_smiles
>>> standardize_smiles('[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1')
'[Na+].O=C([O-])c1ccc(CS(=O)=O)cc1'
Each standardization module is also available separately, allowing the development of custom standardization processes.
Features¶
- Normalization of functional groups to a consistent format.
- Recombination of separated charges.
- Breaking of bonds to metal atoms.
- Competitive reionization to ensure strongest acids ionize first in partially ionize molecules.
- Tautomer enumeration and canonicalization.
- Neutralization of charges.
- Standardization or removal of stereochemistry information.
- Filtering of salt and solvent fragments.
- Generation of fragment, isotope, charge, tautomer or stereochemistry insensitive parent structures.
- Validations to identify molecules with unusual and potentially troublesome characteristics.
User guide¶
A step-by-step guide to getting started with MolVS.
Introduction¶
Building a collection of chemical structures from various different sources is difficult. There are differing file formats, molecular representations, drawing conventions, and things that are just plain wrong.
A lot of this arises due to our chemical models being an imperfect description of reality, but even within the idealized models there is often no single correct answer to whether two differently represented molecules are actually “the same”. Whether tautomers or isomers of the same molecule should be considered equivalent or distinct entities can depend entirely on the specific application.
MolVS tries to address this problem through customizable validation and standardization processes, combined with the concept of “parent” molecule relationships to allow multiple simultaneous degrees of standardization.
This guide provides a quick tour through MolVS concepts and functionality.
MolVS license¶
MolVS is released under the MIT License. This is a short, permissive software license that allows commercial use, modifications, distribution, sublicensing and private use. Basically, you can do whatever you want with MolVS as long as you include the original copyright and license in any copies or derivative projects.
See the LICENSE file for the full text of the license.
Installation¶
MolVS supports Python versions 2.7 and 3.5+.
There are a variety of ways to download and install MolVS.
Option 1: Use conda (recommended)¶
The easiest and recommended way to install is using conda. Anaconda Python is a self-contained Python environment that is particularly useful for scientific applications. If you don’t already have it, start by installing Miniconda, which includes a complete Python distribution and the conda package manager. Choose the Python 3 version, unless you have a particular reason why you must use Python 2.
To install MolVS, at the command line, run:
conda config --add channels conda-forge
conda install molvs
This will add the conda-forge channel to your conda config, then install MolVS and all its dependencies into your conda environment.
Option 2: Use pip¶
An alternative method is to install using pip:
pip install molvs
This will download the latest version of MolVS, and place it in your site-packages folder so it is automatically available to all your python scripts.
Note
MolVS requires RDKit, which cannot be installed using pip. On the Mac, you can use Homebrew:
brew tap mcs07/cheminformatics
brew install rdkit
The official RDKit documentation has installation instructions for a variety of platforms.
Option 3: Download the latest release¶
Alternatively, download the latest release manually and install yourself:
tar -xzvf MolVS-0.1.1.tar.gz
cd MolVS-0.1.1
python setup.py install
The setup.py command will install MolVS in your site-packages folder so it is automatically available to all your python scripts.
Option 4: Clone the repository¶
The latest development version of MolVS is always available on GitHub. This version is not guaranteed to be stable, but may include new features that have not yet been released. Simply clone the repository and install as usual:
git clone https://github.com/mcs07/MolVS.git
cd MolVS
python setup.py install
Getting started¶
This page gives a introduction on how to get started with MolVS. This assumes you already have MolVS installed.
TODO…
Validation¶
The MolVS Validator
provides a way to identify and log unusual and potentially troublesome
characteristics of a molecule.
The validation process makes no actual changes to a molecule – that is left to the standardization process, which fixes many of the issues identified through validation. There is no real requirement to validate a molecule before or after standardizing it - the process simply provides additional information about potential problems.
Validating a molecule¶
The validate_smiles()
function is a convenient way to quickly validate a single SMILES
string:
>>> from molvs import validate_smiles
>>> validate_smiles('O=C([O-])c1ccccc1')
['INFO: [NeutralValidation] Not an overall neutral system (-1)']
It returns a list of log messages as strings.
The Validator
class provides more flexibility when working with multiple molecules or when a
custom Validation
list is required:
>>> fmt = '%(asctime)s - %(levelname)s - %(validation)s - %(message)s'
>>> validator = Validator(log_format=fmt)
>>> mol = Chem.MolFromSmiles('[2H]C(Cl)(Cl)Cl')
>>> validator.validate(mol)
['2014-08-05 16:04:23,682 - INFO - IsotopeValidation - Molecule contains isotope 2H']
Available validations¶
The API documentation contains a full list of the individual validations that are available.
Standardization¶
This page gives details on the standardization process.
Standardizing a molecule¶
The standardize_smiles
function provides a quick and easy way to get the standardized version of a given SMILES
string:
>>> from molvs import standardize_smiles
>>> standardize_smiles('C[n+]1c([N-](C))cccc1')
'CN=c1ccccn1C'
While this is convenient for one-off cases, it’s inefficient when dealing with multiple molecules and doesn’t allow any customization of the standardization process.
The Standardizer
class provides flexibility to specify custom standardization stages and efficiently standardize
multiple molecules:
>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles('[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1')
>>> from molvs import Standardizer
>>> s = Standardizer()
>>> smol = s.standardize(mol)
The standardization process¶
TODO: Explain this properly…
RDKit Sanitize¶
- Nitro N=O:
CN(=O)=O >> C[N+](=O)[O-]
andC1=CC=CN(=O)=C1 >> C1=CC=C[N+]([O-])=C1
- Nitro N#O:
C-N=N#N >> C-N=[N+]=[N-]
- Perchlorate:
Cl(=O)(=O)(=O)[O-] >> [Cl+3]([O-])([O-])([O-])[O-]
- Calculate explicit and implicit valence of all atoms. Fails when atoms have illegal valence.
- Calculate symmetrized SSSR. Slowest step, fails in rare cases.
- Kekulize. Fails if a Kekule form cannot be found or non-ring bonds are marked as aromatic.
- Assign radicals if hydrogens set and bonds+hydrogens+charge < valence.
- Set aromaticity, if none set in input. Go round rings, Huckel rule to set atoms+bonds as aromatic.
- Set conjugated property on bonds where applicable.
- Set hybridisation property on atoms.
- Remove chirality markers from sp and sp2 hybridised centers.
RDKit RemoveHs¶
- RDKit implementation detail - this is the preferred way to store the molecule.
- Remove explicit H count from atoms, instead infer it on the fly from valence model.
Disconnect metals¶
- Break covalent bonds between metals and organic atoms under certain conditions.
- First, disconnect N, O, F from any metal. Then disconnect other non-metals from transition metals (with exceptions).
- For every bond broken, adjust the charges of the begin and end atoms accordingly.
- In future, we might attempt to replace with zero-order bonds.
Apply normalization rules¶
- A series of transformations to correct common drawing errors and standardize functional groups. Includes:
- Uncharge-separate sulfones
- Charge-separate nitro groups
- Charge-separate pyridine oxide
- Charge-separate azide
- Charge-separate diazo and azo groups
- Charge-separate sulfoxides
- Hydrazine-diazonium system
Reionize acids¶
If molecule with multiple acid groups is partially ionized, ensure strongest acids ionize first.
The algorithm works as follows:
- Use SMARTS to find the strongest protonated acid and the weakest ionized acid.
- If the ionized acid is weaker than the protonated acid, swap proton and repeat.
Recalculate stereochemistry¶
- Use built-in RDKit functionality to force a clean recalculation of stereochemistry
Tautomers¶
This page gives details on tautomer enumeration and canonicalization.
Background¶
Tautomers are sets of molecules that readily interconvert with each other through the movement of a hydrogen atom. Tautomers have the same molecular formula and net charge, but they differ in terms of the positions of hydrogens and the associated changes in adjacent double and single bonds.
Because they rapidly interconvert, for many applications tautomers are considered to be the same chemical compound. And even in situations where it is important to treat tautomers as distinct compounds, it is still useful to be aware of the tautomerism relationships between molecules in a collection.
Varying tautomeric forms of the same molecule can have significantly different fingerprints and descriptors, which can negatively impact models for things like property prediction if they are used inconsistently.
There are two main tautomerism tasks that MolVS carries out:
- Tautomer enumeration: Finding the set of all the different possible tautomeric forms of a molecule.
- Tautomer canonicalization: Consistently picking one of the tautomers to be the canonical tautomer for the set.
Tautomer enumeration¶
- All possible tautomers are generated using a series of transform rules.
- Remove stereochemistry from double bonds that are single in at least 1 tautomer.
Tautomer canonicalization¶
- Enumerate all possible tautomers using transform rules.
- Use scoring system to determine canonical tautomer.
- Canonical tautomer should be “reasonable” from a chemist’s point of view, but isn’t guaranteed to be the most energetically favourable.
Fragments¶
This page gives details on dealing with fragments.
The term fragment refers to covalently bonded units. A molecule can contain multiple fragments.
Getting the largest fragment¶
- LargestFragmentChooser
Filtering out fragments¶
- FragmentRemover
Charges¶
This page gives details on dealing with charges in molecules.
Acid reionization¶
- Ensure the strongest acid groups ionize first in partially ionized molecules.
Neutralization¶
- Attempt to neutralize charges by adding and/or removing hydrogens where possible.
- Not always possible to produce a neutral molecule.
Command Line Tool¶
MolVS comes with a simple command line tool that allows standardization and validation by typing molvs
at the
command line.
Standardization¶
See standardization help by typing molvs standardize -h
:
usage: molvs standardize [infile] [-i {smi,mol,sdf}] [-O <outfile>]
[-o {smi,mol,sdf}] [-: <smiles>]
positional arguments:
infile input filename
optional arguments:
-i {smi,mol,sdf}, --intype {smi,mol,sdf}
input filetype
-: <smiles>, --smiles <smiles>
input SMILES instead of file
-O <outfile>, --outfile <outfile>
output filename
-o {smi,mol,sdf}, --outtype {smi,mol,sdf}
output filetype
Validation¶
See validation help by typing molvs validate -h
:
usage: molvs validate [infile] [-i {smi,mol,sdf}] [-O <outfile>]
[-: <smiles>]
positional arguments:
infile input filename
optional arguments:
-i {smi,mol,sdf}, --intype {smi,mol,sdf}
input filetype
-: <smiles>, --smiles <smiles>
input SMILES instead of file
-O <outfile>, --outfile <outfile>
output filename
Examples¶
SMILES standardization:
$ molvs standardize -:"C[n+]1c([N-](C))cccc1"
CN=c1ccccn1C
Specifying an output format:
$ molvs standardize -:"[N](=O)(=O)O" -o mol
RDKit
4 3 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
1 3 2 0
1 4 1 0
M CHG 2 1 1 2 -1
M END
Using stdin:
$ echo "C[n+]1c([N-](C))cccc1" | molvs standardize
CN=c1ccccn1C
Specifying an input file:
$ molvs standardize example.mol
CN=c1ccccn1C
Specifying an output file:
$ molvs standardize example.mol -O output.smi
$ molvs standardize example.mol -O output.mol
$ molvs standardize example.mol -O output -o mol
Logging validations to stdout:
$ molvs validate -:"O=C([O-])c1ccccc1"
INFO: [NeutralValidation] Not an overall neutral system (-1)
Logging validations to a file:
$ molvs validate -:"O=C([O-])c1ccccc1" -O logs.txt
Contributing¶
Contributions of any kind are greatly appreciated!
Feedback¶
The Issue Tracker is the best place to post any feature ideas, requests and bug reports.
The following are especially welcome:
- General feedback on whether any standardization stages should work differently.
- Specific molecules that don’t validate or standardize as expected.
- Ideas for new validation and standardization stages.
Contributing¶
If you are able to contribute changes yourself, just fork the source code on GitHub, make changes and file a pull request. All contributions are welcome, no matter how big or small.
The following are especially welcome:
- New validation or standardization stages.
- Alternative tautomer transforms and scores.
- Lists of salts and solvents to strip out.
- New or improved documentation of existing features.
Quick guide to contributing¶
Fork the MolVS repository on GitHub, then clone your fork to your local machine:
git clone https://github.com/<username>/MolVS.git cd molvs
Install the development requirements into a conda environment:
conda env create -n molvs -f environment.yml source activate molvs
Create a new branch for your changes:
git checkout -b <name-for-changes>
Make your changes or additions. Ideally add some tests and ensure they pass by running:
pytest
Commit your changes and push to your fork on GitHub:
git add . git commit -m "<description-of-changes>" git push origin <name-for-changes>
Tips¶
- Follow the PEP8 style guide.
- Include docstrings as described in PEP257.
- Try and include tests that cover your changes.
- Try to write good commit messages.
- Consider squashing your commits with rebase.
- Read the GitHub help page on Using pull requests.
API documentation¶
Comprehensive API documentation with information on every function, class and method. This is automatically generated from the MolVS source code and comments.
API documentation¶
This part of the documentation is automatically generated from the MolVS source code and comments.
The MolVS package is made up of the following modules:
molvs.standardize¶
This module contains the main Standardizer
class that can be used to perform all
standardization tasks, as well as convenience functions like standardize_smiles()
for common
standardization tasks.
-
class
molvs.standardize.
Standardizer
(normalizations=NORMALIZATIONS, acid_base_pairs=ACID_BASE_PAIRS, tautomer_transforms=TAUTOMER_TRANSFORMS, tautomer_scores=TAUTOMER_SCORES, max_restarts=MAX_RESTARTS, max_tautomers=MAX_TAUTOMERS, prefer_organic=PREFER_ORGANIC)[source]¶ The main class for performing standardization of molecules and deriving parent molecules.
The primary usage is via the
standardize()
method:s = Standardizer() mol1 = Chem.MolFromSmiles('C1=CC=CC=C1') mol2 = s.standardize(mol1)
There are separate methods to derive fragment, charge, tautomer, isotope and stereo parent molecules.
Initialize a Standardizer with optional custom parameters.
Parameters: - normalizations – A list of Normalizations to apply (default:
NORMALIZATIONS
). - acid_base_pairs – A list of AcidBasePairs for competitive reionization (default:
ACID_BASE_PAIRS
). - charge_corrections – A list of ChargeCorrections to apply (default:
CHARGE_CORRECTIONS
). - tautomer_transforms – A list of TautomerTransforms to apply (default:
TAUTOMER_TRANSFORMS
). - tautomer_scores – A list of TautomerScores used to determine canonical tautomer (default:
TAUTOMER_SCORES
). - max_restarts – The maximum number of times to attempt to apply the series of normalizations (default 200).
- max_tautomers – The maximum number of tautomers to enumerate (default 1000).
- prefer_organic – Whether to prioritize organic fragments when choosing fragment parent (default False).
-
__call__
(mol)[source]¶ Calling a Standardizer instance like a function is the same as calling its
standardize()
method.
-
standardize
(mol)[source]¶ Return a standardized version the given molecule.
The standardization process consists of the following stages: RDKit
RemoveHs()
, RDKitSanitizeMol()
,MetalDisconnector
,Normalizer
,Reionizer
, RDKitAssignStereochemistry()
.Parameters: mol (rdkit.Chem.rdchem.Mol) – The molecule to standardize. Returns: The standardized molecule. Return type: rdkit.Chem.rdchem.Mol
-
tautomer_parent
(mol, skip_standardize=False)[source]¶ Return the tautomer parent of a given molecule.
Parameters: - mol (rdkit.Chem.rdchem.Mol) – The input molecule.
- skip_standardize (bool) – Set to True if mol has already been standardized.
Returns: The tautomer parent molecule.
Return type:
-
fragment_parent
(mol, skip_standardize=False)[source]¶ Return the fragment parent of a given molecule.
The fragment parent is the largest organic covalent unit in the molecule.
Parameters: - mol (rdkit.Chem.rdchem.Mol) – The input molecule.
- skip_standardize (bool) – Set to True if mol has already been standardized.
Returns: The fragment parent molecule.
Return type:
-
stereo_parent
(mol, skip_standardize=False)[source]¶ Return the stereo parent of a given molecule.
The stereo parent has all stereochemistry information removed from tetrahedral centers and double bonds.
Parameters: - mol (rdkit.Chem.rdchem.Mol) – The input molecule.
- skip_standardize (bool) – Set to True if mol has already been standardized.
Returns: The stereo parent molecule.
Return type:
-
isotope_parent
(mol, skip_standardize=False)[source]¶ Return the isotope parent of a given molecule.
The isotope parent has all atoms replaced with the most abundant isotope for that element.
Parameters: - mol (rdkit.Chem.rdchem.Mol) – The input molecule.
- skip_standardize (bool) – Set to True if mol has already been standardized.
Returns: The isotope parent molecule.
Return type:
-
charge_parent
(mol, skip_standardize=False)[source]¶ Return the charge parent of a given molecule.
The charge parent is the uncharged version of the fragment parent.
Parameters: - mol (rdkit.Chem.rdchem.Mol) – The input molecule.
- skip_standardize (bool) – Set to True if mol has already been standardized.
Returns: The charge parent molecule.
Return type:
-
super_parent
(mol, skip_standardize=False)[source]¶ Return the super parent of a given molecule.
THe super parent is fragment, charge, isotope, stereochemistry and tautomer insensitive. From the input molecule, the largest fragment is taken. This is uncharged and then isotope and stereochemistry information is discarded. Finally, the canonical tautomer is determined and returned.
Parameters: - mol (rdkit.Chem.rdchem.Mol) – The input molecule.
- skip_standardize (bool) – Set to True if mol has already been standardized.
Returns: The super parent molecule.
Return type:
-
disconnect_metals
¶ Returns: A callable MetalDisconnector
instance.
-
normalize
¶ Returns: A callable Normalizer
instance.
-
remove_fragments
¶ Returns: A callable FragmentRemover
instance.
-
largest_fragment
¶ Returns: A callable LargestFragmentChooser
instance.
-
enumerate_tautomers
¶ Returns: A callable TautomerEnumerator
instance.
-
canonicalize_tautomer
¶ Returns: A callable TautomerCanonicalizer
instance.
- normalizations – A list of Normalizations to apply (default:
-
molvs.standardize.
standardize_smiles
(smiles)[source]¶ Return a standardized canonical SMILES string given a SMILES string.
Note: This is a convenience function for quickly standardizing a single SMILES string. It is more efficient to use the
Standardizer
class directly when working with many molecules or when custom options are needed.Parameters: smiles (string) – The SMILES for the molecule. Returns: The SMILES for the standardized molecule. Return type: string.
-
molvs.standardize.
enumerate_tautomers_smiles
(smiles)[source]¶ Return a set of tautomers as SMILES strings, given a SMILES string.
Parameters: smiles – A SMILES string. Returns: A set containing SMILES strings for every possible tautomer. Return type: set of strings.
-
molvs.standardize.
canonicalize_tautomer_smiles
(smiles)[source]¶ Return a standardized canonical tautomer SMILES string given a SMILES string.
Note: This is a convenience function for quickly standardizing and finding the canonical tautomer for a single SMILES string. It is more efficient to use the
Standardizer
class directly when working with many molecules or when custom options are needed.Parameters: smiles (string) – The SMILES for the molecule. Returns: The SMILES for the standardize canonical tautomer. Return type: string.
molvs.normalize¶
This module contains tools for normalizing molecules using reaction SMARTS patterns.
-
molvs.normalize.
NORMALIZATIONS
¶ The default list of Normalization transforms.
-
molvs.normalize.
MAX_RESTARTS
= 200¶ The default value for the maximum number of times to attempt to apply the series of normalizations.
-
class
molvs.normalize.
Normalization
(name, transform)[source]¶ A normalization transform defined by reaction SMARTS.
Parameters: - name (string) – A name for this Normalization
- transform (string) – Reaction SMARTS to define the transformation.
-
class
molvs.normalize.
Normalizer
(normalizations=NORMALIZATIONS, max_restarts=MAX_RESTARTS)[source]¶ A class for applying Normalization transforms.
This class is typically used to apply a series of Normalization transforms to correct functional groups and recombine charges. Each transform is repeatedly applied until no further changes occur.
Initialize a Normalizer with an optional custom list of
Normalization
transforms.Parameters: - normalizations – A list of
Normalization
transforms to apply. - max_restarts (int) – The maximum number of times to attempt to apply the series of normalizations (default 200).
-
__call__
(mol)[source]¶ Calling a Normalizer instance like a function is the same as calling its normalize(mol) method.
-
normalize
(mol)[source]¶ Apply a series of Normalization transforms to correct functional groups and recombine charges.
A series of transforms are applied to the molecule. For each Normalization, the transform is applied repeatedly until no further changes occur. If any changes occurred, we go back and start from the first Normalization again, in case the changes mean an earlier transform is now applicable. The molecule is returned once the entire series of Normalizations cause no further changes or if max_restarts (default 200) is reached.
Parameters: mol (rdkit.Chem.rdchem.Mol) – The molecule to normalize. Returns: The normalized fragment. Return type: rdkit.Chem.rdchem.Mol
- normalizations – A list of
molvs.metal¶
This module contains tools for disconnecting metal atoms that are defined as covalently bonded to non-metals.
-
class
molvs.metal.
MetalDisconnector
[source]¶ Class for breaking covalent bonds between metals and organic atoms under certain conditions.
-
__call__
(mol)[source]¶ Calling a MetalDisconnector instance like a function is the same as calling its disconnect(mol) method.
-
disconnect
(mol)[source]¶ Break covalent bonds between metals and organic atoms under certain conditions.
The algorithm works as follows:
- Disconnect N, O, F from any metal.
- Disconnect other non-metals from transition metals + Al (but not Hg, Ga, Ge, In, Sn, As, Tl, Pb, Bi, Po).
- For every bond broken, adjust the charges of the begin and end atoms accordingly.
Parameters: mol (rdkit.Chem.rdchem.Mol) – The input molecule. Returns: The molecule with metals disconnected. Return type: rdkit.Chem.rdchem.Mol
-
molvs.tautomer¶
This module contains tools for enumerating tautomers and determining a canonical tautomer.
-
molvs.tautomer.
TAUTOMER_TRANSFORMS
¶ The default list of TautomerTransforms.
-
molvs.tautomer.
TAUTOMER_SCORES
¶ The default list of TautomerScores.
-
molvs.tautomer.
MAX_TAUTOMERS
= 1000¶ The default value for the maximum number of tautomers to enumerate, a limit to prevent combinatorial explosion.
-
class
molvs.tautomer.
TautomerTransform
(name, smarts, bonds=(), charges=(), radicals=())[source]¶ Rules to transform one tautomer to another.
Each TautomerTransform is defined by a SMARTS pattern where the transform involves moving a hydrogen from the first atom in the pattern to the last atom in the pattern. By default, alternating single and double bonds along the pattern are swapped accordingly to account for the hydrogen movement. If necessary, the transform can instead define custom resulting bond orders and also resulting atom charges.
Initialize a TautomerTransform with a name, SMARTS pattern and optional bonds and charges.
The SMARTS pattern match is applied to a Kekule form of the molecule, so use explicit single and double bonds rather than aromatic.
Specify custom bonds as a string of
-
,=
,#
,:
for single, double, triple and aromatic bonds respectively. Specify custom charges as+
,0
,-
for +1, 0 and -1 charges respectively.Parameters: - name (string) – A name for this TautomerTransform.
- smarts (string) – SMARTS pattern to match for the transform.
- bonds (string) – Optional specification for the resulting bonds.
- charges (string) – Optional specification for the resulting charges on the atoms.
-
class
molvs.tautomer.
TautomerScore
(name, smarts, score)[source]¶ A substructure defined by SMARTS and its score contribution to determine the canonical tautomer.
Initialize a TautomerScore with a name, SMARTS pattern and score.
Parameters: - name – A name for this TautomerScore.
- smarts – SMARTS pattern to match a substructure.
- score – The score to assign for this substructure.
-
class
molvs.tautomer.
TautomerCanonicalizer
(transforms=TAUTOMER_TRANSFORMS, scores=TAUTOMER_SCORES, max_tautomers=MAX_TAUTOMERS)[source]¶ Parameters: - transforms – A list of TautomerTransforms to use to enumerate tautomers.
- scores – A list of TautomerScores to use to choose the canonical tautomer.
- max_tautomers – The maximum number of tautomers to enumerate, a limit to prevent combinatorial explosion.
-
__call__
(mol)[source]¶ Calling a TautomerCanonicalizer instance like a function is the same as calling its canonicalize(mol) method.
-
canonicalize
(mol)[source]¶ Return a canonical tautomer by enumerating and scoring all possible tautomers.
Parameters: mol (rdkit.Chem.rdchem.Mol) – The input molecule. Returns: The canonical tautomer. Return type: rdkit.Chem.rdchem.Mol
-
class
molvs.tautomer.
TautomerEnumerator
(transforms=TAUTOMER_TRANSFORMS, max_tautomers=MAX_TAUTOMERS)[source]¶ Parameters: - transforms – A list of TautomerTransforms to use to enumerate tautomers.
- max_tautomers – The maximum number of tautomers to enumerate (limit to prevent combinatorial explosion).
-
__call__
(mol)[source]¶ Calling a TautomerEnumerator instance like a function is the same as calling its enumerate(mol) method.
-
enumerate
(mol)[source]¶ Enumerate all possible tautomers and return them as a list.
Parameters: mol (rdkit.Chem.rdchem.Mol) – The input molecule. Returns: A list of all possible tautomers of the molecule. Return type: list of rdkit.Chem.rdchem.Mol
molvs.fragment¶
This module contains tools for dealing with molecules with more than one covalently bonded unit. The main classes are
LargestFragmentChooser
, which returns the largest covalent unit in a molecule, and
FragmentRemover
, which filters out fragments from a molecule using SMARTS patterns.
-
molvs.fragment.
REMOVE_FRAGMENTS
¶ The default list of
FragmentPatterns
to be used byFragmentRemover
.
-
molvs.fragment.
LEAVE_LAST
= True¶ The default value for whether to ensure at least one fragment is left after FragmentRemover is applied.
-
molvs.fragment.
PREFER_ORGANIC
= False¶ The default value for whether LargestFragmentChooser sees organic fragments as “larger” than inorganic fragments.
-
class
molvs.fragment.
FragmentPattern
(name, smarts)[source]¶ A fragment defined by a SMARTS pattern.
Initialize a FragmentPattern with a name and a SMARTS pattern.
Parameters: - name – A name for this FragmentPattern.
- smarts – A SMARTS pattern.
-
molvs.fragment.
is_organic
(fragment)[source]¶ Return true if fragment contains at least one carbon atom.
Parameters: fragment – The fragment as an RDKit Mol object.
-
class
molvs.fragment.
FragmentRemover
(fragments=REMOVE_FRAGMENTS, leave_last=LEAVE_LAST)[source]¶ A class for filtering out fragments using SMARTS patterns.
Initialize a FragmentRemover with an optional custom list of
FragmentPattern
.Setting leave_last to True will ensure at least one fragment is left in the molecule, even if it is matched by a
FragmentPattern
. Fragments are removed in the order specified in the list, so place those you would prefer to be left towards the end of the list. If all the remaining fragments match the sameFragmentPattern
, they will all be left.Parameters: - fragments – A list of
FragmentPattern
to remove. - leave_last (bool) – Whether to ensure at least one fragment is left.
-
__call__
(mol)[source]¶ Calling a FragmentRemover instance like a function is the same as calling its remove(mol) method.
-
remove
(mol)[source]¶ Return the molecule with specified fragments removed.
Parameters: mol (rdkit.Chem.rdchem.Mol) – The molecule to remove fragments from. Returns: The molecule with fragments removed. Return type: rdkit.Chem.rdchem.Mol
- fragments – A list of
-
class
molvs.fragment.
LargestFragmentChooser
(prefer_organic=PREFER_ORGANIC)[source]¶ A class for selecting the largest covalent unit in a molecule with multiple fragments.
If prefer_organic is set to True, any organic fragment will be considered larger than any inorganic fragment. A fragment is considered organic if it contains a carbon atom.
Parameters: prefer_organic (bool) – Whether to prioritize organic fragments above all others. -
__call__
(mol)[source]¶ Calling a LargestFragmentChooser instance like a function is the same as calling its choose(mol) method.
-
choose
(mol)[source]¶ Return the largest covalent unit.
The largest fragment is determined by number of atoms (including hydrogens). Ties are broken by taking the fragment with the higher molecular weight, and then by taking the first alphabetically by SMILES if needed.
Parameters: mol (rdkit.Chem.rdchem.Mol) – The molecule to choose the largest fragment from. Returns: The largest fragment. Return type: rdkit.Chem.rdchem.Mol
-
molvs.charge¶
This module implements tools for manipulating charges on molecules. In particular, Reionizer
,
which competitively reionizes acids such that the strongest acids ionize first, and Uncharger
,
which attempts to neutralize ionized acids and bases on a molecule.
-
molvs.charge.
ACID_BASE_PAIRS
¶ The default list of AcidBasePairs, sorted from strongest to weakest. This list is derived from the Food and Drug Administration Substance Registration System Standard Operating Procedure guide.
-
class
molvs.charge.
AcidBasePair
(name, acid, base)[source]¶ An acid and its conjugate base, defined by SMARTS.
A strength-ordered list of AcidBasePairs can be used to ensure the strongest acids in a molecule ionize first.
Initialize an AcidBasePair with the following parameters:
Parameters: - name (string) – A name for this AcidBasePair.
- acid (string) – SMARTS pattern for the protonated acid.
- base (string) – SMARTS pattern for the conjugate ionized base.
-
class
molvs.charge.
Reionizer
(acid_base_pairs=ACID_BASE_PAIRS)[source]¶ A class to fix charges and reionize a molecule such that the strongest acids ionize first.
Initialize a Reionizer with the following parameter:
Parameters: - acid_base_pairs – A list of
AcidBasePairs
to reionize, sorted from strongest to weakest. - charge_corrections – A list of
ChargeCorrections
.
-
__call__
(mol)[source]¶ Calling a Reionizer instance like a function is the same as calling its reionize(mol) method.
-
reionize
(mol)[source]¶ Enforce charges on certain atoms, then perform competitive reionization.
First, charge corrections are applied to ensure, for example, that free metals are correctly ionized. Then, if a molecule with multiple acid groups is partially ionized, ensure the strongest acids ionize first.
The algorithm works as follows:
- Use SMARTS to find the strongest protonated acid and the weakest ionized acid.
- If the ionized acid is weaker than the protonated acid, swap proton and repeat.
Parameters: mol (rdkit.Chem.rdchem.Mol) – The molecule to reionize. Returns: The reionized molecule. Return type: rdkit.Chem.rdchem.Mol
- acid_base_pairs – A list of
-
class
molvs.charge.
Uncharger
(acid_base_pairs=(AcidBasePair('-OSO3H', 'OS(=O)(=O)[OH]', 'OS(=O)(=O)[O-]'), AcidBasePair('–SO3H', '[!O]S(=O)(=O)[OH]', '[!O]S(=O)(=O)[O-]'), AcidBasePair('-OSO2H', 'O[SD3](=O)[OH]', 'O[SD3](=O)[O-]'), AcidBasePair('-SO2H', '[!O][SD3](=O)[OH]', '[!O][SD3](=O)[O-]'), AcidBasePair('-OPO3H2', 'OP(=O)([OH])[OH]', 'OP(=O)([OH])[O-]'), AcidBasePair('-PO3H2', '[!O]P(=O)([OH])[OH]', '[!O]P(=O)([OH])[O-]'), AcidBasePair('-CO2H', 'C(=O)[OH]', 'C(=O)[O-]'), AcidBasePair('thiophenol', 'c[SH]', 'c[S-]'), AcidBasePair('(-OPO3H)-', 'OP(=O)([O-])[OH]', 'OP(=O)([O-])[O-]'), AcidBasePair('(-PO3H)-', '[!O]P(=O)([O-])[OH]', '[!O]P(=O)([O-])[O-]'), AcidBasePair('phthalimide', 'O=C2c1ccccc1C(=O)[NH]2', 'O=C2c1ccccc1C(=O)[N-]2'), AcidBasePair('CO3H (peracetyl)', 'C(=O)O[OH]', 'C(=O)O[O-]'), AcidBasePair('alpha-carbon-hydrogen-nitro group', 'O=N(O)[CH]', 'O=N(O)[C-]'), AcidBasePair('-SO2NH2', 'S(=O)(=O)[NH2]', 'S(=O)(=O)[NH-]'), AcidBasePair('-OBO2H2', 'OB([OH])[OH]', 'OB([OH])[O-]'), AcidBasePair('-BO2H2', '[!O]B([OH])[OH]', '[!O]B([OH])[O-]'), AcidBasePair('phenol', 'c[OH]', 'c[O-]'), AcidBasePair('SH (aliphatic)', 'C[SH]', 'C[S-]'), AcidBasePair('(-OBO2H)-', 'OB([O-])[OH]', 'OB([O-])[O-]'), AcidBasePair('(-BO2H)-', '[!O]B([O-])[OH]', '[!O]B([O-])[O-]'), AcidBasePair('cyclopentadiene', 'C1=CC=C[CH2]1', 'c1ccc[cH-]1'), AcidBasePair('-CONH2', 'C(=O)[NH2]', 'C(=O)[NH-]'), AcidBasePair('imidazole', 'c1cnc[nH]1', 'c1cnc[n-]1'), AcidBasePair('-OH (aliphatic alcohol)', '[CX4][OH]', '[CX4][O-]'), AcidBasePair('alpha-carbon-hydrogen-keto group', 'O=C([!O])[C!H0+0]', 'O=C([!O])[C-]'), AcidBasePair('alpha-carbon-hydrogen-acetyl ester group', 'OC(=O)[C!H0+0]', 'OC(=O)[C-]'), AcidBasePair('sp carbon hydrogen', 'C#[CH]', 'C#[C-]'), AcidBasePair('alpha-carbon-hydrogen-sulfone group', 'CS(=O)(=O)[C!H0+0]', 'CS(=O)(=O)[C-]'), AcidBasePair('alpha-carbon-hydrogen-sulfoxide group', 'C[SD3](=O)[C!H0+0]', 'C[SD3](=O)[C-]'), AcidBasePair('-NH2', '[CX4][NH2]', '[CX4][NH-]'), AcidBasePair('benzyl hydrogen', 'c[CX4H2]', 'c[CX3H-]'), AcidBasePair('sp2-carbon hydrogen', '[CX3]=[CX3!H0+0]', '[CX3]=[CX2-]'), AcidBasePair('sp3-carbon hydrogen', '[CX4!H0+0]', '[CX3-]')))[source]¶ Class for neutralizing charges in a molecule.
This class uncharges molecules by adding and/or removing hydrogens. In cases where there is a positive charge that is not neutralizable, any corresponding negative charge is also preserved.
-
__call__
(mol)[source]¶ Calling an Uncharger instance like a function is the same as calling its uncharge(mol) method.
-
uncharge
(mol)[source]¶ Neutralize molecule by adding/removing hydrogens.
Parameters: mol (rdkit.Chem.rdchem.Mol) – The molecule to uncharge. Returns: The uncharged molecule. Return type: rdkit.Chem.rdchem.Mol
-
molvs.validate¶
This module contains the main Validator
class that can be used to perform all
Validations
, as well as the validate_smiles()
convenience function.
-
molvs.validate.
SIMPLE_FORMAT
= '%(levelname)s: [%(validation)s] %(message)s'¶ The default format for log messages.
-
molvs.validate.
LONG_FORMAT
= '%(asctime)s - %(levelname)s - %(validation)s - %(message)s'¶ A more detailed format for log messages. Specify when initializing a Validator.
-
class
molvs.validate.
Validator
(validations=VALIDATIONS, log_format=SIMPLE_FORMAT, level=logging.INFO, stdout=False, raw=False)[source]¶ The main class for running
Validations
on molecules.Initialize a Validator with the following parameters:
Parameters: - validations – A list of Validations to apply (default:
VALIDATIONS
). - log_format (string) – A string format (default:
SIMPLE_FORMAT
). - level – The minimum logging level to output.
- stdout (bool) – Whether to send log messages to standard output.
- raw (bool) – Whether to return raw
LogRecord
objects instead of formatted log strings.
- validations – A list of Validations to apply (default:
-
molvs.validate.
validate_smiles
(smiles)[source]¶ Return log messages for a given SMILES string using the default validations.
Note: This is a convenience function for quickly validating a single SMILES string. It is more efficient to use the
Validator
class directly when working with many molecules or when custom options are needed.Parameters: smiles (string) – The SMILES for the molecule. Returns: A list of log messages. Return type: list of strings.
molvs.validations¶
This module contains all the built-in Validations
.
-
molvs.validations.
VALIDATIONS
¶ The default list of
Validations
used byValidator
.
-
class
molvs.validations.
Validation
(log)[source]¶ The base class that all
Validation
subclasses must inherit from.
-
class
molvs.validations.
SmartsValidation
(log)[source]¶ Abstract superclass for
Validations
that log a message if a SMARTS pattern matches the molecule.Subclasses can override the following attributes:
-
level
= 20¶ The logging level of the message.
-
message
= 'Molecule matched %(smarts)s'¶ The message to log if the SMARTS pattern matches the molecule.
-
entire_fragment
= False¶ Whether the SMARTS pattern should match an entire covalent unit.
-
smarts
¶ The SMARTS pattern as a string. Subclasses must implement this.
-
-
class
molvs.validations.
IsNoneValidation
(log)[source]¶ Logs an error if
None
is passed to the Validator.This can happen if RDKit failed to parse an input format. If the molecule is
None
, no subsequent validations will run.
-
class
molvs.validations.
NoAtomValidation
(log)[source]¶ Logs an error if the molecule has zero atoms.
If the molecule has no atoms, no subsequent validations will run.
-
class
molvs.validations.
DichloroethaneValidation
(log)[source]¶ Logs if 1,2-dichloroethane is present.
This is provided as an example of how to subclass
SmartsValidation
to check for the presence of a substructure.
-
class
molvs.validations.
FragmentValidation
(log)[source]¶ Logs if certain fragments are present.
Subclass and override the
fragments
class attribute to customize the list ofFragmentPatterns
.
molvs.errors¶
This module contains exceptions that are raised by MolVS.