2. Tutorial¶
This section gives a comprehensive overview of both the
morphing algorithm itself as well as the
software features currently available
By writing and commenting on an example Python script (located in the Examples
package
and available for download here
)
we present the building blocks that anyone can easily put together to implement their own Molpher.
However, not all of the features and settings will be presented and we encourage you to take a look at the Source Code Documentation if you want to know more about the implementation itself or more advanced topics.
Note
We do not provide usage examples or a description of the C++ API in this tutorial,
but the Python interface actually follows the C++ implementation very closely
and the classes from the core
package are in fact derived from the proxy
classes generated automatically by SWIG when the C++ API is
wrapped for Python.
Table of Contents
2.1. The Morphing Algorithm¶
Let us first shortly describe how molecular morphing works. This should help anyone who has never heard about the method to understand it and get an idea on what this library can do for them.
In order to find a path that connects the source molecule and the target molecule (see Introduction if you are not familiar with these terms), Molpher maintains a data structure called an exploration tree to save and evaluate possible paths in chemical space. It is basically a directed rooted tree (with the edges directed away from the root) where the source molecule acts as the root vertex.
During molecular morphing, the tree is grown by modifying the current leaves with predefined chemical operators. These operators act on an input structure by producing a new molecule as output, which is structurally very close to its parent compound. We can also see chemical operators as the labels of the edges of the tree that connect two compounds between which a transformation occurred, the parent and the child.
By connecting the newly created molecules to the tree as the children of their parents, we can iteratively generate multiple chemical space paths at once and evaluate them according to a certain objective function. At the moment the objective function is simply the structural distance between the newest molecule on a path and the target molecule, but the user of the library is free to define his own criteria. For example, he/she can easily implement a method that will only extend paths that lead to compounds satisfying a given pharmacophore or physicochemical properties.
By continuously morphing the compounds in this way, we can effectively ‘travel’ through chemical space towards various areas of interest. Thanks to the flexible API of the library this ‘journey’ can be realized in many ways and can accommodate almost any exploration strategy one might think of. For example, in this tutorial we will, among other things, show how to implement our own building blocks to use with the library or make two trees cooperate with each other as they search for a single path.
2.2. Creating an Exploration Tree and Setting Morphing Parameters¶
Using the library the tree is represented by
an instance of the molpher.core.ExplorationTree
class and in this tutorial
we will use it to search for a chemical space path
between cocaine (a famous recreational drug) and procaine (compound that replaced cocaine
as local anesthetic). Both of these compounds act as blockers of the sodium channels in the neuronal cell membrane
which results in their anesthetic effects.
Let us now explore the chemical space ‘between’ cocaine and procaine by combining their structural features using the Molpher-lib library. We will initialize the exploration tree first:
1 2 3 4 5 6 | from molpher.core import ExplorationTree as ETree
cocaine = 'CN1[C@H]2CC[C@@H]1[C@@H](C(=O)OC)[C@@H](OC(=O)c1ccccc1)C2'
procaine = 'O=C(OCCN(CC)CC)c1ccc(N)cc1'
tree = ETree.create(source=cocaine, target=procaine) # initialize a tree that searches for a path from cocaine to procaine
|
The code shown in Listing 2.1 simply initializes the tree from the supplied SMILES. At the moment the tree is pretty simple. It only contains the root molecule (cocaine in this particular instance). We can manipulate this instance and read data from it in multiple ways, but let’s start by printing out the source molecule and target molecule of the tree:
1 2 3 | # print the smiles of the source and target
print('Source: ', tree.params['source'])
print('Target: ', tree.params['target'])
|
If we run the code from both Listing 2.1 and Listing 2.2 in a script or command line, it produces the following output:
loading SAScore.dat ... done
Source: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
Target: CCN(CC)CCOC(=O)C1=CC=C(N)C=C1
Notice that the SMILES strings we supplied to the create()
method of the ExplorationTree
class are not the same as those
that we supplied. That is because Molpher-lib automatically generates canonical SMILES for every
molecule that is created.
Attention
The library also discards any information about the stereocenters in the molecules, because the current implementation does not account for stereochemistry and treats all enantiomers as the same molecule. You might want to keep this in mind when working with the library.
Note
Besides the information about our source and target, we can also see that a data file was loaded successfully.
That means the molpher
package was initialized successfully and is ready for use. The data file itself is
used to compute the synthetic feasibility scores for the generated morphs
(can be read from any molecule by reading its sascore
attribute).
The ExplorationTree.params
dictionary does not just store the
source and target, but also houses other morphing parameters
. Let’s take a look:
print(tree.params)
Output:
{
'max_morphs_total': 1500,
'far_close_threshold': 0.15,
'weight_max': 100000.0,
'close_produce': 150,
'fingerprint': 'MORGAN',
'accept_min': 50,
'source': 'COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C',
'target': 'CCN(CC)CCOC(=O)C1=CC=C(N)C=C1',
'weight_min': 0.0,
'non_producing_survive': 5,
'accept_max': 100,
'operators': (
'ADD_ATOM',
'ADD_BOND',
'BOND_CONTRACTION',
'BOND_REROUTE',
'INTERLAY_ATOM',
'MUTATE_ATOM',
'REMOVE_ATOM',
'REMOVE_BOND'
),
'far_produce': 80,
'similarity': 'TANIMOTO'
}
As we can see there is quite a lot of parameters that we can set.
We will explain the most important ones in this tutorial, but you can see the
documentation for the ExplorationData
class
(especially Table 1.1) for a detailed reference.
We can adjust the morphing parameters during runtime as we like. All we need to do is just supply the params
attribute
of our tree instance with a new dictionary:
# change selected parameters using a dictionary
tree.params = {
'non_producing_survive' : 2
'weight_max' : 500.0
}
print(tree.params)
Output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | {
'max_morphs_total': 1500,
'far_close_threshold': 0.15,
'weight_max': 500.0,
'close_produce': 150,
'fingerprint': 'MORGAN',
'accept_min': 50,
'source': 'COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C',
'target': 'CCN(CC)CCOC(=O)C1=CC=C(N)C=C1',
'weight_min': 0.0,
'non_producing_survive': 2,
'accept_max': 100,
'operators': (
'ADD_ATOM',
'ADD_BOND',
'BOND_CONTRACTION',
'BOND_REROUTE',
'INTERLAY_ATOM',
'MUTATE_ATOM',
'REMOVE_ATOM',
'REMOVE_BOND'
),
'far_produce': 80,
'similarity': 'TANIMOTO'
}
|
Here we just tightened the constraints on molecular weight and decreased the number of acceptable ‘non-producing’ morph generations to 2 (see Table 1.1 to get more information on what this parameter does). If we supply an incomplete set of parameters (like in the above example), only the parameters specified in the given dictionary will be changed.
Warning
Changing individual values in the params
dictionary will have no effect.
You always need to store a dictionary instance in it. This is because the value
is regenerated every time the attribute is accessed to always reflect the current
set of parameters valid for the current instance.
See also
2.3. Generating Morphs and Extending the Exploration Tree¶
This part of the tutorial outlines all the steps involved in generating new morphs from the current leaves of the tree. It also describes how the generated morphs can be filtered using the value of the default objective function (structural distance from the target molecule), the tree extended and the unfavorable paths (or their parts) removed from the tree.
2.3.1. Generating and Manipulating Morphs¶
Now that we know how to initialize an exploration tree and how to set morphing parameters, we will take a look at how the chemical space exploration works in practice.
Let us generate a few morphs
from the current leaves of the tree first:
1 2 3 4 5 | print(tree.leaves) # show the current leaves of the tree (only the source so far)
print(tree.leaves[0].getSMILES())
tree.generateMorphs() # generate new morphs
print(tree.candidates)
print(len(tree.candidates))
|
Output:
(<molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6d930>,)
COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
Generated 60 morphs.
(<molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6d930>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6d8d0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6d960>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6d990>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6d9c0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6d9f0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6da20>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6da50>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6da80>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dab0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dae0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6db10>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6db40>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6db70>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dba0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dbd0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dc00>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dc30>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dc60>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dc90>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dcc0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dcf0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dd20>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dd50>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dd80>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6ddb0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dde0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6de10>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6de40>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6de70>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dea0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6ded0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6df00>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6df30>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6df60>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6df90>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8a6dfc0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2030>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2060>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2090>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af20c0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af20f0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2120>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2150>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2180>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af21b0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af21e0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2210>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2240>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2270>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af22a0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af22d0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2300>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2330>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2360>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2390>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af23c0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af23f0>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2420>, <molpher.core.MolpherMol.MolpherMol at 0x7f37a8af2450>)
60
The code in Listing 2.3 first tells the tree to return its current leaves.
As we only have one molecule in the tree (our cocaine source molecule),
the molpher.core.ExplorationTree.ExplorationTree.leaves
member
only contains one element. We can verify that it is indeed our cocaine by asking
the underlying MolpherMol
instance for its SMILES
using the smiles
attribute.
Hint
The MolpherMol
class has a lot of useful attributes that
can often be written into as well. You can easily replace the computed value of the objective
function with your own, for example. See the documentation for the MolpherMol
class to get an overview of how you can modify the molecules in the tree or the generated candidate morphs.
The generateMorphs()
method tells the tree to generate some morphs
from the current leaves for us. How many morphs will be generated depends
mostly on the current state of the tree
and parameters far_produce
, close_produce
and far_close_threshold
.
However, it also depends on other factors. For example, some structures
might not be parsed correctly and hence might be removed from the list after the morphs are generated.
Also, a different number of morphs can be generated each time the method is run. That si due to
the non-deterministic character of the morphing algorithm which chooses the morphing operators to
use and parts of the structure to modify randomly.
We can access the newly generated morphs from the candidates
member of the tree instance. It is a tuple
of MolpherMol
instances.
MolpherMol
instances serve
as proxy objects for the underlying C++ representation of molecules generated by Molpher.
Hence, these instances can be used to read and manipulate compounds currently present in a tree
or the generated morphs.
Attention
The molecules saved in the candidates
attribute of the tree actually do not
belong to the tree just yet. See Extending and Pruning for more information on
how it is with tree ownership of molecules.
You can tell any molecule to replicate itself by calling its copy()
method.
The replicated instance is never bound to any tree and all the changes made to the copy are only made to the copy
and not the original instance. This is useful when we want to save the molecule from a tree for later use
and make sure that its state does not change as we keep growing the tree in our program.
The following code example illustrates how we can change and copy MolpherMol
instances:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | # get the first morph in the candidate list
candidate = tree.candidates[0]
# print distance to target
print(tree.candidates[0].dist_to_target)
# set new distance to target
candidate.dist_to_target = 0.5
# look in the list of candidates and print new distance
print(tree.candidates[0].dist_to_target)
print()
# make a copy of our molecule
candidate_copy = candidate.copy()
# set a new distance for the copy and verify that the original was not affected
print(candidate_copy.dist_to_target)
candidate_copy.dist_to_target = 0.7
print(candidate_copy.dist_to_target)
print(candidate.dist_to_target)
print(tree.candidates[0].dist_to_target)
|
Output:
0.828125
0.5
0.5
0.7
0.5
0.5
2.3.2. Sorting and Filtering Morphs¶
Sometimes the order of the newly generated molecules in the candidates
list might have a meaning to us;
for example, this order is used by the built-in probability filter (see PROBABILITY
for details)
to compute the probabilities of survival for each candidate.
As of yet, the only way to sort the
generated morphs is by calling the sortMorphs()
method on the tree instance or using the SortMorphsOper
operation
(see Tree Operations for more). This sorts the molecules in the order of increasing value
of the objective function (distance from the target molecule by default).
Let us now sort the candidate morphs in our tree:
1 2 3 4 5 6 7 8 9 10 11 | # sort the candidates in the tree according to their distance from target
tree.sortMorphs()
# verify
print(tree.candidates_mask)
print(
[
(x.smiles, x.dist_to_target)
for idx,x in enumerate(tree.candidates)
]
)
|
Output:
(True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True)
[('COC(=O)C1C2CCC(CC1(O)OC(=O)C1=CC=CC=C1)N2C', 0.5), ('COC(=O)C1C2C3CN2C(C3)CC1OC(=O)C1=CC=CC=C1', 0.7868852459016393), ('CCC1CC(OC(=O)C2=CC=CC=C2)C(C(=O)OC)CN1C', 0.7868852459016393), ('COC(=O)C1C(OC(=O)C2=CC=CC=C2)CC2CCN1N2C', 0.7868852459016393), ('CCC1C(C(=O)OC)C(OC(=O)C2=CC=CC=C2)CCN1C', 0.7868852459016393), ('COC(=O)C1C2C(=O)CC(CC1OCC1=CC=CC=C1)N2C', 0.7903225806451613), ('COC(=O)C1C2CC(=O)C(CC1OCC1=CC=CC=C1)N2C', 0.7936507936507937), ('CN1C2CCC13CC(OC(=O)C1=CC=CC=C1)C2C(=O)OC3', 0.8), ('COC(=O)C1C2CCC(NC1OC(=O)C1=CC=CC=C1)N2C', 0.8064516129032258), ('COC(=O)C1C2CC(N)C(CC1OC(=O)C1=CC=CC=C1)N2C', 0.8064516129032258), ('COCC1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C', 0.8064516129032258), ('COC(=O)C1C2CCC(C1OC(=O)C1=CC=CC=C1)N2C', 0.8103448275862069), ('COC(=O)CC(CC1CCCN1C)OC(=O)C1=CC=CC=C1', 0.8125), ('COCC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C', 0.8125), ('CNC1CCCCC(OC(=O)C2=CC=CC=C2)C1C(=O)OC', 0.8135593220338984), ('COC(=O)C1C(C)N(C)C(C)CC1OC(=O)C1=CC=CC=C1', 0.8166666666666667), ('CN1C2CCC1C(C(=O)O)C(OC(=O)C1=CC=CC=C1)C2', 0.8166666666666667), ('COC(=O)C1C2CC(CC1OC(=O)C1=CC=CC=C1)N2C', 0.8166666666666667), ('CN1C2CCC13COC(=O)C3C(OC(=O)C1=CC=CC=C1)C2', 0.8181818181818181), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2', 0.819672131147541), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)C2C', 0.819672131147541), ('COC(=O)C1C(OC(=O)C2=CC=CC=C2)=CC2CCC1N2C', 0.8225806451612903), ('COC(=O)C1C2CCC(OC1OC(=O)C1=CC=CC=C1)N2C', 0.8225806451612903), ('COC(=O)C1C2CC3CN2C3CC1OC(=O)C1=CC=CC=C1', 0.8225806451612903), ('CN1C2CCC1C(C(=O)OO)C(OC(=O)C1=CC=CC=C1)C2', 0.8225806451612903), ('COC(=O)C1C2CCC(CC1OCC1=CC=CC=C1)N2C', 0.8253968253968254), ('COC(=O)C1(C)C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C', 0.8253968253968254), ('COC(=O)C1C2NCC(CC1OC(=O)C1=CC=CC=C1)N2C', 0.8253968253968254), ('COC(=O)C1C2CCC(C(=O)C1OCC1=CC=CC=C1)N2C', 0.828125), ('COC(=O)C1C2OCC(CC1OC(=O)C1=CC=CC=C1)N2C', 0.828125), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=NC=C1)N2C', 0.828125), ('COC(=O)C1C2CCC(CC1OOC(=O)C1=CC=CC=C1)N2C', 0.828125), ('COC(=O)C1C2CCC(CC1(C)OC(=O)C1=CC=CC=C1)N2C', 0.828125), ('COC(=O)C1C(OC(=O)C2=CC=CC=C2)C2C3CCC12N3C', 0.828125), ('COC(=O)C1C2CCC(CN2C)CC1OC(=O)C1=CC=CC=C1', 0.828125), ('COC(=O)C1C2CCC(C3C4=C(C=CC=C4)C(=O)OC13)N2C', 0.828125), ('COC(=O)C1C2CCC(COC1OC(=O)C1=CC=CC=C1)N2C', 0.8307692307692307), ('COC(=O)C1C2CCC(NCC1OC(=O)C1=CC=CC=C1)N2C', 0.8307692307692307), ('COC(=O)C1OC(OC(=O)C2=CC=CC=C2)CC2CCC1N2C', 0.8307692307692307), ('COC(=O)C1C(OC(=O)C2=CC=CC=C2)CC2CCOC1N2C', 0.8333333333333334), ('COC12CCC(CC(OC(=O)C3=CC=CC=C3)C1C=O)N2C', 0.8333333333333334), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CNC=C1)N2C', 0.8382352941176471), ('COC(=O)C1C2CCC(CC1ONC(=O)C1=CC=CC=C1)N2C', 0.8461538461538461), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=C=C1)N2C', 0.8507462686567164), ('COC(=O)C1C2CCC(CC1OC=O)N2CC1=CC=CC=C1', 0.8507462686567164), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC(O)=C1)N2C', 0.8529411764705882), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CN=CC=C1)N2C', 0.855072463768116), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CCC(C)C=C1)N2C', 0.855072463768116), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CN=C1)N2C', 0.855072463768116), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C=C1)N2C', 0.8571428571428572), ('COC(=O)C1C2CCC(CC1OC(=O)C1=COC=CC=C1)N2C', 0.8571428571428572), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1C)N2C', 0.8656716417910448), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC1)N2C', 0.8656716417910448), ('COC(=O)C1C2CCC(CC1OC(=O)C1=C(O)C=CC=C1)N2C', 0.8656716417910448), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=C3C=C31)N2C', 0.8676470588235294), ('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=N1)N2C', 0.8695652173913043), ('CNC1CCC2C3=CC=CC(=C3)C(=O)OC(C1)C2C(=O)OC', 0.8695652173913043), ('COC(=O)C1C2CCC(CC1OC(=O)C1=NC=CC=C1)N2C', 0.8695652173913043), ('CC=CC=CCC(=O)OC1CC2CCC(C1C(=O)OC)N2C', 0.9142857142857143), ('C=CC=CC=CC(=O)OC1CC2CCC(C1C(=O)OC)N2C', 0.9285714285714286)]
When the list of candidates is populated and sorted, we need to choose the morphs that will form the next generation (the next leaves of the tree from which new morphs can be generated). Here is an example implementation of a very simple filtering procedure:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | # print the current candidates mask (all positions are on by default)
print(tree.candidates_mask)
# accept only the first three morphs in the sorted list (those with the lowest distance to target)
mask = [False for x in tree.candidates_mask]
mask[0] = True
mask[1] = True
mask[2] = True
# save the new mask to the tree
tree.candidates_mask = mask
# show results
print(tree.candidates_mask)
print(
[
(x.smiles, x.dist_to_target)
for idx,x in enumerate(tree.candidates)
if tree.candidates_mask[idx] # get accepted molecules only
]
)
|
Output:
(True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True)
(True, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False)
[('COC(=O)C1C2CCC(CC1(O)OC(=O)C1=CC=CC=C1)N2C', 0.5), ('COC(=O)C1C2C3CN2C(C3)CC1OC(=O)C1=CC=CC=C1', 0.7868852459016393), ('CCC1CC(OC(=O)C2=CC=CC=C2)C(C(=O)OC)CN1C', 0.7868852459016393)]
In Listing 2.6 candidates_mask
member can be easily changed by writing
a list
or a tuple
of new values into it. Here we simply select the first three morphs as the new morph generation.
Note that the list of selected morphs is sorted from the molecule closest to the target to the farthest,
because we called the sortMorphs()
method previously.
Note
The new mask must be the same length as the candidates
member. If this requirement
is not satisfied, an instance of RuntimeError
is raised.
Warning
The mask should only be set after the morphs are sorted. If the mask is set and the order of morphs is changed, the mask will stay the same and will have to be updated to follow the new order.
See also
The library implements a few built-in filters. You can use the
filterMorphs()
method to invoke them. See the method’s documentation for more information
on the available filtering options.
2.3.3. Extending and Pruning¶
When we have the morphs we want to attach to the tree selected, we can call extend()
which will connect them to their respective parents making them the new leaves
of our tree:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | # get the number of generations before
print(tree.generation_count)
tree.extend() # connect the accepted morphs to the tree as new leaves
print(
sorted( # grab the new leaves as a list sorted according to their distance from target
[
(x.getSMILES(), x.getDistToTarget())
for x in tree.leaves
], key=lambda item : item[1]
)
)
# get the number of generations after
print(tree.generation_count)
# check if a path was found
print(tree.path_found)
# run the pruning operation on the updated tree
tree.prune()
|
Output:
0
[('COC(=O)C1C2CCC(CC1(O)OC(=O)C1=CC=CC=C1)N2C', 0.5), ('COC(=O)C1C2C3CN2C(C3)CC1OC(=O)C1=CC=CC=C1', 0.7868852459016393), ('CCC1CC(OC(=O)C2=CC=CC=C2)C(C(=O)OC)CN1C', 0.7868852459016393)]
1
False
We can see that after extending the tree, the selected morphs (see Listing 2.6)
had become the new leaves and that the tree’s
morph generation counter (generation_count
) was increased by one.
Because a tree generated in this way grows exponentially, a pruning strategy is needed in order to keep the number of explored putative paths to a minimum by discarding those that are not getting any closer to the target molecule.
We call the molecule that have not generated any morphs closer to the target than itself
a non-producing molecule and we can set
the number of generations to wait before removing its descendents
with the non_producing_survive
parameter.
Tree pruning can be requested anytime by calling the prune()
method. In our example, the method didn’t prune
any paths, because the non_producing_survive
parameter is set to 2 generations in this particular instance.
See also
In addition to the non_producing_survive
parameter, there is the max_morphs_total
parameter,
which imposes a restriction on the maximum number of
descendents of one non-producing molecule. If the number of descendents
reaches this threshold, the molecule is removed along with the descendents.
Hint
We could now repeat the process of generating, filtering, extending and pruning
and have Molpher iteratively explore further in chemical space. We could also read the path_found
member
at every iteration to check for the presence of the target molecule in the tree and terminate the process,
if a it is present. This would give us one complete implementation of a chemical space exploration algorithm.
Now we know everything that is needed to implement a chemical space exploration algorithm with the Molpher-lib library. In the following sections, we describe more advanced topics and introduce some built-in features of the library that can help to make some more complex tasks (such as tree serialization) easier.
2.4. Tree Operations¶
We call every action that is performed on an exploration tree a tree operation.
This concept is represented in the library with the TreeOperation
abstract class and it
becomes useful when we run into a situation where we need to build
several exploration trees at once, want to reuse some existing code or store some interim results
of an ongoing exploration.
We can run any operation on a tree simply by supplying it to the
runOperation()
method of the ExplorationTree
class. Here is how to implement
the same workflow as in the preceding sections using operations:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 | from molpher.core.operations import *
class MyFilterMorphs(TreeOperation):
"""
A custom tree operation that accepts
only the first three morphs
(those with the lowest distance to target).
"""
def __call__(self):
"""
This method is called automatically by the tree.
The tree this operation is being run on is accessible
from `self.tree`.
"""
mask = [False for x in self.tree.candidates_mask]
mask[0] = True
mask[1] = True
mask[2] = True
self.tree.candidates_mask = mask
tree = ETree.create(source=cocaine, target=procaine) # create the tree
# this list of tree operations defines one iteration
iteration = [
GenerateMorphsOper()
, SortMorphsOper()
, MyFilterMorphs()
, ExtendTreeOper()
, PruneTreeOper()
]
# apply the operations in the list one by one
for oper in iteration:
tree.runOperation(oper)
# observe the results
print(tree.generation_count)
print(tree.path_found)
print(
sorted( # grab the new leaves as a list sorted according to their distance from target
[
(x.getSMILES(), x.getDistToTarget())
for x in tree.leaves
], key=lambda x : x[1]
)
)
|
Output:
Generated 67 morphs.
1
False
[('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=C(N)C=C1)N2C', 0.7068965517241379), ('COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C', 0.7142857142857143), ('COC(=O)C1C(COC(=O)C2=CC=CC=C2)C2CCC1N2C', 0.7586206896551724)]
Note
Because the morphing algorithm is not deterministic and we initilized a new tree, the set of obtained morphs is different from the one in the previous examples.
Most of the operations in Listing 2.8 are built-in operations (discussed below), but
we chose to define our own operation for the filtering step
(see the highlighted lines). We simply created a subclass of the TreeOperation
abstract class and we overrode its
__call__()
method with the implementation we want.
Each operation can have a tree associated with it, but it is not necessary.
We can verify if a tree is associated with an operation by calling
its getTree()
method or accessing the TreeOperation.tree
attribute of the class.
If there is no tree associated with the instance, they both return None
.
Note
The built-in operations will raise a RuntimeError
, if invoked without a tree attached to them.
2.4.1. Built-in Operations¶
A few operations are already defined in the library:
They are all dervied from TreeOperation
and contain
the full set of operations performed on a tree in
the original Molpher algorithm as published in [1]. Therefore, the original algorithm can be
implemented using those operations.
In the next part of the tutorial, we will pay particular attention to the
TraverseOper
operation. It differs
from the others, because it uses a callback function to perform actions on molecules
in the tree and is, therefore, very useful for debugging and saving exporting various data (see Traversing the Tree).
For more details on the other operations, see the designated pages in the documentation.
[1] | Hoksza D., Škoda P., Voršilák M., Svozil D. (2014) Molpher: a software framework for systematic chemical space exploration. J Cheminform. 6:7. PubMed, DOI |
2.4.1.1. Traversing the Tree¶
A special place among the operations belongs to the TraverseOper
class. It does not directly implement a part
of a morphing algorithm, but serves as a means of traversing molecules in a tree and reading/modifying them
as needed:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | from molpher.core.operations.callbacks import TraverseCallback
class MyCallback(TraverseCallback):
"""
This callback just prints some information
about the molecules in the tree.
"""
def __call__(self, morph):
"""
Method called on each morph in the tree
-- starting from the root to leaves.
"""
if not morph.getParentSMILES():
print("# Root #")
else:
print('# Morph #')
print('Parent:', morph.getParentSMILES())
print('SMILES: ', morph.getSMILES())
print('Descendents: ', morph.getDescendants())
callback = MyCallback() # initialize a callback
traverse = TraverseOper(callback) # attach it to a tree traversal operation
tree.runOperation(traverse) # run the operation
|
Output:
# Root #
SMILES: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
Descendents: ('COC(=O)C1C(COC(=O)C2=CC=CC=C2)C2CCC1N2C', 'COC(=O)C1C2CCC(CC1OC(=O)C1=CC=C(N)C=C1)N2C', 'COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C')
# Morph #
Parent: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
# Morph #
Parent: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
SMILES: COC(=O)C1C(COC(=O)C2=CC=CC=C2)C2CCC1N2C
Descendents: ()
# Morph #
Parent: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
SMILES: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=C(N)C=C1)N2C
SMILES: COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C
Descendents: ()
Descendents: ()
Note
The tree traversal algorithm uses multiple threads. Therefore, the output might not be synchronized.
In Listing 2.9 we derive from the TraverseCallback
class, an abstract class with
an abstract method __call__()
. This method takes one argument,
which is a MolpherMol
instance
of a molecule in the tree. We need to override this method in our derived class in order to implement our own
behaviour.
The callback is then associated with a TraverseOper
instance, which can be run on a tree as any other
tree operation. When the operation is run it traverses the tree from the root to the leaves and injects
every molecule it encounters into our implementation of the
__call__()
method.
Note
We can also pass a SMILES string to the TraverseOper
constructor. In that case, a subtree will be traversed
using the specified molecule as the root of the subtree.
There is also a much more convenient way to traverse the tree. Because, the ExplorationTree
class implements
the molpher.core.ExplorationTree.ExplorationTree.traverse()
method, we can simply take any python callable and use it instead of the
__call__()
method.
However, under the hood it does the same thing as we did in Listing 2.9.
Therefore, the above code can be turned into:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | def process(morph):
"""
Prints some information
about the molecules in the tree.
"""
if not morph.getParentSMILES():
print("# Root #")
else:
print('# Morph #')
print('Parent:', morph.getParentSMILES())
print('SMILES: ', morph.getSMILES())
print('Descendents: ', morph.getDescendants())
tree.traverse(process) # use the traverse method to run the callback function
|
Output:
# Root #
SMILES: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
Descendents: ('COC(=O)C1C(COC(=O)C2=CC=CC=C2)C2CCC1N2C', 'COC(=O)C1C2CCC(CC1OC(=O)C1=CC=C(N)C=C1)N2C', 'COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C')
# Morph #
Parent: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
SMILES: COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C
Descendents: ()
# Morph #
Parent: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
SMILES: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=C(N)C=C1)N2C
Descendents: ()
# Morph #
Parent: COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
SMILES: COC(=O)C1C(COC(=O)C2=CC=CC=C2)C2CCC1N2C
Descendents: ()
2.5. Tree Templates and Snapshots¶
We don’t always have to initialize morphing parameters by hand. We can use an XML template instead.
Here is an example of a template file (you can download this one
from here
):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | <!--
Template can be loaded instead of a snapsthot to initilize a tree.
It only contains information about the morphing parameters, however.
IMPORTANT: The name of a template file must end with '-template.xml'.
Otherwise it will be interpreted as a snapshot and the resulting tree
will be invalid.
-->
<iteration>
<source>
<smile>CN1[C@H]2CC[C@@H]1[C@@H](C(=O)OC)[C@@H](OC(=O)c1ccccc1)C2</smile>
</source>
<target>
<smile>O=C(OCCN(CC)CC)c1ccc(N)cc1</smile>
</target>
<fingerprint>FP_MORGAN</fingerprint>
<similarity>SC_TANIMOTO</similarity>
<param>
<nonProducingSurvive>2</nonProducingSurvive>
<acceptMin>50</acceptMin>
<acceptMax>100</acceptMax>
<farProduce>80</farProduce>
<closeProduce>150</closeProduce>
<farCloseThreashold>0.15</farCloseThreashold>
<maxMorhpsTotal>1500</maxMorhpsTotal>
<nonProducingSurvive>2</nonProducingSurvive>
<weightMin>0.0</weightMin>
<weightMax>500.0</weightMax>
</param>
</iteration>
|
An XML template is similar to a configuration file and can be loaded just like a snapshot (see Listing 2.13), but the resulting tree will only contain the source molecule as its root.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | template_file = 'cocaine-procaine-template.xml'
tree = ETree.create(template_file)
print(tree.params)
# apply the tree operations
for oper in iteration:
tree.runOperation(oper)
print(
sorted( # grab the new leaves as a list sorted according to their distance from target
[
(x.getSMILES(), x.getDistToTarget())
for x in tree.leaves
], key=lambda x : x[1]
)
)
# save the tree in a snapshot file
tree.save('snapshot.xml')
|
Output:
The new iteration has been created from template:
source: CN1[C@H]2CC[C@@H]1[C@@H](C(=O)OC)[C@@H](OC(=O)c1ccccc1)C2
target: O=C(OCCN(CC)CC)c1ccc(N)cc1
{
'max_morphs_total': 1500,
'far_close_threshold': 0.15,
'weight_max': 500.0,
'close_produce': 150,
'fingerprint': 'MORGAN',
'accept_min': 50,
'source': 'COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C',
'target': 'CCN(CC)CCOC(=O)C1=CC=C(N)C=C1',
'weight_min': 0.0,
'non_producing_survive': 2,
'accept_max': 100,
'operators': (
'ADD_ATOM',
'ADD_BOND',
'BOND_CONTRACTION',
'BOND_REROUTE',
'INTERLAY_ATOM',
'MUTATE_ATOM',
'REMOVE_ATOM',
'REMOVE_BOND'
),
'far_produce': 80,
'similarity': 'TANIMOTO'
}
Generated 66 morphs.
[('CN1C2CCC1C(C(=O)OCN)C(OC(=O)C1=CC=CC=C1)C2', 0.7777777777777778), ('CCN1C2CCC1C(C(=O)OC)C(OC(=O)C1=CC=CC=C1)C2', 0.7936507936507937), ('CN1C2CCC1C(C(=O)ON)C(OC(=O)C1=CC=CC=C1)C2', 0.8064516129032258)]
In the above example we loaded an XML template, created a tree from it, extended the tree and serialized it as a snapshot. We can see that all the parameters are the same as in the XML template and that the resulting tree can be built using the same list of operations as in Listing 2.8.
In Listing 2.12 we also serialized our tree instance to disk
with the save()
method.
The saved tree can be later reconstructed with the
create()
factory method:
1 2 3 4 5 6 7 8 9 10 | new_tree = ETree.create('snapshot.xml') # create a new tree from the saved snapshot
print(new_tree.params)
print(
sorted( # grab the leaves in the created tree (these should be the same as those in the original tree)
[
(x.getSMILES(), x.getDistToTarget())
for x in new_tree.leaves
], key=lambda x : x[1]
)
)
|
Output:
Snapshot successfully created from: snapshot.xml
{
'max_morphs_total': 1500,
'far_close_threshold': 0.15,
'weight_max': 500.0,
'close_produce': 150,
'fingerprint': 'MORGAN',
'accept_min': 50,
'source': 'COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C',
'target': 'CCN(CC)CCOC(=O)C1=CC=C(N)C=C1',
'weight_min': 0.0,
'non_producing_survive': 2,
'accept_max': 100,
'operators': (
'ADD_ATOM',
'ADD_BOND',
'BOND_CONTRACTION',
'BOND_REROUTE',
'INTERLAY_ATOM',
'MUTATE_ATOM',
'REMOVE_ATOM',
'REMOVE_BOND'
),
'far_produce': 80,
'similarity': 'TANIMOTO'
}
[('CN1C2CCC1C(C(=O)OCN)C(OC(=O)C1=CC=CC=C1)C2', 0.7777777777777778), ('CCN1C2CCC1C(C(=O)OC)C(OC(=O)C1=CC=CC=C1)C2', 0.7936507936507937), ('CN1C2CCC1C(C(=O)ON)C(OC(=O)C1=CC=CC=C1)C2', 0.8064516129032258)]
2.6. Example Exploration Algorithm Implementations¶
Now we wrap up this tutorial with two simple chemical space exploration implementations (see Listing 2.14 and Listing 2.15).
2.6.1. Using the Tutorial to Implement a Search Algorithm¶
In the tutorial we showed how to implement one step of a simple algorithm that searches for a path in chemical space between cocaine and procaine. Transforming the code into a full exploration algorithm is pretty much straightforward:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 | """
Complete morphing example create from the tutorial code.
"""
from molpher.core import ExplorationTree as ETree
from molpher.core.operations import *
cocaine = 'CN1[C@H]2CC[C@@H]1[C@@H](C(=O)OC)[C@@H](OC(=O)c1ccccc1)C2'
procaine = 'O=C(OCCN(CC)CC)c1ccc(N)cc1'
class MyFilterMorphs(TreeOperation):
def __init__(self):
super(MyFilterMorphs, self).__init__()
def __call__(self):
mask = [False for x in self.tree.candidates_mask]
mask[0] = True
mask[1] = True
mask[2] = True
self.tree.candidates_mask = mask
def main():
iteration = [
GenerateMorphsOper()
, SortMorphsOper()
, MyFilterMorphs()
, ExtendTreeOper()
, PruneTreeOper()
]
tree = ETree.create(source=cocaine, target=procaine)
counter = 0
while not tree.path_found:
for oper in iteration:
tree.runOperation(oper)
counter+=1
print("Iteration", counter)
print(
sorted(
[
(x.getSMILES(), x.getDistToTarget())
for x in tree.leaves
], key=lambda x : x[1]
)
)
path = []
current = tree.fetchMol(tree.params['target'])
path.append(current)
while current != '':
current = current.getParentSMILES()
if current:
current = tree.fetchMol(current)
path.append(current)
path.reverse()
print("Path found: ")
for mol in path:
print(mol.getSMILES(), mol.getDistToTarget())
if __name__ == "__main__":
exit(main())
|
Output:
Loading data from: /home/sichom/Projects/Molpher/Molpher_repo/molpher-lib/python/molpher/swig_wrappers/SAScore.dat
loading SAScore.dat ... done
Parse molecule CN1[C@H]2CC[C@@H]1[C@@H](C(=O)OC)[C@@H](OC(=O)c1ccccc1)C2 >> COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C
Parse molecule O=C(OCCN(CC)CC)c1ccc(N)cc1 >> CCN(CC)CCOC(=O)C1=CC=C(N)C=C1
Iteration 1
[('COC(=O)C1C2CCC(CC1OC(=O)C1=CC=C(N)C=C1)N2C', 0.7068965517241379), ('COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C', 0.7142857142857143), ('COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1', 0.728813559322034)]
Iteration 2
[('COC(=O)CC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C', 0.5769230769230769), ('CCC(CC(=O)OC)N(C)CCCOC(=O)C1=CC=CC=C1', 0.6415094339622642), ('CCC1C(C(=O)OC)C(OC(=O)C2=CC=C(N)C=C2)CCN1C', 0.6666666666666667), ('COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1', 0.728813559322034)]
Iteration 3
[('CCC(CCOC(=O)C1=CC=C(N)C=C1)N(C)CCC(=O)OC', 0.4897959183673469), ('CCC(CC(=O)OC)N(C)CCCOC(=O)C1=CC=C(N)C=C1', 0.4897959183673469), ('COCCC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C', 0.5416666666666667), ('CCC(CC(=O)OC)N(C)CCCOC(=O)C1=CC=CC=C1', 0.6415094339622642), ('CCC1C(C(=O)OC)C(OC(=O)C2=CC=C(N)C=C2)CCN1C', 0.6666666666666667), ('COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1', 0.728813559322034)]
Iteration 4
[('CCC(CCOC(=O)C1=CC=C(N)C=C1)N(CC)CCC(=O)O', 0.44680851063829785), ('CCC(CCOC(=O)C1=CC=C(N)C=C1)N(C)CCC(=O)O', 0.4565217391304348), ('CCC(CC(=O)O)N(C)CCCOC(=O)C1=CC=C(N)C=C1', 0.46808510638297873), ('COCCC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C', 0.5416666666666667), ('CCC(CC(=O)OC)N(C)CCCOC(=O)C1=CC=CC=C1', 0.6415094339622642), ('CCC1C(C(=O)OC)C(OC(=O)C2=CC=C(N)C=C2)CCN1C', 0.6666666666666667), ('COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1', 0.728813559322034)]
Iteration 5
[('CCN(CCOC(=O)C1=CC=C(N)C=C1)N(C)CCC(=O)O', 0.3571428571428571), ('CCC(CC(=O)O)N(C)CCOC(=O)C1=CC=C(N)C=C1', 0.40909090909090906), ('CCN(CCC(=O)O)C(C)CCOC(=O)C1=CC=C(N)C=C1', 0.4347826086956522), ('COCCC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C', 0.5416666666666667), ('CCC(CC(=O)OC)N(C)CCCOC(=O)C1=CC=CC=C1', 0.6415094339622642), ('CCC1C(C(=O)OC)C(OC(=O)C2=CC=C(N)C=C2)CCN1C', 0.6666666666666667), ('COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1', 0.728813559322034)]
Pruned (descendents only): COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1
Iteration 6
[('CN(CCOC(=O)C1=CC=C(N)C=C1)N(C)CCC(=O)O', 0.3157894736842105), ('CCN(CCCOC(=O)C1=CC=C(N)C=C1)CCC(=O)O', 0.31707317073170727), ('CCN(CCC(=O)O)N(C)CCOC(=O)C1=CC=C(N)C=C1', 0.3414634146341463), ('CCC(CC(=O)O)N(C)CCOC(=O)C1=CC=C(N)C=C1', 0.40909090909090906), ('COCCC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C', 0.5416666666666667), ('CCC(CC(=O)OC)N(C)CCCOC(=O)C1=CC=CC=C1', 0.6415094339622642), ('CCC1C(C(=O)OC)C(OC(=O)C2=CC=C(N)C=C2)CCN1C', 0.6666666666666667), ('COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1', 0.728813559322034)]
Candidate morph: CCN(CCCOC(=O)C1=CC=C(N)C=C1)CCC(=O)O already present in the tree. Skipping...
Pruned (descendents only): COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C
Pruned (descendents only): CCC1C(C(=O)OC)C(OC(=O)C2=CC=C(N)C=C2)CCN1C
Iteration 7
[('CCN(CCCO)CCCOC(=O)C1=CC=C(N)C=C1', 0.30000000000000004), ('CN(CCOC(=O)C1=CC=C(N)C=C1)N(C)CCC(=O)O', 0.3157894736842105), ('CCN(CCCOC(=O)C1=CC=C(N)C=C1)CC(=O)O', 0.31707317073170727), ('CCN(CCC(=O)O)N(C)CCOC(=O)C1=CC=C(N)C=C1', 0.3414634146341463), ('CCC(CC(=O)O)N(C)CCOC(=O)C1=CC=C(N)C=C1', 0.40909090909090906), ('COCCC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C', 0.5416666666666667), ('CCC1C(C(=O)OC)C(OC(=O)C2=CC=C(N)C=C2)CCN1C', 0.6666666666666667), ('COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C', 0.7142857142857143), ('COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1', 0.728813559322034)]
Pruned (descendents only): COCCC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C
Iteration 8
[('CCN(CCOC(=O)C1=CC=C(N)C=C1)CC(=O)O', 0.18918918918918914), ('CCN(CCCO)CCOC(=O)C1=CC=C(N)C=C1', 0.21052631578947367), ('CCN(CCCOC(=O)C1=CC=C(N)C=C1)CCOO', 0.275), ('CN(CCOC(=O)C1=CC=C(N)C=C1)N(C)CCC(=O)O', 0.3157894736842105), ('CCN(CCC(=O)O)N(C)CCOC(=O)C1=CC=C(N)C=C1', 0.3414634146341463), ('CCC(CC(=O)O)N(C)CCOC(=O)C1=CC=C(N)C=C1', 0.40909090909090906), ('COCCC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C', 0.5416666666666667), ('CCC1C(C(=O)OC)C(OC(=O)C2=CC=C(N)C=C2)CCN1C', 0.6666666666666667), ('COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C', 0.7142857142857143), ('COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1', 0.728813559322034)]
Iteration 9
[('CCCN(CC)CCOC(=O)C1=CC=C(N)C=C1', 0.1428571428571429), ('CCN(CCOO)CCOC(=O)C1=CC=C(N)C=C1', 0.16666666666666663), ('CCN(CCO)CCOC(=O)C1=CC=C(N)C=C1', 0.16666666666666663), ('CCN(CCOC(=O)C1=CC=C(N)C=C1)CC(=O)O', 0.18918918918918914), ('CN(CCOC(=O)C1=CC=C(N)C=C1)N(C)CCC(=O)O', 0.3157894736842105), ('CCN(CCC(=O)O)N(C)CCOC(=O)C1=CC=C(N)C=C1', 0.3414634146341463), ('CCC(CC(=O)O)N(C)CCOC(=O)C1=CC=C(N)C=C1', 0.40909090909090906), ('COCCC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C', 0.5416666666666667), ('CCC1C(C(=O)OC)C(OC(=O)C2=CC=C(N)C=C2)CCN1C', 0.6666666666666667), ('COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C', 0.7142857142857143), ('COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1', 0.728813559322034)]
Candidate morph: CCCN(CC)CCOC(=O)C1=CC=C(N)C=C1 already present in the tree. Skipping...
Pruned (descendents only): CCC(CC(=O)OC)N(C)CCCOC(=O)C1=CC=C(N)C=C1
Iteration 10
[('CCN(CC)CCOC(=O)C1=CC=C(N)C=C1', 0.0), ('CCN(CO)CCOC(=O)C1=CC=C(N)C=C1', 0.1428571428571429), ('CCN(CCOO)CCOC(=O)C1=CC=C(N)C=C1', 0.16666666666666663), ('CCN(CCOC(=O)C1=CC=C(N)C=C1)CC(=O)O', 0.18918918918918914), ('CN(CCOC(=O)C1=CC=C(N)C=C1)N(C)CCC(=O)O', 0.3157894736842105), ('CCN(CCC(=O)O)N(C)CCOC(=O)C1=CC=C(N)C=C1', 0.3414634146341463), ('CCC(CC(=O)OC)N(C)CCCOC(=O)C1=CC=C(N)C=C1', 0.4897959183673469), ('COCCC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C', 0.5416666666666667), ('CCC1C(C(=O)OC)C(OC(=O)C2=CC=C(N)C=C2)CCN1C', 0.6666666666666667), ('COC(=O)CC1CCC(CCOC(=O)C2=CC=CC=C2)N1C', 0.7142857142857143), ('COC(=O)C1C2CC1N(C)C2CCOC(=O)C1=CC=CC=C1', 0.728813559322034)]
Path found:
COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C 1.7976931348623157e+308
COC(=O)C1C2CCC(CC1OC(=O)C1=CC=C(N)C=C1)N2C 0.7068965517241379
COC(=O)CC1CCC(CCOC(=O)C2=CC=C(N)C=C2)N1C 0.5769230769230769
CCC(CCOC(=O)C1=CC=C(N)C=C1)N(C)CCC(=O)OC 0.4897959183673469
CCC(CCOC(=O)C1=CC=C(N)C=C1)N(CC)CCC(=O)O 0.44680851063829785
CCN(CCC(=O)O)C(C)CCOC(=O)C1=CC=C(N)C=C1 0.4347826086956522
CCN(CCCOC(=O)C1=CC=C(N)C=C1)CCC(=O)O 0.31707317073170727
CCN(CCCO)CCCOC(=O)C1=CC=C(N)C=C1 0.30000000000000004
CCN(CCCO)CCOC(=O)C1=CC=C(N)C=C1 0.21052631578947367
CCCN(CC)CCOC(=O)C1=CC=C(N)C=C1 0.1428571428571429
CCN(CC)CCOC(=O)C1=CC=C(N)C=C1 0.0
Process finished with exit code 0
The above implementation is nothing more than just the tutorial code bits inside a loop. The loop checks if a path was found at each iteration. If the path is found, it backtracks through the tree and prints out a sequence of molecules lying on the path.
2.6.2. Implementing a Bidirectional Search Algorithm¶
The second example we have here is a little bit more elaborate, but implements a very simple idea. Instead of one exploration tree, we build two trees that each search for a path to the closest molecule in the other:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 | """
Implementation of the bidirectional search algorithm.
"""
import time
from molpher.core.ExplorationTree import ExplorationTree as ETree
from molpher.core.operations import *
from molpher.core.selectors import *
def timeit(func):
milliseconds = 1000 * time.clock()
func()
return 1000 * time.clock() - milliseconds
class FindClosest:
def __init__(self):
self.closest = None
def __call__(self, morph):
if not self.closest:
self.closest = morph.copy()
return
current_dist = self.closest.getDistToTarget()
morph_dist = morph.getDistToTarget()
if morph_dist < current_dist:
self.closest = morph.copy()
class BidirectionalPathFinder:
def __init__(self, source, target):
options = {
'fingerprint' : FP_ATOM_PAIRS
}
self.source_target = ETree.create(source=source, target=target)
self.target_source = ETree.create(source=target, target=source)
self.source_target.params = options
self.target_source.params = options
self.source_target_min = FindClosest()
self.target_source_min = FindClosest()
self.ITERATION = [
GenerateMorphsOper()
, SortMorphsOper()
, FilterMorphsOper()
, ExtendTreeOper()
, PruneTreeOper()
]
self.path = []
def _find_path(self, tree, connecting_mol):
path = []
current = tree.fetchMol(connecting_mol)
path.append(current.getSMILES())
while current != '':
current = current.getParentSMILES()
if current:
current = tree.fetchMol(current)
path.append(current.getSMILES())
path.reverse()
return path
def __call__(self):
counter = 0
connecting_molecule = None
while True:
counter+=1
print('Iteration {0}:'.format(counter))
for oper in self.ITERATION:
print('Execution times ({0}):'.format(type(oper).__name__))
source_target_time = timeit(lambda : self.source_target.runOperation(oper))
print('\tsource -> target: {0}'.format(source_target_time))
target_source_time = timeit(lambda : self.target_source.runOperation(oper))
print('\ttarget -> source: {0}'.format(target_source_time))
print('\ttotal time: {0}'.format(source_target_time + target_source_time))
print('Traversal times:')
source_target_time = timeit(lambda : self.source_target.traverse(self.source_target_min))
print('\tsource -> target: {0}'.format(source_target_time))
target_source_time = timeit(lambda : self.target_source.traverse(self.target_source_min))
print('\ttarget -> source: {0}'.format(target_source_time))
print('\ttotal time: {0}'.format(source_target_time + target_source_time))
print('Current Targets:')
print('\tsource to target:', self.source_target.params['target'])
print('\ttarget to source:', self.target_source.params['target'])
print('Current Minima:')
print('\tsource to target:', self.source_target_min.closest.getSMILES(), self.source_target_min.closest.getDistToTarget())
print('\ttarget to source:', self.target_source_min.closest.getSMILES(), self.target_source_min.closest.getDistToTarget())
self.source_target.params = {
'target' : self.target_source_min.closest.getSMILES()
}
self.target_source.params = {
'target' : self.source_target_min.closest.getSMILES()
}
print('New Targets:')
print('\tsource to target:', self.source_target.params['target'])
print('\ttarget to source:', self.target_source.params['target'])
if self.source_target.path_found:
print('Path Found in tree going from source to target:')
connecting_molecule = self.source_target.params['target']
print('Connecting molecule:', connecting_molecule)
assert self.source_target.hasMol(connecting_molecule)
assert self.target_source.hasMol(connecting_molecule)
break
if self.target_source.path_found:
print('Path Found in tree going from target to source:')
connecting_molecule = self.target_source.params['target']
print('Connecting molecule:', connecting_molecule)
assert self.target_source.hasMol(connecting_molecule)
assert self.source_target.hasMol(connecting_molecule)
break
source_target_path = self._find_path(self.source_target, connecting_molecule)
target_source_path = self._find_path(self.target_source, connecting_molecule)
assert source_target_path.pop(-1) == connecting_molecule
target_source_path.reverse()
source_target_path.extend(target_source_path)
self.path = source_target_path
def main():
milliseconds_now = 1000 * time.clock()
cocaine = 'CN1[C@H]2CC[C@@H]1[C@@H](C(=O)OC)[C@@H](OC(=O)c1ccccc1)C2'
procaine = 'O=C(OCCN(CC)CC)c1ccc(N)cc1'
pathfinder = BidirectionalPathFinder(cocaine, procaine)
pathfinder()
print(pathfinder.path)
print('Total Execution Time: {0}'.format(1000 * time.clock() - milliseconds_now))
if __name__ == "__main__":
exit(main())
|
Output (only the final print of the path is shown):
[
'COC(=O)C1C2CCC(CC1OC(=O)C1=CC=CC=C1)N2C',
'CCC1C(C(=O)OC)C(OC(=O)C2=CC=CC=C2)CCN1C',
'CCC1C(COC)C(OC(=O)C2=CC=CC=C2)CCN1C',
'CCC1C(COC)C(OC(=O)C2=CC=CC=C2)CCN1CC',
'CCN1CCC(OC(=O)C2=CC=CC=C2)C(COC)C1C',
'CCN1CC(OC(=O)C2=CC=CC=C2)C(COC)C1C',
'CCN1CC(OC(=O)C2=CC=CC=C2)C(CO)C1C',
'CCC1C(C)N(CC)CC1OC(=O)C1=CC=CC=C1',
'CCC1C(C)C(OC(=O)C2=CC=CC=C2)CN1CC',
'CCC1CC(OC(=O)C2=CC=CC=C2)CN1CC',
'CCC1C(OC(=O)C2=CC=CC=C2)CN1CC',
'CCC1C(OC(=O)C2=CC=C(N)C=C2)CN1CC',
'CCN1CC(OC(=O)C2=CC=C(N)C=C2)C1C',
'CCN(CC)CCOC(=O)C1=CC=C(N)C=C1'
]
Total Execution Time: 153964.77300000002 # in milliseconds
This bidirectional algorithm uses the built-in operations to facilitate the search, but does one extra procedure after an iteration is completed – it changes the target molecules of the trees. When the new leaves are connected both trees are traversed and molecules closest to the current target are identified in each. The closest molecule from one tree is then set as the new target for the tree searching in the opposite direction and vice versa.
In Listing 2.15 we also use the time.clock
function to measure the execution
times of each potentially time-consuming operation.
2.6.3. Implementing the Original Molpher Algorithm¶
We can, of course, also implement the original exploration algorithm from the Molpher program:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 | """
Implementation of the classical search algorithm.
"""
import time
import gc
from molpher.core import ExplorationTree as ETree
from molpher.core.operations import *
def timeit(func):
milliseconds = 1000 * time.clock()
func()
return 1000 * time.clock() - milliseconds
class ClassicPathFinder:
def __init__(self, source, target):
options = {
'fingerprint' : 'ATOM_PAIRS'
}
self.tree = ETree.create(source=source, target=target)
self.ITERATION = [
GenerateMorphsOper()
, SortMorphsOper()
, FilterMorphsOper()
, ExtendTreeOper()
, PruneTreeOper()
]
@property
def path(self):
current = self.tree.fetchMol(self.tree.params['target'])
path = list()
path.append(current.smiles)
while current != '':
current = current.parent_smiles
if current:
current = self.tree.fetchMol(current)
path.append(current.smiles)
path.reverse()
return path
def __call__(self):
counter = 0
connecting_molecule = None
while not self.tree.path_found:
counter+=1
print('Iteration {0}:'.format(counter))
for oper in self.ITERATION:
print('Execution time ({0}):'.format(type(oper).__name__))
time_elapsed = timeit(lambda : self.tree.runOperation(oper))
print('\telapsed time: {0}'.format(time_elapsed))
def main():
milliseconds_now = 1000 * time.clock()
cocaine = 'CN1[C@H]2CC[C@@H]1[C@@H](C(=O)OC)[C@@H](OC(=O)c1ccccc1)C2'
procaine = 'O=C(OCCN(CC)CC)c1ccc(N)cc1'
pathfinder = ClassicPathFinder(cocaine, procaine)
pathfinder()
print(pathfinder.path)
print('Total Execution Time: {0}'.format(1000 * time.clock() - milliseconds_now))
if __name__ == "__main__":
exit(main())
|
The code above implements a simple ClassicPathFinder class that just runs some of the default operations defined in the library on a single tree. Because those operations are based on the original code, no change or configuration is needed.
2.7. Summary¶
If you have read the tutorial all the way to here, you now probably have a decent idea on what the library does and how to use it. If you have any suggestions on how to improve it or bug reports, please send them to sichom@vscht.cz or create an issue on the issue tracker. All help on the project is much appreciated.