QSAR-DB file parser in Python¶
A wonderful resource for anyone interested in reproducibility in research and QSARs (there are at least a dozen of us in the world) is QSAR-DB, from Dr. Uko Maran's lab at the University of Tartu.
A note for the unfamiliar: Quantitative Structure-Activity Relationships (QSARs) are mathematical relationships (regression or classification) between aspects of chemical structure and pharmacological or toxic activity. The goal is to be able to predict the effect of differences in chemical structure on biological or physical activity of the compound.
In a classical QSAR, one has a number of chemical structure files, which are divided into a training and test set. One calculates "descriptors" for the chemical structure files, typically using descriptor calculation software such as DRAGON. Most descriptor calculation software will calculate thousands of descriptors for each molecule. If one is trying to predict a quantitative variable, say, the IC50 of a drug, using just a few of these descriptors, then some variable selection is typically done to prune the number of descriptors available to reduce search space -- perhaps a cutoff in shannon entropy, or removing descriptors that are highly correlated with each other. Typically, the number of descriptors left should be around 5x the number of compounds in the training set. At that point an algorithm (commonly a Genetic Algorithm) will be used to then search the space for the combination of descriptors that performs the best in predicting the activity. Classically, the model constructed will be a multiple linear regression with somewhere between 3-8 descriptors as indepedent variables and the biological/physicial activity (e.g. IC50, toxicity to X species, flash point temperature, etc.)
QSAR-DB is a great resource because it is a repository of the data from previous QSARs that have been published. Of course the researcher who constructed the QSAR is responsible for voluntarily uploading the QSAR to QSAR-DB in its format. The format is the great part for us, because there is structure to each dataset and it is in many cases possible to reproduce the QSAR in question with the QSAR-DB files.
The following code is something I wrote during my grad school days at the ebio lab at SoongSil University (credit to my advisor and mentor Dr. Kwang-Hwi Cho) and it may be useful to another person interested in reproducing QSARs.
import os
import pandas as pd
import xml.etree.ElementTree as ET
import urllib
import copy
'''Contains a qdbrep class. Declare using IQSAR.qdb.qdbrep(/absolute/path/to/unzipped/qsar-db/folder/)
and perform getdescs, getyvals, getinchis, getcas functions on that object.'''
class qdbrep(object):
def __init__(self, dir):
self.dir=dir
def _getsub(self):
'''The _getsub method gets all the subfolders in the qdb folder given'''
return [name for name in os.listdir(self.dir)
if os.path.isdir(os.path.join(self.dir, name))]
def getdescs(self):
'''The getdescs method collates the data from the "descriptors" folder into a pandas dataframe. One can use the .to_csv() method from pandas to export to csv.'''
if "descriptors" in self._getsub():
descfolder=self.dir+"descriptors/"
for root, dirs, files in os.walk(descfolder):
if not dirs:
pass
else:
dfs=[]
for directorynum in range(len(dirs)):
dfs.append(pd.read_csv(descfolder+str(dirs[directorynum])+"/values",sep="\t", index_col=0))
return pd.concat(dfs, axis=1)
else:
raise IOError("No descriptors folder present in this particular QSAR-DB!")
def getyvals(self):
'''Gets all the activity values from the "properties" folder and transposes to a pandas DataFrame'''
if "properties" in self._getsub():
propfolder=self.dir+"properties/"
for root, dirs, files in os.walk(propfolder):
if not dirs:
pass
else:
if len(dirs)==1:
return pd.read_csv(propfolder+str(dirs[0])+"/values",sep="\t", index_col=0)
elif len(dirs)>1:
saffa=[]
for directorynum in range(len(dirs)):
saffa.append(pd.read_csv(propfolder+str(dirs[directorynum])+"/values",sep="\t", index_col=0))
return pd.concat(saffa, axis=1)
else:
raise IOError("No properties folder present in this particular QSAR-DB!")
def getcompounds(self):
'''Extracts information from compounds/compounds.xml to a pandas DataFrame, which can be iterated over.'''
if "compounds" in self._getsub():
xmlfile=self.dir+"compounds/compounds.xml"
if xmlfile.endswith(".xml"):
tree = ET.parse(xmlfile)
root = tree.getroot()
childs=[]
for child in root:
tindex=[ele.tag for ele in child]
for n,i in enumerate(tindex):
junk,notjunk=i.split("}")
tindex[n]=notjunk
childinfo=pd.Series([ele.text for ele in child], index=tindex)#, index=pdin)
childs.append(childinfo)
mas=pd.concat(childs, axis=1,sort=True)
mas2=mas.T
return mas2.set_index(keys="Id", drop=True)
else:
raise TypeError("Input file must be of type XML!")
def getmol(self,folderpath):
'''This command automatically downloads .mol files from the NIST websites to folderpath (must be
written as string, i.e. /absolute/path/to/folder/).
for this to work, the original QSAR-DB must have inchi files.
this relies on the getinchi() method of this class.
This may not work if the inchi is ambiguous and there is more than one NIST mol entry.
Check the folder and the print output to check.'''
nisturl="http://webbook.nist.gov/cgi/cbook.cgi?InChIFile="
inchiseries=self.getcompounds()["InChI"]
import math
if type(folderpath)==str:
for i in inchiseries.index:
if type(inchiseries[i])==float:
print(str(i)+".mol not downloaded")
else:
open(folderpath+str(i)+".mol", 'w').close()
urllib.request.urlretrieve(nisturl+inchiseries[i], folderpath+str(i)+".mol")
# for inchi in inchilist:
# #print nisturl+inchi
# urllib.urlretrieve(nisturl+inchi, folderpath+str(inchilist.index(inchi))+".mol")
# print "saved "+str(folderpath)+"1~"+str(inchilist.index(inchi))+".mol"
else:
raise TypeError("Type of folderpath must be a string!")
A simple example of a QSAR parsed¶
First, download your QSAR of interest from QSAR-DB. For this example, we will use Dulin, F., et al. Interpretation of honeybees contact toxicity associated to acetylcholinesterase inhibitors. Ecotoxicol. Environ. Saf. 2012, 79, 13–21.. This looked at 39 molecules as part of the training set, and 6 in the validation, for a total of 45 chemicals. The paper that was written based on this result can be found here.
Click the download button and extract the .zip folder contents in the normal way. Remember the path of the unzipped file, for you will need to provide the path as an argument to this script.
One declares a qdbrep object by giving the directory as an argument to qdbrep().
qdb_o=qdbrep("/path/2012EES13qdb/")
If one is interested in downloading the 45 .mol files that correspond to the 45 tested in this QSAR, we can do so using the getmol() method. First we make our destination folder for the molecule files:
!mkdir /path/2012EES13qdb/molecules
The following will query NIST for the compounds in the QSAR-DB file and download the .mol files from NIST if available. This may not work if the inchi is ambiguous and there is more than one NIST mol entry. However those cases are (thankfully) relatively rare.
qdb_o.getmol("/path/2012EES13qdb/molecules/")
ls /path/2012EES13qdb/molecules
We can check these mol files to see whether they look correct:
!head /path/2012EES13qdb/molecules/S1.mol
If you have RDKit installed, you can of course take a look at a 2-D depiction. For this, we will just look at structure 1.
from rdkit import Chem
m=Chem.MolFromMolFile("/path/2012EES13qdb/molecules/S1.mol")
from rdkit.Chem import Draw
Draw.MolToImage(m)
Wonderful. The getmol function relies on the InChI that the getcompounds function spits out, and the getcompounds in turn gets this from the compounds.xml file in the compounds folder in the QSAR project. The top of the compounds.xml file looks like this:
!head /path/2012EES13qdb/compounds/compounds.xml
And when parsed using the getcompounds() function, looks like this:
qdb_o.getcompounds()
Similarly, we can get the quantitative descriptors provided in the QSAR in the form of a pandas DataFrame.
qdb_o.getdescs()
Lastly, the y-values that we are trying to predict are also available:
qdb_o.getyvals()
I hope that this code would be of use to someone else who's trying to replicate QSARs and doesn't have enough time to write their own parser. Please let me know if you try out this code and whether you can suggest any improvements and changes!