# A parser in Python for the QSAR-DB file format

A parser for QSAR-DB files

# QSAR-DB file parser in Python¶

A wonderful resource for anyone interested in reproducibility in research and QSARs (there are at least a dozen of us in the world) is QSAR-DB, from Dr. Uko Maran's lab at the University of Tartu.

A note for the unfamiliar: Quantitative Structure-Activity Relationships (QSARs) are mathematical relationships (regression or classification) between aspects of chemical structure and pharmacological or toxic activity. The goal is to be able to predict the effect of differences in chemical structure on biological or physical activity of the compound.

In a classical QSAR, one has a number of chemical structure files, which are divided into a training and test set. One calculates "descriptors" for the chemical structure files, typically using descriptor calculation software such as DRAGON. Most descriptor calculation software will calculate thousands of descriptors for each molecule. If one is trying to predict a quantitative variable, say, the IC50 of a drug, using just a few of these descriptors, then some variable selection is typically done to prune the number of descriptors available to reduce search space -- perhaps a cutoff in shannon entropy, or removing descriptors that are highly correlated with each other. Typically, the number of descriptors left should be around 5x the number of compounds in the training set. At that point an algorithm (commonly a Genetic Algorithm) will be used to then search the space for the combination of descriptors that performs the best in predicting the activity. Classically, the model constructed will be a multiple linear regression with somewhere between 3-8 descriptors as indepedent variables and the biological/physicial activity (e.g. IC50, toxicity to X species, flash point temperature, etc.)

QSAR-DB is a great resource because it is a repository of the data from previous QSARs that have been published. Of course the researcher who constructed the QSAR is responsible for voluntarily uploading the QSAR to QSAR-DB in its format. The format is the great part for us, because there is structure to each dataset and it is in many cases possible to reproduce the QSAR in question with the QSAR-DB files.

The following code is something I wrote during my grad school days at the ebio lab at SoongSil University (credit to my advisor and mentor Dr. Kwang-Hwi Cho) and it may be useful to another person interested in reproducing QSARs.

In [2]:
import os
import pandas as pd
import xml.etree.ElementTree as ET
import urllib
import copy
'''Contains a qdbrep class.  Declare using IQSAR.qdb.qdbrep(/absolute/path/to/unzipped/qsar-db/folder/)
and perform getdescs, getyvals, getinchis, getcas functions on that object.'''
class qdbrep(object):

def __init__(self, dir):
self.dir=dir
def _getsub(self):
'''The _getsub method gets all the subfolders in the qdb folder given'''
return [name for name in os.listdir(self.dir)
if os.path.isdir(os.path.join(self.dir, name))]
def getdescs(self):
'''The getdescs method collates the data from the "descriptors" folder into a pandas dataframe.  One can use the .to_csv() method from pandas to export to csv.'''
if "descriptors" in self._getsub():
descfolder=self.dir+"descriptors/"
for root, dirs, files in os.walk(descfolder):
if not dirs:
pass
else:
dfs=[]
for directorynum in range(len(dirs)):
return pd.concat(dfs, axis=1)

else:
raise IOError("No descriptors folder present in this particular QSAR-DB!")

def getyvals(self):
'''Gets all the activity values from the "properties" folder and transposes to a pandas DataFrame'''
if "properties" in self._getsub():
propfolder=self.dir+"properties/"
for root, dirs, files in os.walk(propfolder):
if not dirs:
pass
else:
if len(dirs)==1:
elif len(dirs)>1:
saffa=[]
for directorynum in range(len(dirs)):
return pd.concat(saffa, axis=1)

else:
raise IOError("No properties folder present in this particular QSAR-DB!")
def getcompounds(self):
'''Extracts information from compounds/compounds.xml to a pandas DataFrame, which can be iterated over.'''
if "compounds" in self._getsub():
xmlfile=self.dir+"compounds/compounds.xml"

if xmlfile.endswith(".xml"):
tree = ET.parse(xmlfile)
root = tree.getroot()
childs=[]
for child in root:
tindex=[ele.tag for ele in child]
for n,i in enumerate(tindex):
junk,notjunk=i.split("}")
tindex[n]=notjunk
childinfo=pd.Series([ele.text for ele in child], index=tindex)#, index=pdin)
childs.append(childinfo)
mas=pd.concat(childs, axis=1,sort=True)
mas2=mas.T
return mas2.set_index(keys="Id", drop=True)

else:
raise TypeError("Input file must be of type XML!")
def getmol(self,folderpath):
'''This command automatically downloads .mol files from the NIST websites to folderpath (must be
written as string, i.e. /absolute/path/to/folder/).
for this to work, the original QSAR-DB must have  inchi files.
this relies on the getinchi() method of this class.
This may not work if the inchi is ambiguous and there is more than one NIST mol entry.
Check the folder and the print output to check.'''
nisturl="http://webbook.nist.gov/cgi/cbook.cgi?InChIFile="
inchiseries=self.getcompounds()["InChI"]
import math

if type(folderpath)==str:
for i in inchiseries.index:
if type(inchiseries[i])==float:
else:
open(folderpath+str(i)+".mol", 'w').close()
urllib.request.urlretrieve(nisturl+inchiseries[i], folderpath+str(i)+".mol")

#  for inchi in inchilist:
#      #print nisturl+inchi
#      urllib.urlretrieve(nisturl+inchi, folderpath+str(inchilist.index(inchi))+".mol")
#      print "saved "+str(folderpath)+"1~"+str(inchilist.index(inchi))+".mol"
else:
raise TypeError("Type of folderpath must be a string!")


# A simple example of a QSAR parsed¶

First, download your QSAR of interest from QSAR-DB. For this example, we will use Dulin, F., et al. Interpretation of honeybees contact toxicity associated to acetylcholinesterase inhibitors. Ecotoxicol. Environ. Saf. 2012, 79, 13–21.. This looked at 39 molecules as part of the training set, and 6 in the validation, for a total of 45 chemicals. The paper that was written based on this result can be found here.

Click the download button and extract the .zip folder contents in the normal way. Remember the path of the unzipped file, for you will need to provide the path as an argument to this script.

One declares a qdbrep object by giving the directory as an argument to qdbrep().

In [3]:
qdb_o=qdbrep("/path/2012EES13qdb/")


If one is interested in downloading the 45 .mol files that correspond to the 45 tested in this QSAR, we can do so using the getmol() method. First we make our destination folder for the molecule files:

In [4]:
!mkdir /path/2012EES13qdb/molecules


The following will query NIST for the compounds in the QSAR-DB file and download the .mol files from NIST if available. This may not work if the inchi is ambiguous and there is more than one NIST mol entry. However those cases are (thankfully) relatively rare.

In [5]:
qdb_o.getmol("/path/2012EES13qdb/molecules/")

In [6]:
ls /path/2012EES13qdb/molecules

S10.mol  S15.mol  S1.mol   S24.mol  S29.mol  S33.mol  S38.mol  S42.mol  S5.mol
S11.mol  S16.mol  S20.mol  S25.mol  S2.mol   S34.mol  S39.mol  S43.mol  S6.mol
S12.mol  S17.mol  S21.mol  S26.mol  S30.mol  S35.mol  S3.mol   S44.mol  S7.mol
S13.mol  S18.mol  S22.mol  S27.mol  S31.mol  S36.mol  S40.mol  S45.mol  S8.mol
S14.mol  S19.mol  S23.mol  S28.mol  S32.mol  S37.mol  S41.mol  S4.mol   S9.mol


We can check these mol files to see whether they look correct:

In [7]:
!head /path/2012EES13qdb/molecules/S1.mol





If you have RDKit installed, you can of course take a look at a 2-D depiction. For this, we will just look at structure 1.

In [13]:
from rdkit import Chem

In [16]:
m=Chem.MolFromMolFile("/path/2012EES13qdb/molecules/S1.mol")

In [18]:
from rdkit.Chem import Draw

In [19]:
Draw.MolToImage(m)

Out[19]:

Wonderful. The getmol function relies on the InChI that the getcompounds function spits out, and the getcompounds in turn gets this from the compounds.xml file in the compounds folder in the QSAR project. The top of the compounds.xml file looks like this:

In [8]:
!head /path/2012EES13qdb/compounds/compounds.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CompoundRegistry xmlns="http://www.qsardb.org/QDB">
<Compound>
<Id>S1</Id>
<Name>Acephate</Name>
<Labels>organophosphate</Labels>
<Cargos></Cargos>
<InChI>InChI=1S/C4H10NO3PS/c1-4(6)5-9(7,8-2)10-3/h1-3H3,(H,5,6,7)</InChI>
</Compound>
<Compound>


And when parsed using the getcompounds() function, looks like this:

In [9]:
qdb_o.getcompounds()

Out[9]:
Cargos Description InChI Labels Name
Id
S1 None NaN InChI=1S/C4H10NO3PS/c1-4(6)5-9(7,8-2)10-3/h1-3... organophosphate Acephate
S2 None Standard InChI is not able to describe the str... InChI=1S/C17H25N3O4S2/c1-5-23-16(21)11-12-20(1... carbamate Alanycarb
S3 None NaN InChI=1S/C7H14N2O2S/c1-7(2,12-4)5-9-11-6(10)8-... carbamate Aldicarb
S4 None NaN InChI=1S/C11H16N2O2/c1-8-7-9(15-11(14)12-2)5-6... carbamate Aminocarb
S5 None Structure given in the article differs from th... InChI=1S/C10H12N3O3PS2/c1-15-17(18,16-2)19-7-1... organophosphate Azinphos-methyl
S6 None NaN InChI=1S/C11H13NO4/c1-11(2)15-8-6-4-5-7(9(8)16... carbamate Bendiocarb
S7 None NaN InChI=1S/C20H30N2O5S/c1-7-25-17(23)11-12-22(14... carbamate Benfuracarb
S8 None InChI has been generated from the structure gi... InChI=1S/C13H19NO2/c1-4-6-10(2)11-7-5-8-12(9-1... carbamate Bufencarb
S9 None NaN InChI=1S/C12H11NO2/c1-13-12(14)15-11-8-4-6-9-5... carbamate Carbaryl
S10 None NaN InChI=1S/C12H15NO3/c1-12(2)7-8-5-4-6-9(10(8)16... carbamate Carbofuran
S11 None NaN InChI=1S/C9H11Cl3NO3PS/c1-3-14-17(18,15-4-2)16... organophosphate Chlorpyrifos
S12 None NaN InChI=1S/C12H21N2O3PS/c1-6-15-18(19,16-7-2)17-... organophosphate Diazinon
S13 None NaN InChI=1S/C4H7Cl2O4P/c1-8-11(7,9-2)10-3-4(5)6/h... organophosphate Dichlorvos
S14 None Structure given in the article differs from th... InChI=1S/C5H12NO3PS2/c1-6-5(7)4-12-10(11,8-2)9... organophosphate Dimethoate
S15 None NaN InChI=1S/C14H14NO4PS/c1-2-18-20(21,14-6-4-3-5-... organophosphate EPN
S16 None NaN InChI=1S/C9H22O4P2S4/c1-5-10-14(16,11-6-2)18-9... organophosphate Ethion
S17 None Original name: Ethoprofos (does not exist in ... InChI=1S/C8H19O2PS2/c1-4-7-12-11(9,10-6-3)13-8... organophosphate Ethoprophos
S18 None NaN InChI=1S/C10H16NO5PS2/c1-11(2)19(12,13)10-7-5-... organophosphate Famphur
S19 None NaN InChI=1S/C13H22NO3PS/c1-6-16-18(15,14-10(2)3)1... organophosphate Fenamiphos
S20 None NaN InChI=1S/C9H12NO5PS/c1-7-6-8(4-5-9(7)10(11)12)... organophosphate Fenitrothion
S21 None NaN InChI=1S/C11H17O4PS2/c1-4-13-16(17,14-5-2)15-1... organophosphate Fensulfothion
S22 None NaN InChI=1S/C10H15O3PS2/c1-8-7-9(5-6-10(8)16-4)13... organophosphate Fenthion
S23 None NaN InChI=1S/C10H15OPS2/c1-3-11-12(13,4-2)14-10-8-... organophosphate Fonofos
S24 None NaN InChI=1S/C11H15N3O2/c1-12-11(15)16-10-6-4-5-9(... carbamate Formetanate
S25 None NaN InChI=1S/C9H18NO3PS2/c1-4-8(3)16-14(12,13-5-2)... organophosphate Fosthiazate
S26 None Structure given in the article differs from th... InChI=1S/C10H19O6PS2/c1-5-15-9(11)7-8(10(12)16... organophosphate Malathion
S27 None Structure given in the article differs from th... InChI=1S/C2H8NO2PS/c1-5-6(3,4)7-2/h1-2H3,(H2,3,4) organophosphate Methamidophos
S28 None NaN InChI=1S/C6H11N2O4PS3/c1-10-5-7-8(6(9)16-5)4-1... organophosphate Methidathion
S29 None NaN InChI=1S/C11H15NO2S/c1-7-5-9(14-11(13)12-3)6-8... carbamate Methiocarb
S30 None NaN InChI=1S/C5H10N2O2S/c1-4(10-3)7-9-5(8)6-2/h1-3... carbamate Methomyl
S31 None NaN InChI=1S/C4H7Br2Cl2O4P/c1-10-13(9,11-2)12-3(5)... organophosphate Naled
S32 None Structure given in the article differs from th... InChI=1S/C7H13N3O3S/c1-8-7(12)13-9-5(14-4)6(11... carbamate Oxamyl
S33 None NaN InChI=1S/C10H14NO5PS/c1-3-14-17(18,15-4-2)16-1... organophosphate Parathion
S34 None NaN InChI=1S/C12H17O4PS2/c1-4-16-12(13)11(10-8-6-5... organophosphate Phenthoate
S35 None NaN InChI=1S/C12H15ClNO4PS2/c1-3-16-19(20,17-4-2)2... organophosphate Phosalone
S36 None NaN InChI=1S/C11H12NO4PS2/c1-15-17(18,16-2)19-7-12... organophosphate Phosmet
S37 None NaN InChI=1S/C10H19ClNO5P/c1-6-12(7-2)10(13)9(11)8... organophosphate Phosphamidon
S38 None NaN InChI=1S/C11H15BrClO3PS/c1-3-7-18-17(14,15-4-2... organophosphate Profenofos
S39 None Structure given in the article differs from th... InChI=1S/C11H15NO3/c1-8(2)14-9-6-4-5-7-10(9)15... carbamate Propoxur
S40 None Structure given in the article differs from th... InChI=1S/C11H18N4O2/c1-7-8(2)12-10(14(3)4)13-9... carbamate Pirimicarb
S41 None NaN InChI=1S/C12H15N2O3PS/c1-3-15-18(19,16-4-2)17-... organophosphate Quinalphos
S42 None NaN InChI=1S/C16H20O6P2S3/c1-17-23(25,18-2)21-13-5... organophosphate Temephos
S43 None NaN InChI=1S/C10H9Cl4O4P/c1-16-19(15,17-2)18-10(5-... organophosphate Tetrachlorvinphos
S44 None NaN InChI=1S/C10H18N4O4S3/c1-7(19-5)11-17-9(15)13(... carbamate Thiodicarb
S45 None NaN InChI=1S/C4H8Cl3O4P/c1-10-12(9,11-2)3(8)4(5,6)... organophosphate Trichlorfon

Similarly, we can get the quantitative descriptors provided in the QSAR in the form of a pandas DataFrame.

In [10]:
qdb_o.getdescs()

Out[10]:
Presence/absence of S - P at topological distance 1 Complementary Information Content of second order index Ghose-Viswanadhan-Wendoloski antiinflammatory-like index at 50% Shadow index component XZ Lowest eigenvalue 3 of Burden matrix weighted by atomic van der Waals volumes Geary autocorrelation of lag 6 weighted by Sanderson electronegativity Moran autocorrelation of lag 1 weighted by mass Mean information index on atomic composition E-state topological parameter Structural Information Content of second order index Presence/absence of C - C at topological distance 5 R-C(=X)-X / R-C#X / X=C=X Shadow index component YZ Geary autocorrelation of lag 1 weighted by mass Dipole moment component X Energy of the lowest unoccupied molecular orbital
Compound Id
S1 1 0.713 0 0.66986 1.317 0.000 0.213 2.023 30.447 0.835 1 1 0.63133 1.103 -2.542640 -0.105160
S2 0 0.890 0 0.66613 1.693 0.997 -0.069 1.744 79.169 0.843 1 2 0.71244 1.025 -1.607280 -0.073090
S3 0 1.039 0 0.54579 1.293 1.286 -0.226 1.741 27.796 0.779 1 0 0.62523 1.195 0.193520 -0.078990
S4 0 1.292 0 0.74537 1.544 1.944 -0.264 1.533 21.392 0.739 1 0 0.79094 1.045 -1.123750 -0.029420
S5 1 0.985 1 0.54227 1.584 0.720 0.583 2.099 38.281 0.803 1 1 0.66500 0.451 -1.338110 -0.132950
S6 0 1.070 0 0.56460 1.612 0.856 -0.286 1.611 22.943 0.780 1 0 0.56905 1.109 0.237190 -0.042090
S7 0 1.450 0 0.59223 1.749 1.551 0.045 1.595 77.611 0.752 1 1 0.64294 0.880 1.631040 -0.069570
S8 0 1.036 0 0.62500 1.608 1.174 -0.126 1.392 27.412 0.798 1 0 0.63984 0.883 0.170210 -0.040570
S9 0 1.304 0 0.71505 1.514 1.425 -0.114 1.505 16.576 0.722 1 0 0.68154 0.835 0.398390 -0.070700
S10 0 0.936 0 0.56702 1.612 1.298 -0.189 1.523 23.727 0.811 1 0 0.58322 0.977 -0.936620 -0.041300
S11 1 1.087 0 0.52100 1.573 0.749 0.075 2.234 40.788 0.776 1 0 0.53846 0.694 0.145400 -0.106460
S12 1 1.594 1 0.56816 1.613 0.861 0.602 1.772 38.633 0.700 1 0 0.62714 0.530 2.081220 -0.099730
S13 0 1.195 0 0.59889 1.352 1.191 -0.319 2.078 25.460 0.713 1 0 0.60528 1.118 -2.138410 -0.098420
S14 1 1.094 0 0.58108 1.352 1.567 0.489 2.027 35.237 0.761 1 1 0.64787 0.589 -0.722660 -0.110150
S15 1 1.373 1 0.46750 1.638 0.214 0.521 1.855 43.455 0.732 1 0 0.50407 0.574 4.123080 -0.158100
S16 1 2.415 0 0.58138 1.573 0.404 0.420 1.830 85.550 0.549 1 0 0.59030 0.612 1.236140 -0.124860
S17 1 1.704 0 0.67490 1.618 0.000 0.464 1.603 47.368 0.659 1 0 0.63043 0.748 -0.380620 -0.118850
S18 1 1.458 0 0.61606 1.610 0.478 0.266 1.963 55.290 0.716 1 0 0.65154 1.036 2.980890 -0.109390
S19 0 1.141 0 0.65606 1.622 0.944 0.034 1.675 46.432 0.787 1 0 0.63715 1.281 0.311890 -0.055240
S20 1 1.070 1 0.54713 1.491 0.797 0.577 1.990 41.204 0.780 1 0 0.62368 0.550 4.000380 -0.155860
S21 1 1.322 0 0.60026 1.573 0.585 0.255 1.771 41.799 0.742 1 0 0.58521 0.921 1.545080 -0.097830
S22 1 1.089 1 0.55325 1.352 1.321 0.237 1.774 25.054 0.780 1 0 0.61659 0.798 0.944890 -0.096630
S23 1 1.306 0 0.59623 1.539 0.712 0.526 1.623 23.938 0.731 1 0 0.64205 0.520 -0.367350 -0.103160
S24 0 1.243 0 0.69007 1.609 0.587 -0.331 1.618 22.658 0.749 1 1 0.69461 1.123 -1.942790 -0.054190
S25 1 1.074 0 0.72344 1.550 0.663 0.102 1.842 48.894 0.789 1 0 0.65992 1.054 2.037700 -0.103910
S26 1 1.395 0 0.61217 1.619 1.033 0.573 1.789 110.450 0.734 1 2 0.61576 0.546 -0.624090 -0.116590
S27 1 0.767 0 0.60677 1.194 0.000 -0.118 2.040 11.752 0.804 0 0 0.60351 1.270 0.617640 -0.086680
S28 1 1.233 1 0.60416 1.499 1.033 0.199 2.190 36.819 0.743 1 0 0.62411 0.800 1.475000 -0.119280
S29 0 1.101 0 0.73474 1.400 1.856 -0.138 1.618 22.657 0.776 1 0 0.77373 1.062 0.182390 -0.046660
S30 0 0.713 0 0.73721 1.268 0.969 -0.302 1.880 14.701 0.835 1 1 0.75568 1.255 0.529070 -0.060510
S31 0 1.075 0 0.57282 1.352 1.149 -0.172 2.339 45.612 0.751 1 0 0.58139 0.793 -1.625980 -0.111580
S32 0 1.307 0 0.74465 1.436 0.538 -0.259 1.893 43.622 0.725 1 2 0.74841 1.227 -0.246420 -0.103380
S33 1 1.360 1 0.58166 1.573 0.400 0.589 1.933 43.570 0.728 1 0 0.54881 0.542 4.435720 -0.160140
S34 1 1.240 1 0.64884 1.619 1.366 0.602 1.767 51.162 0.760 1 1 0.63929 0.474 0.927430 -0.118990
S35 1 0.980 0 0.52665 1.638 0.763 0.360 2.047 46.009 0.812 1 0 0.59382 0.522 -3.649200 -0.115790
S36 1 1.275 0 0.51696 1.562 0.765 0.587 2.016 45.871 0.743 1 2 0.61091 0.454 -1.163310 -0.143840
S37 0 1.399 0 0.66653 1.539 0.572 0.041 1.816 94.281 0.731 1 1 0.66938 0.998 -0.190720 -0.096860
S38 1 0.796 0 0.56898 1.629 0.659 -0.013 1.971 46.119 0.842 1 0 0.53594 0.611 0.305660 -0.115290
S39 0 1.142 0 0.59999 1.611 0.912 -0.240 1.526 23.591 0.767 1 0 0.59933 1.055 -0.619230 -0.042210
S40 0 1.901 0 0.81259 1.605 0.433 -0.458 1.612 30.385 0.629 1 0 0.78524 1.258 0.333200 -0.063970
S41 1 1.280 1 0.56547 1.573 1.117 0.584 1.900 25.795 0.748 1 0 0.54426 0.504 0.757410 -0.126070
S42 1 2.320 0 0.59438 1.641 0.676 0.342 1.880 51.858 0.582 1 0 0.65034 0.710 2.479250 -0.098010
S43 0 1.009 1 0.61384 1.353 1.036 -0.094 2.031 55.948 0.790 1 0 0.61061 0.851 -0.134520 -0.110100
S44 0 1.706 0 0.76611 1.580 0.624 -0.266 1.977 95.805 0.677 1 2 0.75668 1.241 0.115560 -0.093930
S45 0 1.213 0 0.69204 1.348 0.000 -0.398 2.084 43.190 0.719 0 0 0.68099 1.188 -1.923128 -0.103078

Lastly, the y-values that we are trying to predict are also available:

In [11]:
qdb_o.getyvals()

Out[11]:
48-h Honey bee toxicity as -log(LD50)
Compound Id
S1 2.188
S2 2.773
S3 2.830
S4 3.239
S5 2.881
S6 2.717
S7 3.409
S8 3.141
S9 3.158
S10 3.789
S11 3.776
S12 2.918
S13 2.649
S14 3.290
S15 3.126
S16 1.273
S17 1.639
S18 2.898
S19 3.038
S20 4.194
S21 2.967
S22 2.959
S23 1.877
S24 1.199
S25 3.044
S26 3.221
S27 2.019
S28 3.369
S29 2.991
S30 3.006
S31 2.902
S32 2.669
S33 3.228
S34 3.023
S35 1.930
S36 3.162
S37 2.312
S38 3.597
S39 2.190
S40 1.105
S41 3.244
S42 2.479
S43 2.429
S44 2.058
S45 0.637

I hope that this code would be of use to someone else who's trying to replicate QSARs and doesn't have enough time to write their own parser. Please let me know if you try out this code and whether you can suggest any improvements and changes!