We used the GPCR SARfari database as our training data set. GPCR SARfari is a public, web-accessible database of measured binding affinities,
focusing chiefly on the interactions between candidate GPCR targets and small, drug-like chemicals.
Activity data were filtered to keep only end-points reported as a half-maximal inhibitory concentration (IC50), half-maximal effective
concentration (EC50) or Ki value. To ensure that enough molecules were available for model building, we pre-selected only those
targets with more than 75 biological activity data points. Following this procedure,
112,434 compounds associated with 237 target proteins remained, with 222,020 activity end-points, which were used for model building.
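The filtering step described above can be sketched roughly as follows. This is only an illustration: the record layout `(compound_id, target_id, activity_type, value_uM)` and the function name `filter_activities` are assumptions and do not come from GPCR SARfari or the original pipeline.

```python
from collections import defaultdict

# Hypothetical record layout: (compound_id, target_id, activity_type, value_uM).
KEPT_TYPES = {"IC50", "EC50", "Ki"}
MIN_POINTS = 75  # a target is kept only if it has MORE than this many end-points


def filter_activities(records, kept_types=KEPT_TYPES, min_points=MIN_POINTS):
    """Keep IC50/EC50/Ki end-points, then drop targets with too few of them."""
    by_target = defaultdict(list)
    for record in records:
        _compound_id, target_id, activity_type, _value = record
        if activity_type in kept_types:
            by_target[target_id].append(record)
    # Only targets with more than `min_points` retained end-points survive.
    return {t: recs for t, recs in by_target.items() if len(recs) > min_points}
```

Counting is done after the end-point type filter here, so a target needs more than 75 IC50/EC50/Ki measurements; whether the original work counted before or after that filter is not stated.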
Preparation of the positive and negative set
For compounds with more than one activity value, we took the mean of those values as the final activity value.
A compound was considered active when its mean activity value was below 10 μM. All compounds with values above 10 μM were considered inactive.
Following this split, some human proteins may have very few negative samples. To balance the numbers of positive and negative
samples for each human protein, we randomly selected a number of compounds from other human proteins to serve as additional negative
samples for that protein. The number of these selected negatives together with the inactive compounds should be approximately equal to the number
of active compounds for the protein. The prepared positive and negative sets were used for subsequent model building.
The SMILES representations of the compounds in the positive and negative sets for each human protein can be downloaded from the GPCRnet website.
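A minimal sketch of the labelling and balancing procedure above is given here. The function names and data layout are hypothetical. Two details are assumptions that the text does not settle: a compound with a mean of exactly 10 μM is treated as inactive, and compounds already measured against the protein are excluded from the pool of borrowed negatives.

```python
import random
from collections import defaultdict
from statistics import mean

ACTIVE_CUTOFF_UM = 10.0  # mean activity below this value counts as active


def label_compounds(measurements):
    """measurements: iterable of (compound_id, value_uM) for ONE protein.

    Returns (actives, inactives) as sets of compound ids.
    """
    values = defaultdict(list)
    for compound_id, value in measurements:
        values[compound_id].append(value)
    actives, inactives = set(), set()
    for compound_id, vals in values.items():
        # Assumption: exactly 10 uM falls on the inactive side.
        target_set = actives if mean(vals) < ACTIVE_CUTOFF_UM else inactives
        target_set.add(compound_id)
    return actives, inactives


def pad_negatives(actives, inactives, other_protein_compounds, seed=0):
    """Borrow random compounds from other proteins until negatives ~ positives."""
    shortfall = len(actives) - len(inactives)
    if shortfall <= 0:
        return set(inactives)
    # Assumption: never borrow a compound already measured on this protein.
    pool = sorted(set(other_protein_compounds) - actives - inactives)
    extra = random.Random(seed).sample(pool, min(shortfall, len(pool)))
    return set(inactives) | set(extra)
```

The fixed `seed` only makes the sketch reproducible; the source says the selection was random and gives no further detail.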
Model training and validation
A series of high-confidence QSAR models were built using GPCR SARfari. Naïve Bayes models were built with different fingerprint representations for the 237 GPCR proteins. The Naïve Bayes method was chosen for drug-target interaction (DTI) profiling because it offers both good performance on noisy data sets and a high speed of calculation. To obtain the best model performance, we compared 11 types of molecular fingerprints when establishing the prediction models: FP2, MACCS, FP3, FP4, Daylight, ECFP2-1024, ECFP4-1024, ECFP6-1024, ECFP2-2048, ECFP4-2048, and ECFP6-2048. To further improve predictive ability, we also built an ensemble that averages the outputs of all fingerprint models. For each model, we applied
five-fold cross-validation and external validation to evaluate prediction performance.
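To make the modelling step concrete, the sketch below trains one Naïve Bayes model per fingerprint type and averages their outputs. It assumes a Bernoulli Naïve Bayes over 0/1 fingerprint bits with Laplace smoothing; the text does not say which Naïve Bayes variant was used, and the class and function names are illustrative only.

```python
import math


class BernoulliNB:
    """Bernoulli Naive Bayes over binary fingerprints (an assumed variant)."""

    def fit(self, X, y):
        # X: list of equal-length 0/1 bit lists; y: list of 0/1 labels.
        n_bits = len(X[0])
        self.log_prior, self.log_on, self.log_off = {}, {}, {}
        for c in (0, 1):
            rows = [x for x, label in zip(X, y) if label == c]
            # Laplace smoothing keeps every probability strictly inside (0, 1).
            self.log_prior[c] = math.log((len(rows) + 1) / (len(X) + 2))
            p_on = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                    for j in range(n_bits)]
            self.log_on[c] = [math.log(p) for p in p_on]
            self.log_off[c] = [math.log(1 - p) for p in p_on]
        return self

    def predict_proba(self, x):
        """Return the posterior probability that fingerprint x is active."""
        score = {}
        for c in (0, 1):
            s = self.log_prior[c]
            for j, bit in enumerate(x):
                s += self.log_on[c][j] if bit else self.log_off[c][j]
            score[c] = s
        top = max(score.values())  # subtract the max for numerical stability
        e0, e1 = math.exp(score[0] - top), math.exp(score[1] - top)
        return e1 / (e0 + e1)


def ensemble_score(models, fingerprints):
    """Average the active-class probability over all fingerprint models.

    models, fingerprints: dicts keyed by fingerprint name (e.g. "MACCS").
    """
    total = sum(models[name].predict_proba(fingerprints[name]) for name in models)
    return total / len(models)
```

In use, `models` would hold one fitted model per fingerprint type for a given protein, and `fingerprints` the corresponding bit vectors of the query compound.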
For five-fold cross-validation, the data set is first split into five roughly equal-sized parts;
we then fit the model to four parts of the data and calculate the error rate on the remaining part.
The process is repeated five times so that every part serves once as the validation set. To assess the stability of the models,
we repeated the cross-validation procedure 10 times and report the standard deviation of each statistic. For external validation,
the data were split into two parts: compounds were clustered and each cluster was assigned a number. Clusters with an
odd number were assigned to the test set, and clusters with an even number were assigned to the training set.
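The two validation schemes just described can be sketched as below. The `evaluate` callback, the function names, and the use of a different shuffle seed per repeat are assumptions made for illustration; the clustering method itself is not specified in the text, so cluster numbers are simply taken as given.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal-sized folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def repeated_cv(n_samples, evaluate, k=5, repeats=10):
    """Run k-fold CV `repeats` times; return (mean, std) of the per-run score.

    evaluate(train_idx, test_idx) is a hypothetical callback that fits a
    model on the training indices and returns one statistic for the test fold.
    """
    per_run = []
    for r in range(repeats):
        scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k, seed=r)]
        per_run.append(mean(scores))
    return mean(per_run), stdev(per_run)


def split_by_cluster_parity(cluster_of):
    """cluster_of: dict compound_id -> cluster number.

    Even-numbered clusters go to training, odd-numbered clusters to test.
    """
    train = [c for c, n in cluster_of.items() if n % 2 == 0]
    test = [c for c, n in cluster_of.items() if n % 2 == 1]
    return train, test
```

Splitting by whole clusters rather than by single compounds keeps structurally similar compounds on the same side of the split, which makes the external test harder than a random hold-out.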
Models were built with the training set and then used to score the test set. Finally, a model was built with all data and scored against
itself; the training set and the whole set should give similar validation statistics. Statistics on the performance
of the models were reported, including those commonly used for classification: accuracy, sensitivity, specificity,
AUC, Matthews correlation coefficient (MCC) and F-score. The cut-off giving the best MCC value was adopted, as such cut-offs
have been shown to provide better performance. Furthermore, two analyses were used to assess the performance of the different models.
The first analysis provides an overall score and does not require a cut-off for distinguishing active from inactive compounds:
the area under the receiver operating characteristic (ROC) curve indicates the ability of the model to prioritize active
compounds over inactive compounds.
The ROC curve is the plot of the true positive rate versus the false positive rate.
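The MCC-based cut-off selection and the cut-off-free AUC mentioned above can be illustrated with the following sketch. It is not the original evaluation code: the function names are hypothetical, candidate cut-offs are taken to be the observed scores, and the AUC is computed through its rank interpretation (the probability that a random active outscores a random inactive, ties counting one half).

```python
import math


def confusion(scores, labels, cutoff):
    """Count (tp, tn, fp, fn), predicting active when score >= cutoff."""
    tp = tn = fp = fn = 0
    for s, y in zip(scores, labels):
        predicted_active = s >= cutoff
        if predicted_active and y:
            tp += 1
        elif predicted_active:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_mcc_cutoff(scores, labels):
    """Return the observed score that maximises MCC when used as cut-off."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda c: mcc(*confusion(scores, labels, c)))


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of actives/inactives."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of compounds, which is acceptable for a sketch; a sort-based computation would be preferable for data sets of the size described here.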
Copyright @ 2012-2015 Computational Biology & Drug Design Group,
School of Pharmaceutical Sciences, Central South University.
All rights reserved.