Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Deep Learning for Drug Discovery With Keras

DZone's Guide to

Deep Learning for Drug Discovery With Keras

Learn how you can use Qubole Data Service (QDS) and Keras to minimize the time and operating expenses incurred in maintaining and updating drug discovery models.

· AI Zone ·
Free Resource

Insight for I&O leaders on deploying AIOps platforms to enhance performance monitoring today. Read the Guide.

Drug discovery is the process of identifying molecular compounds that are likely to become the active ingredient in prescription medicine. At a high level, it works by taking a set of candidate compounds (either synthetic or naturally derived) and evaluating their chemical reactions with (an often cloned) molecule that is largely correlated with a particular disease. Machine learning, and deep learning, in particular, have been highly successful in predicting the chemical reactions between candidate compounds and target molecules (see some examples here, here, and here). These models have enabled biomedical engineers to rapidly iterate on the design of new synthetic compounds by querying a trained deep neural network to estimate how a candidate compound would interact with a target molecule. This enables the largest pharmaceutical companies, such as Merck, to significantly reduce their drug discovery costs.

Applications of machine learning such as these typically require a cluster of GPU machines to perform parallel model training, parallel model selection, or both. Cloud providers minimize the capital expenses otherwise incurred on initial deployment of such clusters. Qubole Data Service (QDS) minimizes the time and operating expenses otherwise incurred in maintaining and updating such infrastructure. This demonstration, therefore, utilizes the framework developed in my log post on distributed deep learning within QDS.

Goal: Drug Discovery Using Deep Learning on the Merck Molecular Activity Dataset

Upon completion of this post, an enterprising data scientist should have trained at least one deep neural network to predict molecular activity on the Merck dataset. Within a Qubole notebook, you will cross-validate your DNNs using the Spark ML Pipeline interface while maintaining the big data best practices implemented in QDS. The resulting prediction quality will be reported as an R2 score, as well as visually inspected with a scatter plot overlaying the model outputs with the corresponding labels.

The convergence of training will also be verified by plotting the loss function of the best cross-validated model, as it is retrained on training data plus holdout data. It's important to note that the loss function decreases towards a non-unique minimum. There may be better minima in the cost function topology. 

R2 score on holdout validation data may be estimated using the following example code. Note: This quantitative metric lacks the ability to describe the quality of the trained model (i.e. favoring predictions around 4.3 as seen in the labels, but overestimating molecular activity on aggregate).

##
# `cv_model` and `evaluator` both defined in example code to cross validate
##
df_predictions = cv_model.transform(df_holdout)
r2_score = evaluator.evaluate(df_predictions)
print(r2_score)
0.915

Last but not least, GPU utilization and decreasing loss function values may be viewed during training by inspecting the Spark executor log file output. Instructions for navigating to these logs can be found at the last part of this article.

Example Code to Cross-Validate Your Drug Discovery Models

Working within the framework laid out in my post on distributed deep learning within QDS, you must specify your input dataset and which columns to rename, then exclude from the training features. This example uses the ACT01 dataset from the Merck molecular activity challenge.

base_dir = "//fully/qualified/path/to/root/folder/of/dataset/"

excluded_columns = ('MOLECULE', 'label')
columns_renamed = (('Act', 'label'), )

train_set = "ACT01_training.csv"
test_set = "ACT01_testing.csv"

num_workers = 50

The base directory will be some persistent cloud storage, such as Amazon S3 or Azure Blobs, and the number of workers must be identical to the number of Spark executors currently running in your cluster. The next step is to ingest DataFrames from the CSV data and to define the parameter grid over which you will cross-validate your drug discovery models. Here is a reasonable grid with which to begin on the ACT01 dataset

df_train = process_csv(base_dir + train_set,
                       columns_renamed=columns_renamed,
                       excluded_columns=excluded_columns,
                       num_workers=num_workers)
df_test = process_csv(base_dir + test_set,
                      columns_renamed=columns_renamed,
                      excluded_columns=excluded_columns,
                      num_workers=num_workers)

input_dim = 9491

param_grid = tuning.ParamGridBuilder() \
                   .baseOn(['regularizer', regularizers.l1_l2]) \
                   .addGrid('activations', [['tanh', 'relu']]) \
                   .addGrid('initializers', [['glorot_normal',
                                              'glorot_uniform']]) \
                   .addGrid('layer_dims', [[input_dim, 2000, 300, 1]]) \
                   .addGrid('metrics', [['mae']]) \
                   .baseOn(['learning_rate', 1e-2]) \
                   .baseOn(['reg_strength', 1e-2]) \
                   .baseOn(['reg_decay', 0.25]) \
                   .baseOn(['lr_decay', 0.90]) \
                   .addGrid('dropout_rate', [0.20, 0.35, 0.50, 0.65, 0.80]) \
                   .addGrid('loss', ['mse', 'msle']) \
                   .build()

Lastly, you need to define an estimator and an evaluation metric against which it will be cross-validated. All tuning will be done with the Spark ML pipelines, leveraging the framework here.

estimator = DistKeras(trainers.ADAG,
                      {'batch_size': 64,
                       'communication_window': 3,
                       'num_epoch': 5,
                       'num_workers': num_workers},
                      **param_grid[0])

evaluator = evaluation.RegressionEvaluator(metricName='r2')

cv_estimator = tuning.CrossValidator(estimator=estimator,
                                     estimatorParamMaps=param_grid,
                                     evaluator=evaluator,
                                     numFolds=4)
cv_model = cv_estimator.fit(df_train)

Example Code to Visualize Prediction Quality and Verify Decreasing Training Loss

The Spark ML Pipelines API enables us to compute R2 scores, run a trained model, and obtain predictions and provides us with a reference to the best Dist-Keras model from our cross-validation above. The code below demonstrates how to leverage all of these:

##
# equivalently:
#   dist_keras_model = cv_model.bestModel
#   df_predictions = dist_keras_model.transform(df_holdout)
##
df_predictions = cv_model.transform(df_holdout)
df_predictions = df_predictions.select('label', 'prediction').toPandas()
yt = df_predictions['label'].as_matrix()
yp = df_predictions['prediction'].as_matrix()

Note the selection of only the label and prediction columns prior to collecting into a local Pandas DataFrame. Random sampling 1K labels and predictions yields the qualitative comparison exemplified in the goals section of this demonstration.

fig = plt.figure()

idx = np.random.randint(df_predictions.count(), size=1000)
nx = np.arange(1000) + 1

axa = fig.add_subplot(121)
_ = axa.scatter(nx, yt[idx], s=20, facecolors='none', edgecolors='k')
_ = axa.scatter(nx, yp[idx], s=20, facecolors='none', edgecolors='r')
_ = axa.set_title('Qualitative Comparison: Predictions Versus Labels')
_ = axa.set_xticks([j for j in nx if not j % 100])
_ = axa.set_xticklabels([str(j) for j in nx if not j % 100])
_ = axa.set_xlabel('Molecule Subsample ID')
_ = axa.set_ylabel('Molecular Activity')
_ = axa.legend(('Label', 'Prediction'))

Dist-Keras provides us with the history of loss function values during training, as well as gives us access to the native Keras model trained in a data-parallel fashion. The latter provides a function that outputs the model layout in a format that is easy to plot. The code below leverages these features to complete our intended goal for this demonstration.

loss_and_metrics = dist_keras._trainer.get_averaged_history()
x = np.arange(loss_and_metrics.shape[0]) + 1
y = np.array(map(itemgetter(0), loss_and_metrics))

axb = fig.add_subplot(122)
_ = axb.plot(x, y, 'r-')
_ = axb.set_title('Training Loss Values')
_ = axb.set_xticks([j for j in x if not j % 50])
_ = axb.set_xticklabels([str(j) for j in x if not j % 50])
_ = axb.set_xlabel('training iteration')
_ = axb.set_ylabel('mean absolute percentage error')

fig.tight_layout()
show(fig)

##
# plot structure of best cross validated model to file
##
keras_model = dist_keras_model._keras_model
utils.plot_model(keras_model, to_file='drug_discovery_cv_model.png')

Viewing Logs During Model Training

Within your Qubole notebook, you have the ability to view Spark executor logs while your model is training. To view these logs in an overlaid frame, simply click on Job UI and follow the dark red arrows in the diagram below for subsequent clicks. Clicking on stderr will result in a log file similar to that in the Goals section of this post.

TrueSight is an AIOps platform, powered by machine learning and analytics, that elevates IT operations to address multi-cloud complexity and the speed of digital transformation.

Topics:
deep learning ,keras ,ai ,data science ,predictive analytics ,neural networks ,drug discovery ,qubole notebook ,data visualization ,tutorial

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}