Clustering with md.rmsd() and scipy.cluster.hierarchy()ΒΆ

In this example, we cluster our alanine dipeptide trajectory using the RMSD distance metric and hierarchical clustering.

In [1]:
from __future__ import print_function
%matplotlib inline
import mdtraj as md
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy
from scipy.spatial.distance import squareform
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

Let's load up our trajectory. This is the trajectory that we generated in the "Running a simulation in OpenMM and analyzing the results with mdtraj" example. The first step is to build the rmsd cache, which precalculates some values for the RMSD computation.

In [2]:
traj = md.load('ala2.h5')

Lets compute all pairwise rmsds between conformations.

In [3]:
distances = np.empty((traj.n_frames, traj.n_frames))
for i in range(traj.n_frames):
    distances[i] = md.rmsd(traj, traj, i)
print('Max pairwise rmsd: %f nm' % np.max(distances))
Max pairwise rmsd: 0.188493 nm

scipy.cluster implements the average linkage algorithm (among others)

In [4]:
# Clustering only accepts reduced form. Squareform's checks are too stringent
assert np.all(distances - distances.T < 1e-6)
reduced_distances = squareform(distances, checks=False)
In [5]:
linkage = scipy.cluster.hierarchy.linkage(reduced_distances, method='average')

Lets plot the resulting dendrogram.

In [6]:
plt.title('RMSD Average linkage hierarchical clustering')
_ = scipy.cluster.hierarchy.dendrogram(linkage, no_labels=True, count_sort='descendent')

(clustering.ipynb; clustering_eval.ipynb; clustering.py)