Finding centroids of clustersΒΆ

Finding centroids

In this example, we're going to find a "centroid" (representitive structure) for a group of conformations. This group might potentially come from clustering, using method like Ward hierarchical clustering.

Note that there are many possible ways to define the centroids. This is just one.

In [1]:
from __future__ import print_function
%matplotlib inline
import mdtraj as md
import numpy as np

Load up a trajectory to use for the example.

In [2]:
traj = md.load('ala2.h5')
print(traj)
<mdtraj.Trajectory with 100 frames, 22 atoms, 3 residues, without unitcells>

Lets compute all pairwise rmsds between conformations.

In [3]:
atom_indices = [a.index for a in traj.topology.atoms if a.element.symbol != 'H']
distances = np.empty((traj.n_frames, traj.n_frames))
for i in range(traj.n_frames):
    distances[i] = md.rmsd(traj, traj, i, atom_indices=atom_indices)

The algorithim we're going to use is relatively simple:

  • Compute all of the pairwise RMSDs between the conformations. This is O(N^2), so it's not going to scale extremely well to large datasets.
  • Transform these distances into similarity scores. Our similarities will calculated as $$ s_{ij} = e^{-\beta \cdot d_{ij} / d_\text{scale}} $$ where $s_{ij}$ is the pairwise similarity, $d_{ij}$ is the pairwise distance, and $d_\text{scale}$ is the standard deviation of the values of $d$, to make the computation scale invariant.
  • Then, we define the centroid as $$ \text{argmax}_i \sum_j s_{ij} $$

Using $\beta=1$, this is implemented with the following code:

In [4]:
beta = 1
index = np.exp(-beta*distances / distances.std()).sum(axis=1).argmax()
print(index)
83
In [5]:
centroid = traj[index]
print(centroid)
<mdtraj.Trajectory with 1 frames, 22 atoms, 3 residues, without unitcells>

(centroids.ipynb; centroids_evaluated.ipynb; centroids.py)

Versions