Finding centroids of clusters¶

centroids_evaluated

Finding centroids¶

In this example, we're going to find a "centroid" (representitive structure) for a group of conformations. This group might potentially come from clustering, using method like Ward hierarchical clustering.

Note that there are many possible ways to define the centroids. This is just one.

In [1]:

%pylab inline
import mdtraj as md
import numpy as np

Populating the interactive namespace from numpy and matplotlib

Load up a trajectory to use for the example.

In [2]:

traj = md.load('ala2.h5')
print traj

<mdtraj.Trajectory with 100 frames, 22 atoms, 3 residues, without unitcells>

In [3]:

# Lets compute all pairwise rmsds between conformations.
atom_indices = [a.index for a in traj.topology.atoms if a.element.symbol != 'H']
distances = np.empty((traj.n_frames, traj.n_frames))
for i in range(traj.n_frames):
    distances[i] = md.rmsd(traj, traj, i, atom_indices=atom_indices)

The algorithim we're going to use is relatively simple:

Compute all of the pairwise RMSDs between the conformations. This is O(N^2), so it's not going to scale extremely well to large datasets.
Transform these distances into similarity scores. Our similarities will calculated as $$ s_{ij} = e^{-\beta \cdot d_{ij} / d_\text{scale}} $$ where $s_{ij}$ is the pairwise similarity, $d_{ij}$ is the pairwise distance, and $d_\text{scale}$ is the standard deviation of the values of $d$, to make the computation scale invariant.
Then, we define the centroid as $$ \text{argmax}_i \sum_j s_{ij} $$

Using $\beta=1$, this is implemented with the following code:

In [4]:

beta = 1
index = np.exp(-beta*distances / distances.std()).sum(axis=1).argmax()
print index

In [5]:

centroid = traj[index]
print centroid

<mdtraj.Trajectory with 1 frames, 22 atoms, 3 residues, without unitcells>

In [6]:

(centroids.ipynb; centroids_evaluated.ipynb; centroids.py)