Principal component analysis (PCA) with `scikit-learn`

scikit-learn is a premier machine learning library for Python, with a very easy-to-use API and great documentation. This example uses it to run PCA on a molecular dynamics trajectory loaded with mdtraj.

In [1]:

%pylab inline
import mdtraj as md
from sklearn.decomposition import PCA

Populating the interactive namespace from numpy and matplotlib

In [2]:

# Let's load up our trajectory. This is the trajectory that we generated in
# the "Running a simulation in OpenMM and analyzing the results with mdtraj"
# example.

In [3]:

traj = md.load('ala2.h5')
print(traj)

<mdtraj.Trajectory with 100 frames, 22 atoms, 3 residues, without unitcells>

/Users/rmcgibbo/miniconda/envs/2.7.9/lib/python2.7/site-packages/mdtraj-0.8.0-py2.7-macosx-10.5-x86_64.egg/mdtraj/formats/hdf5.py:330: UserWarning: No resSeq information found in HDF file, defaulting to zero-based indices
  warnings.warn('No resSeq information found in HDF file, defaulting to zero-based indices')

In [4]:

# Create a two-component PCA model and project our data down into this
# reduced-dimensional space. When using raw Cartesian coordinates as
# input to PCA, it's important to align the frames first.

pca1 = PCA(n_components=2)
# Superpose every frame onto frame 0 of the same trajectory.
traj.superpose(traj, 0)

Out[4]:

<mdtraj.Trajectory with 100 frames, 22 atoms, 3 residues, without unitcells at 0x109003590>
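
mdtraj's `superpose` handles the alignment for you. To illustrate what a rigid-body superposition does, here is a minimal NumPy sketch based on the Kabsch algorithm. It is not necessarily the algorithm mdtraj uses internally; the function name `kabsch_superpose` and the synthetic coordinates are assumptions made for this illustration.

```python
import numpy as np


def kabsch_superpose(mobile, reference):
    """Rigidly align `mobile` onto `reference` (both (n_atoms, 3) arrays).

    Hypothetical helper for illustration only; not part of mdtraj.
    """
    # Remove translation by centering both structures on their centroids.
    mob_c = mobile - mobile.mean(axis=0)
    ref_c = reference - reference.mean(axis=0)
    # The SVD of the 3x3 covariance matrix yields the optimal rotation.
    u, _, vt = np.linalg.svd(np.dot(mob_c.T, ref_c))
    # Flip one axis if needed so the result is a proper rotation,
    # not a reflection.
    d = np.sign(np.linalg.det(np.dot(u, vt)))
    rot = np.dot(u, np.dot(np.diag([1.0, 1.0, d]), vt))
    # Rotate the centered structure, then move it onto the reference centroid.
    return np.dot(mob_c, rot) + reference.mean(axis=0)


# Synthetic stand-in for one frame of a 22-atom molecule (assumed data).
rng = np.random.RandomState(0)
reference = rng.rand(22, 3)

# Build a rotated and translated copy of the reference.
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
mobile = np.dot(reference, rot_z) + np.array([1.0, -2.0, 0.5])

aligned = kabsch_superpose(mobile, reference)
rmsd = np.sqrt(((aligned - reference) ** 2).sum(axis=1).mean())
print(rmsd)  # close to zero: the rigid motion has been removed
```

Without this step, overall rotation and translation of the molecule would show up as variance in the Cartesian coordinates and could dominate the leading principal components.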

In [5]:

# Flatten each frame's (n_atoms, 3) coordinates into a single feature vector.
reduced_cartesian = pca1.fit_transform(traj.xyz.reshape(traj.n_frames, traj.n_atoms * 3))
print(reduced_cartesian.shape)

(100, 2)
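
As a rough illustration of what the `reshape` plus `fit_transform` step computes, the sketch below performs PCA with a plain NumPy SVD on synthetic data. The helper `pca_project` and the random `fake_xyz` array are assumptions for this example; scikit-learn's `PCA` may differ in the sign of each component and in the solver it picks.

```python
import numpy as np


def pca_project(xyz, n_components=2):
    """Project (n_frames, n_atoms, 3) coordinates onto the leading PCs.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    n_frames = xyz.shape[0]
    # Flatten each frame into one row of length n_atoms * 3.
    X = xyz.reshape(n_frames, -1)
    # PCA operates on mean-centered features.
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes; u * s gives the projected data.
    u, s, vt = np.linalg.svd(X_centered, full_matrices=False)
    projection = u[:, :n_components] * s[:n_components]
    # Fraction of the total variance captured by each component.
    explained = (s ** 2) / (s ** 2).sum()
    return projection, explained[:n_components]


# Synthetic stand-in with the same shape as the trajectory above.
rng = np.random.RandomState(0)
fake_xyz = rng.rand(100, 22, 3)

projection, explained = pca_project(fake_xyz)
print(projection.shape)  # (100, 2)
```

The two columns of `projection` play the role of PC1 and PC2 in the scatter plot, and `explained` indicates how much of the total variance those two components retain.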

In [6]:

# Now we can plot the data on this projection.

figure()
scatter(reduced_cartesian[:, 0], reduced_cartesian[:, 1], marker='x', c=traj.time)
xlabel('PC1')
ylabel('PC2')
title('Cartesian coordinate PCA: alanine dipeptide')
cbar = colorbar()
cbar.set_label('Time [ps]')

Let's cross-check our result using a different feature space that isn't sensitive to alignment. Instead of Cartesian coordinates, we "featurize" the trajectory by computing the distance between every pair of atoms in each frame, and use those distances as the high-dimensional input space for PCA.

In [7]:

from itertools import combinations

pca2 = PCA(n_components=2)
# combinations() yields every unique pair of elements from a sequence
atom_pairs = list(combinations(range(traj.n_atoms), 2))
pairwise_distances = md.geometry.compute_distances(traj, atom_pairs)
print(pairwise_distances.shape)
reduced_distances = pca2.fit_transform(pairwise_distances)

(100, 231)
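
The 231 features follow from the number of unique atom pairs, 22 * 21 / 2. The NumPy sketch below reproduces that count on synthetic coordinates and checks that the distance features do not change when every frame is rotated and translated, which is why no alignment is needed here. The helper `pairwise_distance_features` and the random data are assumptions for illustration, and the sketch ignores any periodic boundary handling.

```python
import numpy as np
from itertools import combinations


def pairwise_distance_features(xyz):
    """Return an (n_frames, n_pairs) array of interatomic distances.

    Hypothetical helper for illustration; it does not handle unit cells.
    """
    n_atoms = xyz.shape[1]
    # All unique (i, j) index pairs with i < j.
    pairs = np.array(list(combinations(range(n_atoms), 2)))
    # Displacement vector for every pair in every frame.
    diff = xyz[:, pairs[:, 0], :] - xyz[:, pairs[:, 1], :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Synthetic stand-in with the same shape as the trajectory above.
rng = np.random.RandomState(0)
fake_xyz = rng.rand(100, 22, 3)

features = pairwise_distance_features(fake_xyz)
print(features.shape)  # (100, 231)

# Apply the same rigid-body motion to every frame.
theta = 1.1
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
moved_xyz = np.dot(fake_xyz, rot_z) + np.array([3.0, -1.0, 2.0])

# Distances are unchanged by rotation and translation.
unchanged = np.allclose(features, pairwise_distance_features(moved_xyz))
print(unchanged)  # True
```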

In [8]:

figure()
scatter(reduced_distances[:, 0], reduced_distances[:, 1], marker='x', c=traj.time)
xlabel('PC1')
ylabel('PC2')
title('Pairwise distance PCA: alanine dipeptide')
cbar = colorbar()
cbar.set_label('Time [ps]')

