## Cross-validating tICA and MSMs

Some of our new work at the intersection of chemical physics and machine
learning on the construction of Markov models is now out on arXiv and
under review. The title of our manuscript is **"Variational cross-validation of
slow dynamical modes in molecular kinetics"**.

The question that lead down this road was

"How do we perform cross-validation on tICA and MSMs (3, 4, 1)?"

### Brief digression: What's tICA?

tICA is a linear dimensionality reduction method for multivariate
timeseries data that seeks to find the most slowly decelerating linear
functions of a set of input timeseries, $\{X_t\}$. You can see the papers
for more detail, but the gist of it comes down to solving a generalized
eigenvalue equation,
$$
C(\tau) v = \lambda \Sigma v,
$$
where $C(\tau)$ is the *time lagged correlation matrix*, which is a matrix containing
the average correlation of the $i$th component of the signal at time $t$ with
the $j$th component of the signal at time $t+\tau$, and $\Sigma$ is the covariance --
basically same idea, but without the time delay.

In our application domain, protein dynamics, the input signals $\{X_t\}$ are usually some pre-processed (featurized) projection of our real data. We know that there are slowly-decorrelating degrees of freedom (e.g. reaction coordinates) in our data, but since tICA is restricted to identifying linear projections, preprocessing via nonlinear featurizations increases the power of the method.

### Back to cross-validation

The eigenvectors which come out, $v$, are learned parameters that are estimated subject to statistical noise. Empirically, it's obvious that some featurizations perform better than others, and that Tikhonov regularization on $\Sigma$ can be critical (i.e. adding some multiple of the identity to $\Sigma$). So we're definitely going to need some procedure for scoring the set of tICS generated from one training dataset on a new test dataset.

### Why isn't this obvious?

As presented in the papers that introduce the method (3, 4), solving the
tICA problem is a stepwise procedure -- you get one tIC at a time by a stepwise
maximization of a series of autocorrelation functions **with mounting, data
dependent orthogonality constraints**. The significance of this is that it's
not obvious that there is any single unified objective function for the collection
of $n>1$ tICs.

The problem is that during training, the tICs constructed such that
$v_i^T \Sigma v_j = \delta_{ij}$. But since $\Sigma$ is estimated from data, if
you try to go back and *score* a set of tICs on new data, you need to reestimate
$\Sigma$, at which point the tICs no longer orthogonal, which is essential.
To quote from the paper.

While [previous] formulation[s] involves the stepwise optimization individual

ansatzeigenfunctions with mounting orthogonality constraints, our approach arrives at the same result during training via the optimization of a single scalar functional of a collection of $m$ansatzeigenfunctions simultaneously. This formulation uniquely enables the evaluation of the proposed eigenfunctions on new data which was held out during the fitting step, which we show to be essential to avoid overfitting.

### So what's the solution?

You'll have to check out the manuscript :) Or perhaps check back here for another post.

**
**