Sunday, January 27, 2013

on 6 comments

A Mixture Model for Peak Alignment

1. Peak Identification

1.1. Summary

The same identification behaviour (figure 1 below) is now obtained between mzMatch and MzMine — as it should be, because the two software packages essentially perform identification based on the same algorithm: looping through all peaks and matching against the database entries based on their m/z and rt tolerance. The differences from my previous results were due to:
  1. Bugs in my codes when counting identified peaks, doing file conversion, etc.
  2. Slight differences in mzMatch vs. MzMine when computing mass tolerance during identification.
What I should do next for our peak identification experiment:
  1. In MzMine, I still have to click around the GUI too many times when doing the experiment. This is too tedious, so need to automate this process.
  2. Also need to make sure to repeat each identification step several times to get their average behaviour, particularly once we’ve incorporated Ronan’s code.
  3. Maybe come up with some better ways to count/quantify performance ?

Figure 1 Comparing the identification performance between mzMatch and MzMine. The two software tools work based on the same algorithm of matching peaks based on their m/z values and retention time. Here, retention time matching is disabled. The plot shows how true positives, false positives and the F1-score are affected as the number of additional ’noisy’ entries in the database is increased. We observe some differences between the two plots in the figure due to randomisation effect; I’d need to average each data point several times later.

1.2 The Long Story

Carrying on from last week, we want to know why the identification results from mzMatch vs. MzMine are different ? Apart from fixing the bugs in my codes, I also traced what happened when we do identification based on just one molecule in the database.

<compounds>
 <compound>
  <id>StdMix1_1</id>
  <name>Spermidine</name>
  <formula>C7H19N3</formula>
  <inchi/>
  <description></description>
  <synonyms/>
  <polarity>+</polarity>
  <retentiontime>44.76</retentiontime>
  <monoisotopicmass>145.157898</monoisotopicmass>
 </compound>
</compounds>
Appendix 1. mzMatch’s identification database in XML format with just one metabolite.
Consider just one adduct [M+H] to search during the identification step. We’re also going to ignore all the retention information. Compound StdMix1_1 in the database has a monoisotopic mass of 145.157898 (see appendix 1 above). mzMatch calculated that it’s protonated adduct ion has the m/z value of 146.165174. mzMatch converts the mass tolerance from relative to absolute value by calculating
range = ppm*(0.000001*mass)
With the relative mass tolerance at ppm=3, the absolute mass tolerance is 4.385E-4 or 0.0004385. This gives us the absolute mass range of (146.164736, 146.165613) for matching peaks in the data.
The mzMine custom database search function — the one we’re comparing against — has the options shown in figure 2 below. It accepts a CSV file of the list of compounds to identify. I wrote a small utility (CompoundToCsv) to convert from the XML format used by mzMatch into the CSV format expected by MzMine, with their appropriate adduct ions added ([M+H], [M-H] for now). I also modified mzMine to also accept an additional ’noisy’ database. This is used during identification experiments.


Figure 2 MzMine’s custom DB search. Red box is my addition.
Similar to mzMatch, MzMine processes the input CSV file line by line, matching all peaks against the specified m/z and rt values. For each peak row (which may consists of several aligned peaks), it constructs the tolerance range based on the average m/z and retention time. The mass tolerance during identification in MzMine is specified in both absolute (m/z) and relative (ppm) values. For our StdMix1_1 compound, MzMine calculated the mass tolerance — at 3 ppm or 0.001 m/z — to be (146.164161, 146.166161). This is slightly different from the tolerance obtained in mzMatch, which is (146.164736~146.165613).
Poking around the source codes, I realised that whichever value greater between the absolute and relative tolerance parameter is used for matching peaks during identification. The default values for these two parameters are 0.001 m/z and 5 ppm respectively. To equalise things for the purpose of experimental settings, I decided to set the absolute mass tolerance to 0 m/z, relative tolerance to 3 ppm and retention tolerance to some big number (since it can’t be disabled).
Extra: this function does not search for adducts peaks. Instead, searches for adduct peaks is usually another step performed before the identification. The output of this adduct search function is a set of annotations on the peaks that tells us which peak is the adduct peaks of other peaks (figure 3). This information is not used during identification at all.

Figure 3 MzMine adduct search output

2. Peak Alignment

2.1. Summary

Implemented a preliminary prototype of the EM algorithm in Matlab/Octave — used to infer the parameters of a mixture model for our (hopefully) better alignment method of peaks across different replicates/samples. Seems to work okay. Our method might perform better against current alignment methods, which rely on computing the retention time drift of peaks across replicates.

2.2. Binning of Peaks across Replicates

We want to align peaks. A peak is characterised by its m/z value, intensity and retention time. Peaks derived from the same metabolite (’derivatives’) can be clustered by their mass chromatograms shapes correlations. We can use all these information to help in aligning peaks across different replicates. A picture (figure 4 below) is worth a thousand words here.

Figure 4 Alignment algorithm

2.3. Mixture of Multinomial Distributions

Given N clusters of correlated peaks and consistent bins of length M from the discretised the clusters of peaks: our input for this peak alignment model is a set of vectors X(x1, x2, ..., xn), where xnm is the cluster of peaks represented in its bin form, i.e. n vectors of size m x 1. If we know that we have K metabolites, we can assume that we would end up with K aligned clusters — produced as the result of peak alignment. Each xnm can be generated by the following multinomial distribution over m consistent bins:
(1)p(xn|βk) = sn!mxnm!mβxnmkm
where βkm is the probability of ’success’ of a particular bin to occur (mβkm = 1), xnm is the frequency count of discretised peak in that bin, and sn = mxnm. Assuming that each xnm is independent, the likelihood of the observed data X is the following multinomial mixture model is
(2)p(xn|β1, β2, ...) = nkπkCmβxnmkm
where C is the constant s!mxnm! and πk is the prior probability (mixture coefficient) of each multinomial component.

2.4. EM Algorithm

We can use the EM algorithm to fit the parameters of the model (βk, πk). It is easier to work with the log likelihood of equation (2), so
(3)B = log p(xn|β1, β2, ...) = nlog kπkCmβxnmkm
We want to maximise the likelihood in equation (3), but it’s hard due to the summation inside the log. Via Jensen's inequality [1], we know that log(E{X}) ≥ E{log(X)}, i.e. the log of an expectation >= the expectation of a log. By getting (3) into the form of a log of an expectation, it is easier to maximise its lower bound (the expectation of a log), which doesn’t contain any summation. To do this, we introduce qnk where qnk = 1, which is a probability distribution.
(4)B = nlog kqnkqnkπkCmβxnmkm = nlog kqnkπkCmβxnmkmqnk = nlog EqnkπkCmβxnmkmqnk
Applying Jensen’s inequality,
(5)B ≥ nEqnklogπkCmβxnmkmqnk = nkqnk logπkCmβxnmkmqnk
Expanding everything in the last expression in (5), we get the lower bound of B in (6) as
(6)kqnk log πk + nkqnk log C + nkqnk log mβxnmkm − nkqnk log qnk
Next, we get the partial derivatives of equation (6) with respect to the model parameters βk, πk, qnk and set them to 0. This gives us the update equation (7), (8) and (9) used in the EM algorithm.
(7)πk = 1Nnqnk
(8)βkm = nqnkxnmmnqnkxnm
(9)qnk = πkp(xn|βk)Kj = 1πjp(xn|βj)

2.5. Result

I tested the implementation on some easy test data. The result in figure 5 below looks sensible ?

Figure 5 EM result. Top left is the heatmap of the test data X: containing mostly 0s, except the coloured regions. Top right is the clustering assignments in the matrix Q. EM algorithm seems to discover the three (very obvious) clusters in the data correctly . Bottom left is the log likelihood during EM iterations, which converges very quickly.

3. Administrative Stuffs

A few administrative stuffs here and there. Just writing it down for my own reminder.
  1. Machine Learning Summer School 2013. SICSA states in their website that they will fund the registration and accomodation fees up to £500. I’d need to provide details & breakdown of the funds required. The application is not open yet (11 Feb onwards), so I have no idea how much the registration fee is. Will try to send email asking the organisers, or perhaps just wait until 11 February 2013 to find out. Would be from 25 August - 8 September 2013.
  2. SICSA Summer School 2013 on Big Data and Visualisation at St. Andrews, application closed by 15 March 2013. Would be from 8 - 12 July 2013.
  3. University of Glasgow’s Industry Day 28 Feb 2013, is it worth attending ? Need to submit poster, nothing to posterize yet ?

6 comments:

  1. For your test case, K = 3 right (i.e you had 3 clusters)? I'm just trying to clarify the notation.

    ReplyDelete
  2. Joe:

    I'd be grateful if you could point me to a paper/reference where this likelihood function is given? Also the EM Q function and the expression for the parameters.

    ReplyDelete
    Replies
    1. Multinomial mixture model is used a lot in document (text) clustering in information retrieval. Here's a good paper that describes it: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.5422&rep=rep1&type=pdf.

      For derivations of the math bits in the EM algorithm, I worked through it myself using pen & paper. Sorry, didn't type it though .. this pdf seems to be a good start :P http://www.cs.ubbcluj.ro/~csatol/gep_tan/Bishop-CUED-2006.pdf

      Delete
  3. Thanks! I had not seen anyone use both $pi_k$ and $q_{nk}$ which was why I was wondering if someone had published a derivation. The set-up discussed in the Bishop tutorial you linked to, for instance, only uses $pi_k$ and then Dirichlet priors for $pi$ and $\beta$.
    I also found this paper which discusses unsupervised clustering using EM with a multinomial likelihood: http://arxiv.org/abs/cs/0606069.

    ReplyDelete

Powered by Blogger.