on Monday, March 25, 2013
Investigating the use of derivative peaks in the alignment of LC-MS data.

Joe Wandy, Rónán Daly, Rainer Breitling & Simon Rogers

Liquid chromatography (LC), followed by mass spectrometry (MS), has become one of the most commonly used tools in untargeted metabolomic studies. In almost all experimental designs, it is necessary to compare chromatograms across biological or technical replicates and across sample groups. Central to this step is the alignment of peaks across various chromatograms.

Alignment of peaks is a challenging problem due to missing data (peaks not observed in some chromatograms) and the potentially non-linear shifts in retention time between chromatograms. Existing tools align peaks individually, without taking into account the dependencies between them. But fragment and isotope peaks derived from the same metabolite should show the same chromatographic behaviour, and combining the information from their chromatograms (e.g. chromatographic peak shape) could potentially improve alignment quality.

In this work, we investigate ways to incorporate the information from derivative peaks into the alignment process, using a graph-based approach. In particular, we investigate whether concordant alignments within groups of derivative peaks can result in increased confidence about the quality of those alignments. Based on experiments with real metabolomic data [1], we find that using this approach allows us to explicitly trade recall for precision when producing a final set of alignments.

[1] Lange, E., Tautenhahn, R., Neumann, S., & Gröpl, C. (2008). Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements. BMC Bioinformatics, 9, 375. doi:10.1186/1471-2105-9-375

----

To be presented at the 23rd annual Mathematical and Statistical Aspects of Molecular Biology (MASAMB) workshop.

on Friday, March 1, 2013
Just throwing around random ideas for research here ... still ongoing work.

1. Assumption

A peak group is a set of related peaks that have been clustered together. Peaks within a peak group can be aligned to peaks in different peak groups coming from other replicates. Since a peak group is a set of related peaks, we can expect the member peaks of, say, two different groups to be 'similar' to each other. Consequently, when aligning peaks, we would expect to see more alignments between 'similar' peak groups than between dissimilar ones.
Assumption #1 In a good peak alignment result, many peaks are aligned between the same pair of related peak groups.
Corollary #1 In a bad peak alignment result, peaks in one peak group are aligned to peaks scattered across many different groups.
We need to find ways to justify and quantify the assumption above — although I feel that intuitively it makes sense.

2. Representation

We can represent a peak group as a vertex in an undirected graph, and the alignment of peaks as edges between the vertices. Multiple alignments of peaks between the same two groups are represented as a weight on the edge. To differentiate between replicates, we colour the vertices based on their originating replicates. For an alignment of peaks across N replicates, we connect all the N peak groups (containing the aligned peaks) to each other in the graph.
The degree of a vertex in a graph is the number of edges that touch it. For the purpose of peak alignment, and provided assumption #1 is correct, a peak group (vertex) with a small degree is not aligned to many other peak groups, which is indicative of good peak alignments. Furthermore, since we represent multiple alignments between two peak groups as a single edge with increasing weight, a greater weight means more peaks are aligned (correctly) between the two groups of related peaks. The presence of edges with larger weights is therefore also indicative of good alignment results.
Figure 1 illustrates this. The left box shows some possible pairwise alignments (spanning two files), while the right box shows some possible triplet alignments (spanning three files). Whenever peaks are aligned across three replicates, we connect all their groups to each other to maintain symmetry and make analysis easier. The good alignments, shown at the top of each box, have vertices with low degrees but larger weights along the edges. The bad alignments, shown at the bottom, have a vertex with a high degree whose adjacent edges have smaller weights.


Figure 1 A vertex is a peak group. Each vertex colour represents a different replicate (input file). An edge linking two vertices represents an alignment between the peaks in the groups. Edges have weights, indicating the number of alignments between the vertices. Top: good alignments, with vertices of low degree and greater edge weights. Bottom: 'bad' alignments, with vertices of higher degree and lower edge weights.
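This representation is straightforward to prototype. Below is a minimal sketch in Python using networkx; the input format (a list of alignments, each a list of (replicate, peak group) pairs) is a hypothetical stand-in for whatever the pipeline actually produces.

import networkx as nx
from itertools import combinations

def build_alignment_graph(alignments):
    # alignments: list of alignments, each a list of
    # (replicate, peak_group) pairs for the aligned peaks
    G = nx.Graph()
    for aligned in alignments:
        # colour each vertex by its originating replicate
        for replicate, group in aligned:
            G.add_node(group, replicate=replicate)
        # connect all N peak groups to each other; repeated
        # alignments between the same pair increment the edge weight
        for (_, g1), (_, g2) in combinations(aligned, 2):
            if G.has_edge(g1, g2):
                G[g1][g2]['weight'] += 1
            else:
                G.add_edge(g1, g2, weight=1)
    return G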

3. Methods

The concepts defined above let us crudely compare and evaluate alignment results. For a baseline comparison, I chose the usual mzMatch greedy alignment algorithm. The algorithm works as follows (with some corrections from what's in the previous report):
First, sort all peaks by their intensities in descending order. Starting from the most intense chromatogram, try to match it with chromatograms in the other samples (i.e. files) that fall within the mass tolerance (ppm) and have the nearest distance. The distance between two peaks, say p1 and p2, is defined according to algorithm 1.

diff = the absolute difference between p1.rt and p2.rt

# first, check if the rt difference is outside the retention
# time window
if diff > rtWindow then
    return distance = -1

# if inside the retention time window but p1 and p2 are more
# than 30s apart, check if their signals overlap
s1 = get the signals (m/z, intensity) of p1
s2 = get the signals (m/z, intensity) of p2
if diff > 30 then
    check if the start-end m/z of s1 overlaps with the start-end m/z of s2
    if not, return distance = -1

# otherwise, compare the difference in the area under the
# curves of s1 and s2
return distance = compare the signals (m/z, intensity) of p1 and p2
Algorithm 1 Computing the distance between two peaks in mzMatch alignment
'Comparing the signals' above is done by calculating the difference in the area under the curve between the two peaks. Copied from the code comment: "Compares the given signal to this signal. Essentially the difference in area under the two curves is calculated, resulting in a single value which if small indicates the two signals are very similar and if large indicates the two signals are very unlike. The method starts by copying both signals and normalizing them to 1. The minimum and maximum x-value is calculated for the combination of both the signals. By starting at the minimum value and incrementing by 1 all the y-value differences between both signals are calculated and summed. This is the returned value."
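As a sanity check of my reading of that comment, here is a minimal Python sketch of the same computation. The interpolation onto a common integer grid is my assumption about how the intermediate y-values are obtained, and the function name compare_signals is hypothetical, not mzMatch's API.

import numpy as np

def compare_signals(s1, s2):
    # s1, s2: arrays of (x, y) pairs sorted by x,
    # e.g. (retention time, intensity)
    s1 = np.array(s1, dtype=float)
    s2 = np.array(s2, dtype=float)
    # normalise both signals to a maximum of 1
    s1[:, 1] /= s1[:, 1].max()
    s2[:, 1] /= s2[:, 1].max()
    # common x-range of the two signals, stepped by 1
    lo = min(s1[:, 0].min(), s2[:, 0].min())
    hi = max(s1[:, 0].max(), s2[:, 0].max())
    xs = np.arange(lo, hi + 1)
    # evaluate each signal on the grid (0 outside its support)
    y1 = np.interp(xs, s1[:, 0], s1[:, 1], left=0, right=0)
    y2 = np.interp(xs, s2[:, 0], s2[:, 1], left=0, right=0)
    # sum of absolute y-differences: small = very similar signals
    return np.abs(y1 - y2).sum()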

4. Preliminary Results

4.1. A Simple Example

Aligning my test data {Std_3_1, Std_3_2, Std_3_3} from last week, with rt tolerance = 30 and ppm = 3 (the default values), gives the counts in table 1. Assuming that a three-peak alignment (across three files) is more reliable than a two-peak alignment (across two files), we focus on the triplets for now: 939 alignments, each linking three peaks across three files, are found. We pick the top 100 alignments (ordered by peak intensity) for visualisation, shown in figure 2. This gives a graph with 78 vertices (peak groups) and 96 edges (where each alignment adds +1 to the weight of the corresponding edge).

No. of peaks per alignment    Alignment count
Single-peak alignment         6662
Two-peak alignment            1514
Three-peak alignment          939
Total                         9115
Table 1 Counts of alignments and their number of peaks


Figure 2 Top 100 alignments (by intensity) between peak groups, visualised. Edges with large weight (> 10) are coloured red.

4.2. Comparing Good and Bad Alignment

We can now vary the 'strictness' of the alignment parameters (m/z and rt tolerance) and see how they affect the resulting alignments in the graph representation. For that purpose, we define two levels of strictness (table 2).

Strictness level    ppm tolerance    rt tolerance
Low                 10               300
High                1                10
Table 2 Strictness of alignments
Note: the results were produced from different runs of the program, so they might vary a bit due to differences during the peak grouping stage.


Figure 3 Low (left) and high (right) strictness of alignment parameters. Edges in red have weight > 10.
Visual inspection of figure 3 shows some observable differences. The left box of figure 3 shows that low strictness of the alignment parameters produces a graph with more connected components (clusters) and edges with large weights. This suggests that peaks across some 'similar' groups are aligned frequently, since each alignment increases the weight along an edge. The average degree over all vertices is 3.23. We can see from figure 3 (left) that certain edges have very large weights (52, 53, ...). This happens because the alignment algorithm is linking only related peak groups together, so more peaks are aligned repeatedly across the same groups (vertices). In contrast, the bad alignment result in figure 3 (right) has slightly fewer connected components, because everything is sucked into the giant component in the middle. Vertices also have a higher degree (average 3.74) due to spurious links between unrelated peak groups. The results here are consistent with assumption #1 above. While this doesn't prove the assumption is true, it's a nice validation.
Key finding: using peak group information and looking at the structure of the alignment results (in graph form) may provide a way to compare different alignment results. We can also design algorithms that take advantage of this information to produce better alignments, e.g. our EM multinomial mixture model. The summary statistics above can be read straight off the graph, as sketched below.
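A minimal sketch of how these statistics could be computed with networkx, assuming a graph G built as in the earlier build_alignment_graph sketch:

import networkx as nx

# average degree over all vertices (the 3.23 vs 3.74 above)
avg_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()

# number of connected components (clusters)
n_components = nx.number_connected_components(G)

# edges with very large weights (> 10, drawn in red in the figures)
heavy_edges = [(u, v) for u, v, w in G.edges(data='weight') if w > 10]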

4.3. What to do next? Random thoughts ...

  1. Need to think of smarter ways of quantifying the quality of a pairwise alignment of peaks (or between peak groups).
  2. Need to look at the actual peaks and try to justify/validate assumption #1 above.
  3. If we look at the alignment process as a 'clustering' of related peak groups, we can think about ways to validate the clustering results. I'll have to investigate which internal (Davies–Bouldin index, Dunn index, etc.) or external evaluation measures are suitable. Simon also suggested simply computing the mean squared errors of intensities between peaks along edges with large weights. Will try that too. Additionally, there may be things from graph theory (concerning the structural information/edge weights) that might be useful. The goal is to get a single measure (or at least several measures) of the 'goodness' of an alignment result, which can be used to compare various alignment algorithms (mzMatch, XCMS, MzMine, ours, etc.).
  4. Potentially we might have a way to 'detect' a bad alignment entry? From casual observation, a bad pairwise alignment of peaks seems to connect components together: removing it increases the number of connected components, i.e. it is essentially a bridge edge. We can try to run a bridge-finding algorithm on the graph and see what happens (see the sketch below). If we have a way to determine the quality of a pairwise alignment, then we can decide if it's truly a good or bad pairwise alignment.
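A quick way to try this, again as a sketch assuming the graph G from before (nx.bridges is available in recent networkx versions; the weight cutoff is an arbitrary choice of mine):

import networkx as nx

# bridges: edges whose removal increases the number of
# connected components
bridges = list(nx.bridges(G))

# candidate 'bad' alignments: bridges with low edge weight,
# i.e. few peaks support the link between the two groups
candidate_bad = [(u, v) for u, v in bridges if G[u][v]['weight'] <= 1]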
on Monday, February 4, 2013

1. Peak Identification

1.1. Summary

I retrieved a lot more 'noisy' database entries from the massive online chemical database PubChem. With Ronan's new identification algorithm in hand, this allowed me to run experiments comparing the behaviour of mzMatch's identification code against our clustering-based one (figure 1).


Figure 1 ROC curve comparing mzMatch identification vs. the new clustering-based one, at various levels of identification (id1, id2, ...)
Note:

1) The ROC curve is wrong: vary the threshold, not the DB size.

2) F1 score: why does MZMATCH always start at 1?

1.2. The Long Story

Previously, I retrieved the additional 'noisy' database entries (to use during identification experiments) by searching our local snapshot of the KEGG database (~14,000 compounds in total). For each compound in the standard database, I searched for other compounds in the KEGG database with similar masses, within a close tolerance. This approach worked fine, but it didn't really give us enough 'noisy' entries to severely affect identification performance during evaluation. Specifically, to construct the ROC curve, we want to be able to compute the true positive rate, the false positive rate and the F1 score across an algorithm's full performance range, from 0 to 1.
This week, I tried a different approach: searching for compounds based on the m/z values of the peaks in the datasets (Std1, Std2, Std3). Specifically, I looped through all the peaks in a dataset, computing the monoisotopic neutral mass as $m_{\text{neutral}} = (m/z) \cdot z \pm m_{\text{adduct}}$. This value is used as the mass parameter (at ppm = 3 for tolerance) to query the online PubChem database. The ionisation mode defines the sign of the adduct term, e.g. for the hydrogen adduct it is -1.00728 in the positive mode and +1.00728 in the negative mode. I limited my query to retrieving just the top 10 results from PubChem. The whole process of retrieving compounds from PubChem actually takes a long time (2 HTTP GETs per peak, plus network latency, etc.); I would need to look into a faster way of issuing batch queries to PubChem.
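For reference, a minimal sketch of this mass computation (the function names are mine; the adduct mass is the value quoted above):

def neutral_mass(mz, z, mode, m_adduct=1.00728):
    # hydrogen adduct: [M+H]+ in positive mode means we subtract
    # the adduct mass; [M-H]- in negative mode means we add it back
    sign = -1.0 if mode == 'positive' else +1.0
    return mz * z + sign * m_adduct

def ppm_window(mass, ppm=3.0):
    # the +/- mass tolerance used when querying PubChem
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta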
After excluding isomers of the standard database compounds, I still ended up with a lot (tens of millions) of formulae that can be used for identification experiments. The retrieved formulae were randomised, and the same increasing subsets were used as additional 'noisy' database entries in each performance evaluation.
mzMatch's identification results in a set of peaks, annotated based on the formulae matching a peak's mass/retention time tolerance. Each formula annotation contributes some weight to the overall 'vote' of that formula in the whole dataset. I had to set a threshold value for the votes to separate the positive and negative identification results. For this, I chose the threshold to be 1 (sensible, or maybe not?).
Similarly, our new clustering-based algorithm also produces a set of probability values (one for each identification level) for every formula, corresponding to the 'support' of the peaks assigned to it. To separate positive and negative identification results, I set the threshold to 0.75 (plucked out of nowhere, but it seems good). Given a threshold, the points on the ROC curve follow directly, as sketched below.
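A minimal sketch of turning scores and a threshold into ROC points and an F1 value (hypothetical names: scores would be the votes or probabilities above, labels the ground truth from the standard database; sweeping the threshold traces the ROC curve):

import numpy as np

def rates(scores, labels, threshold):
    # predicted positives at this threshold
    pred = np.asarray(scores) >= threshold
    labels = np.asarray(labels, dtype=bool)
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    tn = np.sum(~pred & ~labels)
    tpr = tp / (tp + fn)    # true positive rate (recall)
    fpr = fp / (fp + tn)    # false positive rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    return tpr, fpr, f1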

2. Peak Alignment

2.1. Summary

As advised by Simon, I put a Dirichlet prior on the model from last week. This changed just one line of the Matlab code: the update equation for β_k now has a regularisation parameter α, coming from the Dirichlet prior Dir(α). What's the effect of setting a large/small value for α?

2.2. Mixture of Multinomial Distributions

From last week, we have:

(1)   $p(x_n \mid \beta_1, \beta_2, \ldots) = \sum_k \pi_k \, C \prod_m \beta_{km}^{x_{nm}}$

where $\beta_{km}$ is the probability of 'success' of a particular bin ($\sum_m \beta_{km} = 1$), $x_{nm}$ is the frequency count of discretised peaks in that bin, and $s_n = \sum_m x_{nm}$. $C$ is the multinomial constant $s_n! / \prod_m x_{nm}!$ and $\pi_k$ is the prior probability (mixture coefficient) of each multinomial component.

Now, we put a Dirichlet prior on $\beta_k$: $\mathrm{Dir}(\beta_k \mid \alpha) = \frac{1}{B(\alpha)} \prod_m \beta_{km}^{\alpha - 1}$. Since the Dirichlet distribution is a conjugate prior to the multinomial distribution in equation (1), we can use the EM algorithm to get the maximum a posteriori (MAP) estimate. The M-step now becomes:

(2)   $\pi_k = \frac{1}{N} \sum_n q_{nk}$

(3)   $\beta_{km} = \frac{\sum_n q_{nk} x_{nm} + \alpha - 1}{\sum_m \left( \sum_n q_{nk} x_{nm} + \alpha - 1 \right)}$

The E-step is still the same as before:

(4)   $q_{nk} = \frac{\pi_k \, p(x_n \mid \beta_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid \beta_j)}$
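The actual implementation is in Matlab; below is just a minimal numpy sketch of one EM iteration under these updates, to make the algebra concrete. The multinomial constant C cancels in the E-step since it does not depend on k, and α > 1 is assumed so the MAP update stays positive.

import numpy as np

def em_step(X, pi, beta, alpha):
    # X: (N, M) matrix of bin counts x_nm
    # pi: (K,) mixture coefficients; beta: (K, M), rows sum to 1
    # E-step, eq. (4): responsibilities q_nk, in log space for stability
    log_p = X @ np.log(beta).T + np.log(pi)      # (N, K)
    log_p -= log_p.max(axis=1, keepdims=True)
    q = np.exp(log_p)
    q /= q.sum(axis=1, keepdims=True)
    # M-step, eq. (2)
    pi = q.mean(axis=0)
    # M-step, eq. (3): MAP update smoothed by the Dir(alpha) prior
    beta = q.T @ X + (alpha - 1.0)
    beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, q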

3. Administrative Stuff

A few administrative items here and there. Just writing them down for my own reminder.
  1. Machine Learning Summer School 2013: I will need to wait until registration opens on 11 Feb to know the exact registration fee. SICSA states on their website that they will fund the registration and accommodation fees up to £500. I got in touch with Dr. Philipp Hennig (one of the organisers), who said hotel rates in Tübingen _start_ around 60 Euro per night. For the whole 2-week duration of the summer school, that would be around £700 for accommodation alone, not including registration and travel fees (out of budget?).
  2. I helped to organise the OBR Glasgow launch event. This event was also featured in the corner of SICSA's front page, in the news section (although not many people from DCS were interested). It is not directly related to our research, but I found it to be a good way to keep in touch and make industrial contacts with the life science people. I also got to know a few other PhD students, life scientists working on proteomics/metabolomics research using LC/GC-MS data, who are the 'users' of the sort of 'product' we are developing. They might be useful contacts in the future, who knows.
  3. 23rd Annual Workshop of Mathematical and Statistical Aspects of Molecular Biology, Imperial College London, April 11/12 2013. Registration is open now, abstract submission by mid-March.