Just throwing around random ideas for research here ... still ongoing work.
1. Assumption
A peak group is a set of related peaks that have been clustered together. Peaks within a peak group can be aligned to other peaks in different peak groups coming from other replicates. Since a peak group is a set of related peaks, we can expect the member peaks of say, two different groups to be ’similar’ to each other. Consequently, when aligning peaks, we would expect to see more alignments between ’similar’ peak groups that between dissimilar ones.
Assumption #1 A good peak alignment result has more peaks between two peak groups (of related peaks) that are aligned to each other.
Corollary #1 A bad peak alignment result has peaks (in a peak group) that are aligned to many other peaks in various different groups.
We need to find ways to justify and quantify the assumption above — although I feel that intuitively it makes sense.
2. Representation
We can represent a peak group as a vertex in an undirected graph, and the alignment of peaks as edges between the vertices. Multiple alignments of peaks between the same groups can be represented as a weight along the edge. To differentiate between replicates, we colour the vertices based on their originating replicates. For an alignment of peaks across N replicates, we connect all the N peaks groups (containing the aligned peaks) to each other in the graph.
The
degree of a vertex in a graph is the number of edges that touches it. For the purpose of peak alignment and
provided assumption #1 is correct, a peak group (vertex) with a small degree means it’s not aligned to many other peak groups. This is indicative of good peak alignments. Furthermore, since we represent multiple alignments between two peak groups as a single edge with increasing weight, greater weight means more peaks are aligned (correctly) between the two groups of related peaks. The presence of edges with larger weight is also indicative of good alignment results.
Figure
1↓ illustrates this. The left box shows some possible pairwise alignments (spanning two files), while the right box shows some possible triplets alignments (spanning three files). Whenever peaks are aligned across three replicates, we connect all their groups to each other to maintain symmetry and make analysis easier. The good alignment, shown at the top of each boxes, between peak groups would have vertices with low degrees but larger weight along the edge. The bad alignment, shown at the bottom, has a vertex with a high degree where its the adjacent edges has smaller weight.
3. Methods
The concepts defined above let us us crudely compare and evaluate alignment results. For baseline comparison, I chose the usual
mzMatch greedy alignment algorithm. The algorithm works as follows (
some corrections from what’s in the previous report):
First, sort all peaks by their intensities in descending order. Starting from the most intense chromatogram, try to match it with other chromatograms in the other samples (i.e. file) that falls within the mass tolerance (ppm) and having the nearest distance. Distance is defined over two peaks, say p1 and p2 according to algorithm 1.
diff = the absolute difference between p1.rt and p2.rt
# first, check if the rt difference is outside the retention
# time window
If diff > rtWindow then
return distance = -1
# if inside retention time window but p1 and p2 are greater
# than 30s apart, check if their signals overlap
s1 = get the signals (m/z, intensity) of p1
s2 = get the signals (m/z intensity) of p2
if diff > 30 then
check if the start-end m/z of s1 overlaps with the start-end m/z of s2
if not, return distance = -1
# if less than 30s apart, then compare the difference in the
# area under the curves of s1 & s2
return distance = compare the signals (m/z, intensity) of p1 and p2
Algorithm 1 Computing the distance between two peaks in mzMatch alignment
’Comparing the signal’ above is done by calculating the difference in the area under the curve between the two peaks. Copy-n-paste from the code comment: Compares the given signal to this signal. Essentially the difference in area under the two curves is calculated, resulting in a single value which if small indicates the two signals are very similar and if large indicates the two signals are very unlike. The method starts by copying both signals and normalizing them to 1. The minimum and maximum x-value is calculated for the combination of both the signals. By starting at the minimum value and incrementing by 1 all the y-value differences between both signals are calculated and summed. This is the returned value.
4. Methods & Preliminary Results
4.1. A Simple Example
Aligning my test data
{Std_3_1,
Std_3_2,
Std_3_3} from last week, with rt tolerance=30, ppm=3 (the default values), we get table
1↓. Assuming that a three-peaks alignment (across three files) is more reliable than a two-peaks alignment (across two files), so we’re just going to focus on the triplets for now. Figure 2 shows the resulting alignment counts. 939 alignments, each linking three peaks across three files, are found. We pick the top 100 alignments (ordered by the peak intensities) for visualisation, shown in figure 2. This gives us a graph with 78 vertices (peak groups) and 96 edges (where each alignment adds +1 to the weight of the edge).
| No. of peaks per alignment | # Alignment Count |
| Single-peak alignment | 6662 |
| Two-peaks alignment | 1514 |
| Three-peaks alignment | 939 |
| Total | 9115 |
Table 1 Counts of alignments and their number of peaks
4.2. Comparing Good and Bad Alignment
We can now try to vary the ’strictness’ of the alignment parameters (m/z, rt tolerance) and see how they affect the resulting alignment results in graph representation. For that purpose, we define the two levels of strictness of the alignment parameters.
| Strictness Level | ppm tolerance | rt tolerance |
| Low | 10 | 300 |
| High | 1 | 10 |
Table 2 Strictness of alignments
Note: results produced from different runs of the program, so results might vary abit due to differences during the peak grouping stage.
Visual inspections of figure 3 shows some observable differences. The left box figure
3↑ shows that low strictness of alignment parameters produces a graph with more connected components (clusters) and have edges with large weights. This suggests that peaks across some ’similar’ groups are aligned frequently, since each alignment increases the weight along an edge. The average degree of all vertices is 3.23. We can see from figure 3 (left) that certain edges have very large weights (52, 53, ...). This happens because the alignment algorithm is linking only related peak groups together, so there are more peaks aligned multiple times across the same groups (vertices). In constrast, the bad alignment result in figure 3 (right) have slighly fewer connected components, because everything is sucked into the giant component in the middle. Vertices also have a higher degree (average 3.74) due to spurious links between unrelated peak groups. The results here are consistent with our assumptions #1 above. While this doesn’t prove the assumption is true, it’s a nice validation.
Key finding: using peak groups information and looking at the structure of the alignment results (in graph form) provides us with a way to compare different alignment results? We can also design algorithms that take advantage of this information to produce better alignment results, i.e. our EM multinomial mixture model.
4.3. What to do next ? Random thoughts ...
- Need to think of smarter ways of quantifying the quality of a pairwise alignment of peaks (or between peak groups).
- Need to look at the actual peaks and try to justify/validate assumption #1 above.
- If we look at alignment process as ’clustering’ of related peak groups, we can think about ways to validate the clustering results. I’ll have to investigate which appropriate internal (Davies–Bouldin index, Dunn Index, etc) or external evaluations are suitable. Simon also suggested simply computing the mean square errors of intensities between peaks along edges with large weights. Will try that too. Additionally, there are maybe things from graph theory (concerning the structural information/edge weight), which might be useful too. The goal is to get a single (or at least several) measures of the ’goodness’ of an alignment results, which can be used to compare various alignment algorithms (mzMatch, XCMS, MzMine, ours, etc.).
- Potentially we might have a way to ’detect’ a bad alignment entry ? From casual observation, a bad pairwise alignment of peaks seems to connect components together. Removing bad pairwise alignment reduces the number of connected components, i.e. it’s essentially to a bridge edge. We can try to run a bridge finding algorithm on the graph and try to see what happens. If we have a way to determine the quality of a pairwise alignment, then we can decide if it’s truly a good or bad pairwise alignment.