on Saturday, May 24, 2014
Hierarchical Dirichlet process (HDP) clustering is a non-parametric Bayesian method for clustering data while sharing those clusters across groups. The Dirichlet process (DP) mixture model itself is commonly used for Bayesian non-parametric clustering, since it allows us to avoid having to specify the number of clusters a priori. The DP mixture model is explained in more detail here and here for the Gaussian case. When we have multiple groups (e.g. files or datasets) across which we wish to share the clustering, the standard DP mixture is extended so that the DPs used for the groups share a base distribution, which is in turn drawn from a DP. This results in the hierarchical DP model (fully explained here).
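As a quick reminder (this is just the standard formulation, not specific to any particular implementation), the generative process of the HDP mixture can be written as:

G_0 \sim \mathrm{DP}(\gamma, H)
G_j \mid G_0 \sim \mathrm{DP}(\alpha_0, G_0) \quad \text{for each group } j
\theta_{ji} \mid G_j \sim G_j \quad \text{for each observation } i \text{ in group } j
x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})

where H is the base measure, \gamma and \alpha_0 are concentration parameters, and F is the observation model (e.g. a Gaussian, as in the DP mixture case mentioned above).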

Sudderth, et al. (2005) extended the standard HDP for clustering by incorporating the notion of transformation. Kim & Smyth (2006) proposed an extension of HDP with random effects, where each group is assumed to be generated from a template mixture model that has group-level variability in the mixing proportions and the component parameters.

There isn't much code available online to illustrate how to actually implement the HDP. Yee Whye Teh's site has lots of links to tutorial slides and actual MATLAB code. I played around with implementing the HDP for the purpose of clustering metabolite peaks across files, and the MATLAB code + PDF can be found here.
on Monday, March 25, 2013
Investigating the use of derivative peaks in the alignment of LC-MS data.

Joe Wandy, Rónán Daly, Rainer Breitling & Simon Rogers

Liquid chromatography (LC), followed by mass spectrometry (MS), has become one of the most commonly used tools in untargeted metabolomic studies. In almost all experimental designs, it is necessary to compare chromatograms across biological or technical replicates and across sample groups. Central to this step is the alignment of peaks across various chromatograms.

Alignment of peaks is a challenging problem due to missing data (peaks not observed in some chromatograms) and the potentially non-linear shifts in retention time between chromatograms. Existing tools use information from individual peaks, without taking into account dependencies between peaks. However, fragment and isotope peaks derived from the same metabolite should show the same chromatographic behaviour, and combining the information from their chromatograms (using e.g. chromatographic peak shape) could potentially improve alignment quality.

In this work, we investigate ways to incorporate the information from derivative peaks into the alignment process, using a graph-based approach. In particular, we investigate whether concordant alignments within groups of derivative peaks can result in increased confidence about the quality of those alignments. Based on experiments with real metabolomic data [1], we find that using this approach allows us to explicitly trade recall for precision when producing a final set of alignments.

[1] Lange, E., Tautenhahn, R., Neumann, S., & Gröpl, C. (2008). Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements. BMC Bioinformatics, 9, 375. doi:10.1186/1471-2105-9-375

----

To be presented at Mathematical and Statistical Aspects of Molecular Biology, the 23rd annual MASAMB workshop.

on Friday, March 1, 2013
Just throwing around random ideas for research here ... still ongoing work.

1. Assumption

A peak group is a set of related peaks that have been clustered together. Peaks within a peak group can be aligned to peaks in other peak groups coming from other replicates. Since a peak group is a set of related peaks, we can expect the member peaks of, say, two corresponding groups to be ’similar’ to each other. Consequently, when aligning peaks, we would expect to see more alignments between ’similar’ peak groups than between dissimilar ones.
Assumption #1 In a good peak alignment result, more peaks between two peak groups (of related peaks) are aligned to each other.
Corollary #1 In a bad peak alignment result, peaks in a peak group are aligned to many other peaks spread across various different groups.
We need to find ways to justify and quantify the assumption above — although I feel that intuitively it makes sense.

2. Representation

We can represent each peak group as a vertex in an undirected graph, and the alignment of peaks as edges between the vertices. Multiple alignments of peaks between the same pair of groups are represented as a weight along the edge. To differentiate between replicates, we colour the vertices based on their originating replicates. For an alignment of peaks across N replicates, we connect all the N peak groups (containing the aligned peaks) to each other in the graph.
The degree of a vertex in a graph is the number of edges that touch it. For the purpose of peak alignment, and provided assumption #1 is correct, a peak group (vertex) with a small degree is not aligned to many other peak groups, which is indicative of good peak alignments. Furthermore, since we represent multiple alignments between two peak groups as a single edge with increasing weight, a greater weight means more peaks are aligned (correctly) between the two groups of related peaks. The presence of edges with larger weights is also indicative of good alignment results.
Figure 1↓ illustrates this. The left box shows some possible pairwise alignments (spanning two files), while the right box shows some possible triplet alignments (spanning three files). Whenever peaks are aligned across three replicates, we connect all their groups to each other to maintain symmetry and make analysis easier. The good alignments, shown at the top of each box, have vertices with low degrees but larger weights along the edges. The bad alignments, shown at the bottom, have a vertex with a high degree whose adjacent edges have smaller weights.


Figure 1 A vertex is a peak group. Each vertex colour represents a different replicate (input file). An edge linking two vertices represents an alignment between the peaks in the two groups. Edges have weights, indicating the number of alignments between the vertices. Top: the good alignments, having vertices with low degrees and greater edge weights. Bottom: the ’bad’ alignments, having vertices with higher degrees and lower edge weights.
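To make the representation above concrete, here's a minimal sketch of how such a graph could be built. Python and networkx are just my choice for illustration (not something used in the actual pipeline), and the list of (replicate, peak group) pairs for each alignment is a made-up data layout:

import networkx as nx

# each alignment is a list of (replicate, peak group) pairs, e.g. a triplet
# alignment touching groups g12, g7 and g30 in replicates 0, 1 and 2
# (made-up identifiers, for illustration only)
alignments = [
    [(0, "g12"), (1, "g7"), (2, "g30")],
    [(0, "g12"), (1, "g7"), (2, "g30")],
    [(0, "g12"), (1, "g45")],
]

G = nx.Graph()
for alignment in alignments:
    # a vertex is a peak group; 'colour' records the originating replicate
    for replicate, group in alignment:
        G.add_node(group, colour=replicate)
    # connect all the peak groups involved in this alignment to each other,
    # incrementing the edge weight when the same pair appears again
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            u, v = alignment[i][1], alignment[j][1]
            if G.has_edge(u, v):
                G[u][v]["weight"] += 1
            else:
                G.add_edge(u, v, weight=1)

print(G.number_of_nodes(), G.number_of_edges())
print(dict(G.degree()))  # low degree = a group aligned to few other groups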

3. Methods

The concepts defined above let us crudely compare and evaluate alignment results. For baseline comparison, I chose the usual mzMatch greedy alignment algorithm. The algorithm works as follows (with some corrections from what's in the previous report):
First, sort all peaks by their intensities in descending order. Starting from the most intense chromatogram, try to match it with chromatograms in the other samples (i.e. files) that fall within the mass tolerance (ppm) and have the nearest distance. The distance between two peaks, say p1 and p2, is defined according to algorithm 1.

diff = the absolute difference between p1.rt and p2.rt

# first, check if the rt difference is outside the retention
# time window
if diff > rtWindow then
    return distance = -1

# if inside the retention time window but p1 and p2 are more
# than 30s apart, check if their signals overlap
s1 = get the signals (m/z, intensity) of p1
s2 = get the signals (m/z, intensity) of p2
if diff > 30 then
    check if the start-end m/z of s1 overlaps with the start-end m/z of s2
    if not, return distance = -1

# otherwise, compare the difference in the area under the
# curves of s1 and s2
return distance = compare the signals (m/z, intensity) of p1 and p2
Algorithm 1 Computing the distance between two peaks in mzMatch alignment
’Comparing the signal’ above is done by calculating the difference in the area under the curve between the two peaks. Copy-n-paste from the code comment: Compares the given signal to this signal. Essentially the difference in area under the two curves is calculated, resulting in a single value which if small indicates the two signals are very similar and if large indicates the two signals are very unlike. The method starts by copying both signals and normalizing them to 1. The minimum and maximum x-value is calculated for the combination of both the signals. By starting at the minimum value and incrementing by 1 all the y-value differences between both signals are calculated and summed. This is the returned value.
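A rough Python sketch of that comparison, based on my reading of the comment above rather than the actual mzMatch source, assuming each signal comes as sorted arrays of x values and intensities:

import numpy as np

def signal_distance(x1, y1, x2, y2):
    # normalise both signals so their maximum intensity is 1
    y1 = np.asarray(y1, dtype=float) / np.max(y1)
    y2 = np.asarray(y2, dtype=float) / np.max(y2)

    # common grid spanning both signals, stepping by 1 as in the comment
    lo = min(np.min(x1), np.min(x2))
    hi = max(np.max(x1), np.max(x2))
    grid = np.arange(np.floor(lo), np.ceil(hi) + 1)

    # interpolate both signals onto the grid (zero outside their support)
    g1 = np.interp(grid, x1, y1, left=0.0, right=0.0)
    g2 = np.interp(grid, x2, y2, left=0.0, right=0.0)

    # sum of absolute y-differences: small = similar, large = dissimilar
    return float(np.sum(np.abs(g1 - g2)))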

4. Methods & Preliminary Results

4.1. A Simple Example

Aligning my test data {Std_3_1, Std_3_2, Std_3_3} from last week, with rt tolerance = 30 and ppm = 3 (the default values), gives table 1↓, which shows the resulting alignment counts. Assuming that a three-peak alignment (across three files) is more reliable than a two-peak alignment (across two files), we're just going to focus on the triplets for now. 939 alignments, each linking three peaks across three files, are found. We pick the top 100 alignments (ordered by the peak intensities) for visualisation, shown in figure 2. This gives us a graph with 78 vertices (peak groups) and 96 edges (where each alignment adds +1 to the weight of the corresponding edge).

No. of peaks per alignment    Alignment count
Single-peak alignment         6662
Two-peaks alignment           1514
Three-peaks alignment          939
Total                         9115
Table 1 Counts of alignments and their number of peaks
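As an aside, this kind of tally is easy to reproduce once the alignments are loaded; a short sketch with a made-up layout where each alignment is just the list of peaks it links:

from collections import Counter

# each alignment is the list of peak identifiers it links (made-up values)
alignments = [["p1"], ["p2", "p3"], ["p4", "p5", "p6"], ["p7"]]

counts = Counter(len(a) for a in alignments)   # e.g. Counter({1: 2, 2: 1, 3: 1})
total = sum(counts.values())                   # total number of alignments
print(counts, total)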


Figure 2 Top 100 alignments (by intensity) between peak groups, visualised. Edges with large weight (> 10) are coloured in red.

4.2. Comparing Good and Bad Alignment

We can now try to vary the ’strictness’ of the alignment parameters (m/z and rt tolerances) and see how this affects the resulting alignments in the graph representation. For that purpose, we define two levels of strictness for the alignment parameters.

Strictness Level    ppm tolerance    rt tolerance
Low                 10               300
High                 1                10
Table 2 Strictness of alignments
Note: the results were produced from different runs of the program, so they might vary a bit due to differences during the peak grouping stage.


Figure 3 Low (left) and high (right) strictness of alignment parameters. Edges in red have weight > 10.
Visual inspection of figure 3 shows some observable differences. The left box of figure 3↑ shows that the low strictness of alignment parameters produces a graph with more connected components (clusters) and edges with large weights. This suggests that peaks across some ’similar’ groups are aligned frequently, since each alignment increases the weight along an edge. The average degree over all vertices is 3.23. We can see from figure 3 (left) that certain edges have very large weights (52, 53, ...). This happens because the alignment algorithm is linking only related peak groups together, so more peaks are aligned multiple times across the same groups (vertices). In contrast, the bad alignment result in figure 3 (right) has slightly fewer connected components, because everything is sucked into the giant component in the middle. Vertices also have a higher degree (average 3.74) due to spurious links between unrelated peak groups. The results here are consistent with assumption #1 above. While this doesn't prove the assumption is true, it's a nice validation.
Key finding: using peak group information and looking at the structure of the alignment results (in graph form) may provide us with a way to compare different alignment results. We can also design algorithms that take advantage of this information to produce better alignment results, i.e. our EM multinomial mixture model.
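For reference, the summary numbers used in this comparison (average vertex degree, number of connected components, count of heavy edges) are straightforward to compute on the graph representation. A small sketch, assuming an alignment graph G built as in the earlier networkx example:

import networkx as nx

def alignment_graph_summary(G, heavy_weight=10):
    # crude 'goodness' indicators for an alignment graph (sketch only)
    degrees = [d for _, d in G.degree()]
    avg_degree = sum(degrees) / len(degrees) if degrees else 0.0
    n_components = nx.number_connected_components(G)
    heavy_edges = sum(1 for _, _, w in G.edges(data="weight", default=1)
                      if w > heavy_weight)
    return avg_degree, n_components, heavy_edges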

4.3. What to do next? Random thoughts ...

  1. Need to think of smarter ways of quantifying the quality of a pairwise alignment of peaks (or between peak groups).
  2. Need to look at the actual peaks and try to justify/validate assumption #1 above.
  3. If we look at the alignment process as a ’clustering’ of related peak groups, we can think about ways to validate the clustering results. I'll have to investigate which internal (Davies–Bouldin index, Dunn index, etc.) or external evaluations are suitable. Simon also suggested simply computing the mean squared errors of intensities between peaks along edges with large weights. Will try that too. Additionally, there may be things from graph theory (concerning the structural information/edge weights) that are useful too. The goal is to get a single measure (or at least several measures) of the ’goodness’ of an alignment result, which can be used to compare various alignment algorithms (mzMatch, XCMS, MzMine, ours, etc.).
  4. Potentially we might have a way to ’detect’ a bad alignment entry? From casual observation, a bad pairwise alignment of peaks seems to connect components together. Removing a bad pairwise alignment increases the number of connected components, i.e. it's essentially a bridge edge. We can try to run a bridge finding algorithm on the graph and see what happens (see the sketch after this list). If we have a way to determine the quality of a pairwise alignment, then we can decide if it's truly a good or bad pairwise alignment.
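Regarding point 4, networkx already ships a bridge finder, so a quick experiment could look like this sketch (again assuming the alignment graph G from the earlier example):

import networkx as nx

def candidate_bad_alignments(G):
    # bridges are edges whose removal increases the number of connected
    # components; under the idea in point 4 they are candidates for 'bad'
    # pairwise alignments, especially when their edge weight is low
    bridges = list(nx.bridges(G))
    return sorted(bridges, key=lambda e: G[e[0]][e[1]].get("weight", 1))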