on Friday, February 20, 2015
The paper "HDP-Align: Hierarchical Dirichlet Process Clustering for Multiple Peak Alignment of LC-MS Data" has regrettably been rejected from the 23rd Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2015). We're still working on addressing the issues raised by the reviewers, but to whet your appetite, here's a sneak preview of the abstract.


Matching peak features across multiple LC-MS runs (alignment) is an integral part of all LC-MS data processing pipelines. Alignment is challenging due to variations in the retention time of peak features across runs and the large number of peak features produced by a single compound in the analyte. In this paper, we propose a Bayesian non-parametric model that aligns peaks via a hierarchical cluster model using both peak mass and retention time. Crucially, this method provides confidence values in the form of posterior probabilities, allowing the user to distinguish between aligned peaksets of high and low confidence.

The results of our experiments on a diverse set of proteomic, glycomic and metabolomic data show that the proposed model produces alignment results competitive with other widely used benchmark methods, while at the same time providing a probabilistic measure of confidence in the alignment results, thus making it possible to trade off precision against recall.

Availability: Our method has been implemented as a stand-alone application in Java, available for download at http://github.com/joewandy/HDP-Align.

The paper "Incorporating peak grouping information for alignment of multiple liquid chromatography mass spectrometry datasets" has been accepted for publication in Bioinformatics. Hooray! Link to follow shortly!
on Saturday, May 24, 2014
Hierarchical Dirichlet process (HDP) clustering is a non-parametric Bayesian method for clustering data while sharing those clusters across groups. The Dirichlet process (DP) mixture model is commonly used for Bayesian non-parametric clustering, as it allows us to avoid having to specify the number of clusters a priori. The DP mixture model is explained in more detail here and here for the Gaussian case. When we have multiple groups (e.g. files or datasets) across which we wish to share the clustering, the standard DP mixture is extended so that the DPs for all groups share a base distribution, which is in turn drawn from a DP. This results in the hierarchical DP model (fully explained here).
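To get a feel for where the "no fixed number of clusters" property comes from, here is a minimal, illustrative Python sketch of the Chinese restaurant process (CRP), the sequential view of the DP that underlies these mixture models. This is not the HDP (no sharing across groups, no likelihood term), just the prior over partitions: each new point joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to the concentration parameter alpha. The function name and interface are my own invention for this example.

```python
import random

def crp_assignments(n_points, alpha, seed=0):
    """Sample a partition of n_points items from a Chinese restaurant process.

    Point i joins existing cluster k with probability counts[k] / (i + alpha),
    or starts a brand-new cluster with probability alpha / (i + alpha).
    Returns a list of integer cluster labels, one per point.
    """
    rng = random.Random(seed)
    counts = []       # size of each cluster created so far
    assignments = []
    for i in range(n_points):
        # Total unnormalised weight is (points seen so far) + alpha.
        r = rng.uniform(0, i + alpha)
        cumulative = 0.0
        choice = len(counts)  # default: open a new cluster
        for k, size in enumerate(counts):
            cumulative += size
            if r < cumulative:
                choice = k
                break
        if choice == len(counts):
            counts.append(1)          # new cluster with one member
        else:
            counts[choice] += 1       # join an existing cluster
        assignments.append(choice)
    return assignments

clusters = crp_assignments(100, alpha=1.0)
print("number of clusters:", len(set(clusters)))
```

Note that the number of clusters is an outcome of the sampling, not an input: on average it grows roughly as alpha * log(n). A full DP mixture sampler combines these prior probabilities with the likelihood of each point under each cluster's parameters; the HDP repeats the same trick one level up, with each group's clusters drawn from a shared, DP-distributed base measure.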

Sudderth et al. (2005) extended the standard HDP for clustering by incorporating the notion of transformation. Kim & Smyth (2006) proposed an extension of the HDP with random effects, where each group is assumed to be generated from a template mixture model with group-level variability in both the mixing proportions and the component parameters.

There isn't much code available online illustrating how to actually implement the HDP. Yee Whye Teh's site has lots of links to tutorial slides and actual Matlab code. I played around with implementing the HDP for the purpose of clustering metabolite peaks across files. A short piece of documentation is available here, while the not-so-ready-for-production code can be found on my Github.