on Saturday, April 11, 2015
A lot of basic tooling seems to be missing or under-developed for natural language processing in Indonesian. Here are some links that I found and that might be useful:

For sentiment analysis, given the lack of training corpora, unsupervised or semi-supervised approaches might be the way to go. [1] presents the Latent Sentiment Model, a special case of standard LDA with only three topics (positive, negative and neutral), and uses it for sentiment analysis on Chinese text. The results seem pretty decent. I might have a play with implementing it later.
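To get a feel for the idea, here's a rough sketch of a three-topic LDA fitted with collapsed Gibbs sampling, in plain Python. It is not the model from [1] as published: I'm assuming the sentiment knowledge enters through a small seed lexicon that boosts the topic-word prior, and the names `lsm_gibbs`, `boost` and the toy Indonesian lexicon are all made up for illustration.

```python
import random

LABELS = ["positive", "negative", "neutral"]

def lsm_gibbs(docs, lexicon, iters=200, alpha=0.5, beta=0.01, boost=1.0, seed=0):
    """Label each tokenised document with its dominant sentiment topic.

    docs: list of token lists; lexicon: {word: label} seed words (an assumption).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    wid = {w: i for i, w in enumerate(vocab)}
    K, V = len(LABELS), len(vocab)

    # Asymmetric topic-word prior: seed words get extra mass under their label.
    prior = [[beta] * V for _ in range(K)]
    for word, label in lexicon.items():
        if word in wid:
            prior[LABELS.index(label)][wid[word]] += boost
    prior_sum = [sum(row) for row in prior]

    # Count tables and a random initial topic for every token.
    n_dk = [[0] * K for _ in docs]
    n_kw = [[0] * V for _ in range(K)]
    n_k = [0] * K
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][wid[w]] += 1
            n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = wid[w], z[d][i]
                # Remove the token, resample its topic, then add it back.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + prior[t][v]) / (n_k[t] + prior_sum[t])
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # A document's label is the topic that owns most of its tokens.
    return [LABELS[max(range(K), key=lambda t: n_dk[d][t])] for d in range(len(docs))]

# Toy usage with a two-word seed lexicon ("bagus" = good, "buruk" = bad).
docs = [["film", "ini", "bagus", "sekali"], ["pelayanan", "buruk", "sekali"]]
print(lsm_gibbs(docs, {"bagus": "positive", "buruk": "negative"}))
```

With only a handful of tokens the output is mostly driven by the seed words, so this is really just a way to see the moving parts; a real experiment would need a proper lexicon and a much larger unlabelled corpus.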

[1] He, Yulan. "Latent sentiment model for weakly-supervised cross-lingual sentiment classification." Advances in Information Retrieval. Springer Berlin Heidelberg, 2011. 214-225.

The poster for HDP-Align can be found here. It was submitted to and presented at the SICSA Medical Imaging Workshop in Dundee, Scotland on the 27th of March, 2015.

In the poster, we present a method for the probabilistic alignment of peak features across multiple input files by assigning them to latent variables shared across runs. This is essentially clustering with a hierarchical Dirichlet process scheme, with some modifications to suit the nature of our problem.
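As a much-simplified illustration of what "assignment to latent variables" looks like, here's a sketch of a single assignment step for one peak in a flat (non-hierarchical) Dirichlet process mixture over retention time only. This is not the HDP-Align model itself: the function name, the Gaussian likelihood with fixed cluster means, and all the parameter values are assumptions picked for the example.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def assignment_probs(rt, clusters, alpha=1.0, sd=5.0, base_mean=500.0, base_sd=300.0):
    """Probability of a peak at retention time `rt` joining each cluster.

    clusters: list of (size, mean_rt) pairs for the existing clusters.
    Returns one probability per existing cluster, plus a final entry
    for opening a new cluster.
    """
    # Existing cluster: popularity (size) times how well the peak fits it.
    weights = [size * normal_pdf(rt, mean, sd) for size, mean in clusters]
    # New cluster: concentration alpha times the likelihood with the unknown
    # cluster mean integrated out under a normal base distribution.
    weights.append(alpha * normal_pdf(rt, base_mean, math.sqrt(base_sd ** 2 + sd ** 2)))
    total = sum(weights)
    return [w / total for w in weights]

# A peak at 102 s, two existing clusters centred at 100 s and 400 s.
print(assignment_probs(102.0, [(3, 100.0), (2, 400.0)]))
```

A Gibbs sampler would draw the peak's cluster from these probabilities and repeat for every peak; the hierarchical version additionally shares the clusters across runs, which this sketch leaves out.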

The proposed model is probably generic enough to be applicable to other sorts of data. However, the current implementation of the inference procedure, via Gibbs sampling, is rather tightly coupled to existing code and might not be easily reusable for other purposes. The code also doesn't scale well to the realistic data sizes encountered in daily usage. I'm working on simplifying the model (discretising some parts of it) and trying out other inference procedures (e.g. variational inference) that would let us scale the method to realistic-sized data.
on Friday, February 20, 2015
The paper "HDP-Align: Hierarchical Dirichlet Process Clustering for Multiple Peak Alignment of LC-MS Data" has regrettably been rejected from the 23rd Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2015). We're still working on fixing the issues raised by the reviewers, but to whet your appetite, here's a sneak preview of the abstract.


Matching peak features across multiple LC-MS runs (alignment) is an integral part of all LC-MS data processing pipelines. Alignment is challenging due to variations in the retention time of peak features across runs and the large number of peak features produced by a single compound in the analyte. In this paper, we propose a Bayesian non-parametric model that aligns peaks via a hierarchical cluster model using both peak mass and retention time. Crucially, this method provides confidence values in the form of posterior probabilities allowing the user to distinguish between aligned peaksets of high and low confidence.

The results from our experiments on a diverse set of proteomic, glycomic and metabolomic data show that the proposed model produces alignment results competitive with other widely-used benchmark methods while also providing a probabilistic measure of confidence in the alignment results, thus making it possible to trade off precision against recall.

Availability: Our method has been implemented as a stand-alone application in Java, available for download at http://github.com/joewandy/HDP-Align.
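To make the "posterior probabilities" part of the abstract a bit more concrete, here's a small sketch of one common way to turn a sampler's output into confidence values: count how often two peaks land in the same cluster across the posterior samples. Whether HDP-Align computes its probabilities exactly this way is an assumption on my part, and the function names and peak labels below are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_posteriors(samples):
    """Estimate P(two peaks are aligned) as their co-clustering frequency.

    samples: list of clusterings, one per posterior sample; each clustering
    is a list of sets of peak identifiers.
    """
    counts = Counter()
    for clustering in samples:
        for cluster in clustering:
            for pair in combinations(sorted(cluster), 2):
                counts[pair] += 1
    return {pair: n / len(samples) for pair, n in counts.items()}

def confident_pairs(posteriors, threshold):
    """Keep only the aligned pairs whose probability reaches the threshold."""
    return sorted(pair for pair, prob in posteriors.items() if prob >= threshold)

# Four toy samples over peaks from run "a" and run "b".
samples = [
    [{"a1", "b1"}, {"a2", "b2"}],
    [{"a1", "b1"}, {"a2"}, {"b2"}],
    [{"a1", "b1", "b2"}, {"a2"}],
    [{"a1", "b1"}, {"a2", "b2"}],
]
posteriors = pair_posteriors(samples)
print(confident_pairs(posteriors, 0.9))  # [('a1', 'b1')]
print(confident_pairs(posteriors, 0.5))  # [('a1', 'b1'), ('a2', 'b2')]
```

Raising the threshold keeps fewer but more reliable matches (higher precision), while lowering it keeps more matches at the cost of more mistakes (higher recall).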