ゲノム情報科学研究教育機構 / アブストラクト (2003年7月14日)

As the amount of DNA data in biological databases keeps growing at an increasing pace, the remote protein homology detection problem remains a major issue in bioinformatics. We propose in this talk a novel kernel for sequences - with applications to protein homology detection - to be used with usual kernel methods such as SVM. Borrowing ideas and techniques from the field of data compression, we show that positive definite kernels can be derived from arithmetic coding algorithms, when the coding probability is a Bayesian mixture related to the empirical mutual information between two strings.

When the coding probability is obtained by averaging finite memory sources, with mixture of Dirichlet distributions as conditional probabilities, the kernel can be written as a triple-mixture of probabilities and be computed with a linear complexity in time and space. Encouraging results will be shown on the problem of remote protein homology detection.