As the amount of DNA data in biological databases keeps growing at an increasing pace, the remote protein homology detection problem remains a major issue in bioinformatics. We propose in this talk a novel kernel for sequences - with applications to protein homology detection - to be used with usual kernel methods such as SVM. Borrowing
ideas and techniques from the field of data compression, we show
that positive definite kernels can be derived from arithmetic coding
algorithms, when the coding probability is a Bayesian mixture related
to the empirical mutual information between two strings.
When the coding probability is obtained by averaging finite memory sources, with mixture of Dirichlet distributions as conditional probabilities, the kernel can be written as a triple-mixture of probabilities and be computed with a linear complexity in time and space. Encouraging
results will be shown on the problem of remote protein homology detection.