Bio-Knowledge Engineering Research Laboratory

[ English | Japanese ]

Bioinformatics Center, Institute for Chemical Research, Kyoto University
Gokasho, Uji, Kyoto 611-0011, Japan
Phone: +81-774-38-3023
Fax: +81-774-38-3037

https://www.bic.kyoto-u.ac.jp/pathway/

Current Research Topics

We have been working on developing new techniques in machine learning and data mining for analyzing biological/chemical data, particularly focusing on those represented by graphs and networks, occasionally combining them with table-type datasets such as gene expression. Our approaches can be classified into the following roughly four types, according to general categories of machine learning.

Classification
Semi-Supervised Learning
Clustering
Frequent Pattern Mining

We will show you a brief overview of our work on each of these four topics, which are categorized by methodological viewpoints, while our work can be classified by application viewpoints, which are also shown below as well. Enjoy!

Classification

Classification is a typical framework in machine learning, while labeled examples are very valuable to obtain in biology. Due to this reason mainly, we have not necessarily thought about classification over graphs intensively. We have focused on a setting with a labeled table input, in which examples can be represented as nodes in networks. In other words, a label is given by considering the structure embedded behind the variables of each example. In reality this setting corresponds to the classification on labeled expression data, such as case/control, considering the network structures behind genes. On these two inputs we have developed a classification method based on probabilistic model learning with a bit classical concept, a hierarchical mixture of experts. We have further gone on this work to a more advanced network classifier which combines an incremental performance improvement technique, boosting, and a popular propagation scheme, such as message passing or expectation propagation, over a network. The following is a list of relevant publications of our laboratory on this topic:

A Markov Classification Model for Metabolic Pathways.

Hancock, T. and Mamitsuka, H.

Algorithms for Molecular Biology, 5(1), 10, 2010.

Special Issue: Selected papers from WABI 2009

[DOI].

Boosted Optimization for Network Classification.

Hancock, T. and Mamitsuka, H.,

Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010) (JMLR: Workshop and Conference Proceedings, Vol. 9) , 305-312, Sardinia, Italy, May 2010, MIT Press.

[PDF (MIT Press)].

Semi-Supervised Learning

Semi-supervised learning over graphs can be divided into two types: clustering and classification. Semi-supervised clustering is node clustering under two types of constraints: must-link (two connected nodes must be in the same cluster) and cannot-link (two nodes should be in different clusters), while semi-supervised classification is label-propagation, where part of all nodes are labeled. For semi-supervised clustering we have developed a probabilistic model and a spectral approach. On the other hand, we have combined, for semi-supervised classification, graph features with an efficient supervised optimization. More concretely, we have formulated label-propagation into an efficient algorithm with graph embedding and multiple kernel learning. Another setting we have considered for label propagation is multiple graphs, sharing the same node set and some graph features possibly being missed among them. The following is a list of relevant publications of our laboratory on this topic:

Annotating Gene Function by Combining Expression Data with a Modular Gene Network.

Shiga, M., Takigawa, I. and Mamitsuka, H.

Proceedings of the Fifteenth International Conference on Intelligent Systems for Molecular Biology (ISMB/ECCB 2007), (Bioinformatics, 23(13), 2007), i468-i478, Vienna, Austria, July, 2007, Oxford University Press.

[DOI].

A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network.

Shiga, M., Takigawa, I. and Mamitsuka, H.,

Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2007), 647-656, San Jose, CA, USA, August 2007, ACM Press.

[DOI, PDF (Preprints)].

A Spectral Approach to Clustering Numerical Vectors as Nodes in a Network.

Shiga, M., Takigawa, I. and Mamitsuka, H.

Pattern Recognition, 44(2), 236-251, 2011.

[DOI].

Discriminative Graph Embedding for Label Propagation.

Nguyen, C. H. and Mamitsuka, H.

IEEE Transactions on Neural Networks, 22(9), 1395-1495, 2011.

[DOI].

Efficient Semi-Supervised Learning on Locally Informative Multiple Graphs

Shiga, M. and Mamitsuka, H.

Pattern Recognition, 45(3), 1035-1049, 2012.

[DOI].

Clustering

We have long time worked upon designing a variety of generative models and developing efficient learning schemes, i.e. estimating probabilistic parameters. The most representative work along this line is a series of generative models for labeled ordered trees. We have used Markov property assumption over trees and elaborated dynamic programming-based expectation propagation procedures. Similarly we have built some models for clustering scientific documents, which can be typically divided into several parts, like abstracts, main texts etc. More recently, for graphs, we have developed a generative model and a variational Bayesian learning scheme, under the setting with multiple graphs of all unique nodes, sharing clusters but some clusters being missed. The following is a list of relevant publications of our laboratory on this topic:

Probabilistic Model for Mining Labeled Ordered Trees: Capturing Patterns in Carbohydrate Sugar Chains.

Ueda, N., Aoki-Kinoshita, K. F., Yamaguchi, A., Akutsu, T. and Mamitsuka, H.

IEEE Transactions on Knowledge and Data Engineering, 17(8), 1051-1064, 2005.

[DOI].

A New Efficient Probabilistic Model for Mining Labeled Ordered Trees Applied to Glycobiology.

Hashimoto, K., Aoki-Kinoshita, K. F., Ueda, N., Kanehisa, M. and Mamitsuka, H.

ACM Transactions on Knowledge Discovery from Data, 2(1), Article No. 6, 2008.

[DOI].

Field Independent Probabilistic Model for Clustering Multi-Field Documents.

Zhu, S., Takigawa, I., Zeng, J. and Mamitsuka, H.

Information Processing and Management, 45(5), 555-570, 2009.

[DOI, PDF (Advance access)].

A Variational Bayesian Framework for Clustering with Multiple Graphs.

Shiga, M. and Mamitsuka, H.

To appear in IEEE Transactions on Knowledge and Data Engineering, 2011.

[DOI].

Frequent Pattern Mining

Mining frequent patterns is the mainstream framework in data mining. A well-known problem of frequent patterns is redundancy. To overcome this issue, we have developed the idea of δ-tolerance (or α-) closed frequent patterns (which are in between closed and maximal frequent patterns) for graphs and trees, and elaborated efficient algorithms of mining those patterns. A more application oriented technique we have worked upon is mining frequent subgraph-substring pairs from drug-target interactions. The following is a list of relevant publications of our laboratory on this topic:

Mining Significant Tree Patterns in Carbohydrate Sugar Chains.

Hashimoto, K., Takigawa, I., Shiga, M., Kanehisa, M. and Mamitsuka, H.

Proceedings of the Seventh European Conference on Computational Biology (ECCB 2008), (Bioinformatics, 24(16), 2008), i167-i173, Cagliari, Sardinia-Italy, September, 2008, Oxford University Press.

[DOI].

Efficiently Mining δ-Tolerance Closed Frequent Subgraphs.

Takigawa, I. and Mamitsuka, H.

Machine Learning, 82(2), 95-121, 2011.

[DOI].

Mining Significant Substructure Pairs for Interpreting Polypharmacology in Drug-Target Network.

Takigawa, I., Tsuda, K. and Mamitsuka, H.

PLoS One, 6(2), e16999, 2011.

[DOI].

Applications

We here show a variety of applications, which are part of all biological topics we have tackled or are tackling by using not only our original but also existing techniques in machine learning and data mining. The applications we introduce here are Annotating Gene Functions, Gene Expression Analysis-driven Systems Biology, Mining Histone Modification Patterns, Mining Proteolysis Patterns, Chemoinformatics, Detecting Switching Mechanisms in Gene Expression, Glycoinformatics, Immunoinformatics, Biomedical Text Mining.

Annotating Gene Functions

An important objective of molecular biology is to know the behavior of genes in cells, which is the first step of the extreme goal of understanding the entire biological mechanism. Gene functions can be assigned by various ways. A relatively straight-forward idea is to do clustering genes by which resultant clusters correspond to functional groups of genes, and then to assign functions to unknown genes by using known genes in the same cluster. Currently gene clustering can be done by a variety of information, including gene expression, network and literature. Thus relevant machine learning techniques include clustering and semi-supervised learning, for which we have developed original techniques which improve the existing performance and have applied to gene function annotation. The following is a list of relevant publications by our group on this topic:

Clustering Genes with Expression and Beyond.

Shiga, M. and Mamitsuka, H.

To appear in WIRES Data Mining and Knowledge Discovery, 2011.

Invited Review Paper.

[DOI].

Annotating Gene Function by Combining Expression Data with a Modular Gene Network.

Shiga, M., Takigawa, I. and Mamitsuka, H.

[DOI].

A Spectral Approach to Clustering Numerical Vectors as Nodes in a Network.

Shiga, M., Takigawa, I. and Mamitsuka, H.

Pattern Recognition, 44(2), 236-251, 2011.

[DOI].

A Variational Bayesian Framework for Clustering with Multiple Graphs.

Shiga, M. and Mamitsuka, H.

To appear in IEEE Transactions on Knowledge and Data Engineering, 2011.

[DOI].

Efficient Semi-Supervised Learning on Locally Informative Multiple Graphs

Shiga, M. and Mamitsuka, H.

Pattern Recognition, 45(3), 1035-1049, 2012.

[DOI].

Gene Expression Analysis-driven Systems Biology

Currently we have a various types of biological networks. A typical example is metabolic networks, which are a collection of chemical reactions, having been found in biology. A simple, possible biological question is what part of metabolic networks are made active (or inactive) under some condition, such as a serious disorder like Alzheimer's. We have explored approaches of capturing active paths in metabolic networks under some gene expression dataset. This work can be part of network biology or more generally systems biology. We are now the most interested in this direction, and have started embarking upon the machine learning-based journey across the vast biochemistry ocean. The following is a list of relevant publications by our group on this topic:

Probabilistic Path Ranking Based on Adjacent Pairwise Coexpression for Metabolic Transcripts Analysis.

Takigawa, I. and Mamitsuka, H.

Bioinformatics, 24(2), 250-257, 2008.

[DOI].

Mining Metabolic Pathways through Gene Expression.

Hancock, T., Takigawa, I. and Mamitsuka, H.

Bioinformatics, 26(17), 2128-2135, 2010.

[DOI].

Mining Histone Modification Patterns

Epigenetics is now a highly attention-paid concept, meaning a direct and outer control of DNA sequences at a somewhat macroscopic level, causing expression regulation. The main player of epigenetics is histones, around which DNA can wind up, and the modification of histones affects DNA winding, resulting in suppression or enhancement of gene expression. We have just started working on this project of capturing relevant patterns of various factors in epigenetics, mainly those including histone acetylations, transcription factors and gene expression. Our current approach is based on clustering, but would be changed depending upon the data we have to think about. The following is the first paper we have published on this topic:

Genome-wide Integration on Transcription Factors, Histone Acetylation and Gene Expression Reveals Genes Co-regulated by Histone Modification Patterns.

Natsume-Kitatani, Y., Shiga, M. and Mamitsuka, H.

PLoS One, 6(7), e22281, 2011.

[DOI].

Mining Proteolysis Patterns

The standard role of proteases is to digest target amino acid sequences into the pile of amino acid junk, while there exist special proteases, whose functions are to activate or modify the function of other proteins. So far it is not still well analyzed what substrates and what part of substrates are likely to be digested by these special proteases, and these problems must be a key issue to understand the biological phenomena, in which these special proteases are involved. We have formalized this topic as a sequence classification problem, which is a typical setting of bioinformatics, and applied a multiple kernel learning approach to predict the cleavage site with a variety of information related with sequences. The following is a list of relevant publications by our group on this topic:

A Review of Statistical Methods for Prediction of Proteolytic Cleavage

duVerle, D. and Mamitsuka, H.

To appear in Briefings in Bioinformatics, 2011.

[DOI].

Calpain Cleavage Prediction Using Multiple Kernel Learning.

duVerle, D., Ono, Y., Sorimachi, H. and Mamitsuka, H.

PLoS One, 6(5), e19035, 2011.

[DOI].

Chemoinformatics

One aspect of chemoinformatics is an old, classical issue known as QSAR (Quantitative Structure Activity Relationships) in computer chemistry, while another aspect is that a lot of modern machine learning techniques have been applied to this field quite recently. Extremely the objective is to capture principles behind drug-target interactions, for which we have conducted mainly two approaches: frequent pattern mining over graphs and generative model-based learning for co-occurrent drug-target events in documents. In the first approach, we have dealt with chemical compounds as graphs by using chemical structures, and considered basically two problem settings, both being based on frequent pattern mining. One setting is to generate frequent (basically, but changeable to significant) subgraph features from a given set of chemical compounds. The other setting is to consider pairs of chemical compounds and targets, which can be graphs and strings, and retrieve significant pairs of subgraphs and substring. The pairs obtained in the second approach can be clustered into groups, which we think can be a key to understand the polypharmacology of drug-target interactions. The following is a list of relevant publications by our group on this topic:

Efficiently Mining δ-Tolerance Closed Frequent Subgraphs.

Takigawa, I. and Mamitsuka, H.

Machine Learning, 82(2), 95-121, 2011.

[DOI].

Mining Significant Substructure Pairs for Interpreting Polypharmacology in Drug-Target Network.

Takigawa, I., Tsuda, K. and Mamitsuka, H.

PLoS One, 6(2), e16999, 2011.

[DOI].

A Probabilistic Model for Mining Implicit ``Chemical Compound - Gene'' Relations from Literature.

Zhu, S., Okuno, Y., Tsujimoto, G. and Mamitsuka, H.

Proceedings of the Fourth European Conference on Computational Biology (ECCB/JBI 2005), (Bioinformatics, 21, Supplement 2, ii245-ii251), Madrid, Spain, September, 2005, Oxford University Press.

[DOI].

Detecting Switching Mechanisms in Gene Expression

We have considered a real biological setting, which we call "switching mechanism", of gene expression, where one gene is correlated and the other gene is anti-correlated with some specific gene. In fact the switching mechanism is an important regulation mechanism over three-way interaction in expression, which naturally needs a considerable amount of computation time when considering all possible combinations of three genes. We have developed an efficient and robust approach for detecting switching mechanisms from gene expression, based on pruning techniques and newly developed hypothesis testing. The following is a list of relevant publications by our group on this topic:

Efficiently Finding Genome-wide Three-way Gene Interactions from Transcript- and Genotype-Data.

Kayano, M., Takigawa, I., Shiga, M., Tsuda, K. and Mamitsuka, H.

Bioinformatics, 25(21), 2735-2743, 2009.

[DOI].

ROS-DET: Robust Detector of Switching Mechanisms in Gene Expression.

Kayano, M., Takigawa, I., Shiga, M., Tsuda, K. and Mamitsuka, H.

Nucleic Acids Research, 39(11), e74, 2011.

[PubMed, DOI].

Glycoinformatics

Glycans, as sequences, have branch-shaped tree structures, which have made high-speed sequencing hard, resulting in generating rich sequence databases very slow. However, we have already more than 10,000 known glycans, awaiting for computational analysis. Tree-shaped glycans can be labeled ordered trees in a computer science sense, and we have developed a variety of techniques for mining from glycans, including generative models for labeled ordered trees and efficient frequent subtree mining. The following is a list of relevant publications by our group on this topic:

Probabilistic Model for Mining Labeled Ordered Trees: Capturing Patterns in Carbohydrate Sugar Chains.

Ueda, N., Aoki-Kinoshita, K. F., Yamaguchi, A., Akutsu, T. and Mamitsuka, H.

IEEE Transactions on Knowledge and Data Engineering, 17(8), 1051-1064, 2005.

[DOI].

ProfilePSTMM: Capturing Tree-structure Motifs in Carbohydrate Sugar Chains.

Aoki, K. F., Ueda, N., Mamitsuka, H. and Kanehisa, M.

Proceedings of the Fourteenth International Conference on Intelligent Systems for Molecular Biology (ISMB 2006), (Bioinformatics, 22(14), e25-e34), Fortaleza, Brazil, August, 2006, Oxford University Press.

[DOI].

Mining Significant Tree Patterns in Carbohydrate Sugar Chains.

Hashimoto, K., Takigawa, I., Shiga, M., Kanehisa, M. and Mamitsuka, H.

Proceedings of the Seventh European Conference on Computational Biology (ECCB 2008), (Bioinformatics, 24(16), 2008), i167-i173, Cagliari, Sardinia-Italy, September, 2008, Oxford University Press.

[DOI].

A New Efficient Probabilistic Model for Mining Labeled Ordered Trees Applied to Glycobiology.

Hashimoto, K., Aoki-Kinoshita, K. F., Ueda, N., Kanehisa, M. and Mamitsuka, H.

ACM Transactions on Knowledge Discovery from Data, 2(1), Article No. 6, 2008.

[DOI].

Informatic Innovations in Glycobiology: Relevance to Drug Discovery.

Mamitsuka, H.

Drug Discovery Today, 13(3/4), 118-123, 2008.

Invited Review Paper.

[PubMed, DOI].

Glycoinformatics: Data Mining-based Approaches.

Mamitsuka, H.

Chimia, 65(1/2), 10-13, 2011.

Invited Review Paper.

[DOI].

Immunoinformatics

Immunology would be one of the most advanced fields in biology, and the immunological mechanism is now well elucidated, resulting in a lot of good amount of databases to be analyzed. We have focused on MHC (Major Histocompatibility Complex) molecules which work for the initial stage of the immune system, and considered the problem of predicting peptides binding to MHC molecules. A simple prediction is by using a single predictor trained by a sorted set of peptides which bind to a certain MHC molecule. However our strategy is more flexible by incorporating a lot of other available information, such as auxiliary MHC molecules and ensemble classifiers. The following is a list of relevant publications by our group on this topic:

Improving MHC Binding Peptide Prediction by Incorporating Binding Data of Auxiliary MHC Molecules.

Zhu, S., Udaka, K., Sidney, J., Sette, A., Aoki-Kinoshita, K. F. and Mamitsuka, H.

Bioinformatics, 22(13), 1648-1655, 2006.

[DOI].

MetaMHC: A Meta Approach to Predict Peptides Binding to MHC Molecules

Hu, X., Zhou, W., Udaka, K., Mamitsuka, H. and Zhu, S.

Nucleic Acids Research, 38, W474-W479, 2010.

[DOI].

Ensemble Approaches for Improving HLA Class I-peptide Binding Prediction.

Hu, X., Mamitsuka, H. and Zhu, S.

Journal of Immunological Methods, 374(1/2), 47-52, 2011.

[PubMed, DOI].

Biomedical Text Mining

Clustering documents is a major issue in natural language processing, because of a huge number of currently accumulated documents for which labels cannot be assigned manually. This can be absolutely true of biomedical documents. Currently over 19 million articles from life science journals are indexed in MEDLINE, a public database of National Library of Medicine in the U.S., and clustering them into groups, which can be the most appropriate medical subfields, would be a great benefit for biological/medical scientists to find publications in which they might be interested the most. For clustering biomedical documents, we have conducted several different strategies, such as spectral learning and generative models. The following is a list of relevant publications by our group on this topic:

Enhancing MEDLINE Document Clustering by Incorporating MeSH Semantic Similarity.

Zhu, S., Zeng, J. and Mamitsuka, H.

Bioinformatics, 25(15), 1944-1951, 2009.

[DOI].

Field Independent Probabilistic Model for Clustering Multi-Field Documents.

Zhu, S., Takigawa, I., Zeng, J. and Mamitsuka, H.

Information Processing and Management, 45(5), 555-570, 2009.

[DOI, PDF (Advance access)].