Bioinformatics Center, Institute for Chemical Research, Kyoto University
Gokasho, Uji, Kyoto 611-0011, Japan
Phone: +81-774-38-3023
Fax: +81-774-38-3037
https://www.bic.kyoto-u.ac.jp/pathway/
Current Research Topics
We have been working on developing new techniques in machine learning
and data mining for analyzing biological/chemical data, particularly
focusing on those represented by graphs and networks, occasionally
combining them with table-type datasets such as gene expression. Our
approaches can be classified into the following roughly four types,
according to general categories of machine learning.
We will show you a brief overview of our work on each of these four
topics, which are categorized by methodological viewpoints, while our
work can be classified by application viewpoints, which are also shown
below as well. Enjoy!
Classification
Classification is a typical framework in machine learning, while
labeled examples are very valuable to obtain in biology. Due to this
reason mainly, we have not necessarily thought about classification
over graphs intensively.
We have focused on a setting with a labeled table input, in which
examples can be represented as nodes in networks. In other words, a
label is given by considering the structure embedded behind the
variables of each example. In reality this setting corresponds to the
classification on labeled expression data, such as case/control,
considering the network structures behind genes.
On these two inputs we have developed a classification method based
on probabilistic model learning with a bit classical concept, a
hierarchical mixture of experts.
We have further gone on this work to a more advanced network
classifier which combines an incremental performance improvement
technique, boosting, and a popular propagation scheme, such as message
passing or expectation propagation, over a network.
The following is a list of relevant publications of our laboratory
on this topic:
A Markov Classification Model for Metabolic Pathways.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010) (JMLR: Workshop and Conference Proceedings, Vol. 9) , 305-312, Sardinia, Italy, May 2010, MIT Press.
Semi-supervised learning over graphs can be divided into two types:
clustering and classification.
Semi-supervised clustering is node clustering under two types of
constraints: must-link (two connected nodes must be in the same
cluster) and cannot-link (two nodes should be in different clusters),
while semi-supervised classification is label-propagation, where part
of all nodes are labeled.
For semi-supervised clustering we have developed a probabilistic model
and a spectral approach.
On the other hand, we have combined, for semi-supervised
classification, graph features with an efficient supervised optimization.
More concretely, we have formulated label-propagation into an
efficient algorithm with graph embedding and multiple kernel learning.
Another setting we have considered for label propagation is multiple
graphs, sharing the same node set and some graph features possibly
being missed among them.
The following is a list of relevant publications of our laboratory
on this topic:
Annotating Gene Function by Combining Expression Data
with a Modular Gene Network.
Shiga, M., Takigawa, I. and Mamitsuka, H.
Proceedings of the Fifteenth
International Conference on Intelligent Systems for
Molecular Biology (ISMB/ECCB 2007),
(Bioinformatics, 23(13), 2007), i468-i478, Vienna,
Austria, July, 2007, Oxford University Press.
A Spectral Clustering Approach to Optimally Combining
Numerical Vectors with a Modular Network.
Shiga, M., Takigawa, I. and Mamitsuka, H.,
Proceedings of the Thirteenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD
2007), 647-656, San Jose, CA, USA, August 2007, ACM
Press.
We have long time worked upon designing a variety of
generative models and developing efficient learning schemes,
i.e. estimating probabilistic parameters.
The most representative work along this line is a series of
generative models for labeled ordered trees.
We have used Markov property assumption over trees and elaborated
dynamic programming-based expectation propagation procedures.
Similarly we have built some models for clustering scientific
documents, which can be typically divided into several parts, like
abstracts, main texts etc.
More recently, for graphs, we have developed a generative model and
a variational Bayesian learning scheme, under the setting with
multiple graphs of all unique nodes, sharing clusters but some
clusters being missed.
The following is a list of relevant publications of our laboratory
on this topic:
Probabilistic Model for Mining Labeled Ordered Trees: Capturing
Patterns in Carbohydrate Sugar Chains.
Ueda, N., Aoki-Kinoshita, K. F., Yamaguchi, A., Akutsu, T. and Mamitsuka, H.
IEEE Transactions on Knowledge and Data Engineering, 17(8),
1051-1064, 2005.
Mining frequent patterns is the mainstream framework in data mining.
A well-known problem of frequent patterns is redundancy.
To overcome this issue, we have developed the idea of
δ-tolerance (or α-) closed frequent patterns (which
are in between closed and maximal frequent patterns) for graphs and
trees, and elaborated efficient algorithms of mining those patterns.
A more application oriented technique we have worked upon is mining
frequent subgraph-substring pairs from drug-target interactions.
The following is a list of relevant publications of our laboratory
on this topic:
Mining Significant Tree Patterns in Carbohydrate Sugar Chains.
Hashimoto, K., Takigawa, I., Shiga, M., Kanehisa, M. and Mamitsuka, H.
Proceedings of the Seventh
European Conference on Computational Biology (ECCB 2008),
(Bioinformatics, 24(16), 2008),
i167-i173, Cagliari, Sardinia-Italy, September, 2008, Oxford University Press.
An important objective of molecular biology is to know the behavior of
genes in cells, which is the first step of the extreme goal of
understanding the entire biological mechanism. Gene functions can be
assigned by various ways. A relatively straight-forward idea is
to do clustering genes by which resultant clusters correspond to
functional groups of genes, and then to assign functions to unknown
genes by using known genes in the same cluster.
Currently gene clustering can be done by a variety
of information, including gene expression, network and literature.
Thus relevant machine learning techniques include clustering and
semi-supervised learning, for which we have developed original
techniques which improve the existing performance and have applied
to gene function annotation.
The following is a list of relevant publications by our group on
this topic:
Clustering Genes with Expression and Beyond.
Shiga, M. and Mamitsuka, H.
To appear in WIRES Data Mining and Knowledge Discovery,
2011.
Annotating Gene Function by Combining Expression Data
with a Modular Gene Network.
Shiga, M., Takigawa, I. and Mamitsuka, H.
Proceedings of the Fifteenth
International Conference on Intelligent Systems for
Molecular Biology (ISMB/ECCB 2007),
(Bioinformatics, 23(13), 2007), i468-i478, Vienna,
Austria, July, 2007, Oxford University Press.
Currently we have a various types of biological networks.
A typical example is metabolic networks, which are a collection of
chemical reactions, having been found in biology.
A simple, possible biological question is what part of metabolic
networks are made active (or inactive) under some condition, such as
a serious disorder like Alzheimer's.
We have explored approaches of capturing active paths in metabolic
networks under some gene expression dataset.
This work can be part of network biology or more generally systems
biology.
We are now the most interested in this direction, and have started
embarking upon the machine learning-based journey across the vast
biochemistry ocean.
The following is a list of relevant publications by our group on
this topic:
Probabilistic Path Ranking Based on Adjacent Pairwise
Coexpression for Metabolic Transcripts Analysis.
Epigenetics is now a highly attention-paid concept, meaning a direct and
outer control of DNA sequences at a somewhat macroscopic level,
causing expression regulation.
The main player of epigenetics is histones, around which DNA can wind
up, and the modification of histones affects DNA winding,
resulting in suppression or enhancement of gene expression.
We have just started working on this project of capturing relevant
patterns of various factors in epigenetics, mainly those including
histone acetylations, transcription factors and gene expression.
Our current approach is based on clustering, but would be changed
depending upon the data we have to think about.
The following is the first paper we have published on this topic:
Genome-wide Integration on Transcription Factors, Histone Acetylation and Gene Expression Reveals Genes Co-regulated by Histone Modification Patterns.
The standard role of proteases is to digest target amino acid
sequences into the pile of amino acid junk, while there exist special
proteases, whose functions are to activate or modify the function of other
proteins.
So far it is not still well analyzed what substrates and what part of
substrates are likely to be digested by these special proteases, and
these problems must be a key issue to understand the biological
phenomena, in which these special proteases are involved.
We have formalized this topic as a sequence classification problem,
which is a typical setting of bioinformatics, and applied a multiple
kernel learning approach to predict the cleavage
site with a variety of information related with sequences.
The following is a list of relevant publications by our group on
this topic:
A Review of Statistical Methods for Prediction of Proteolytic Cleavage
One aspect of chemoinformatics is an old, classical issue known as
QSAR (Quantitative Structure Activity Relationships) in computer
chemistry, while another aspect is that a lot of modern machine
learning techniques have been applied to this field quite recently.
Extremely the objective is to capture principles behind drug-target
interactions, for which we have conducted mainly two approaches:
frequent pattern mining over graphs and generative model-based
learning for co-occurrent drug-target events in documents.
In the first approach, we have dealt with chemical compounds as
graphs by using chemical structures, and considered basically two
problem settings, both being based on frequent pattern mining.
One setting is to generate frequent (basically, but changeable to
significant) subgraph features from a given set of chemical compounds.
The other setting is to consider pairs of chemical compounds and
targets, which can be graphs and strings, and retrieve significant
pairs of subgraphs and substring.
The pairs obtained in the second approach can be clustered into
groups, which we think can be a key to understand the polypharmacology
of drug-target interactions.
The following is a list of relevant publications by our group on
this topic:
A Probabilistic Model for Mining Implicit ``Chemical Compound - Gene'' Relations from Literature.
Zhu, S., Okuno, Y., Tsujimoto, G. and Mamitsuka, H.
Proceedings of the Fourth
European Conference on Computational Biology (ECCB/JBI 2005),
(Bioinformatics, 21, Supplement 2, ii245-ii251),
Madrid, Spain, September, 2005, Oxford University Press.
We have considered a real biological setting, which we call "switching
mechanism", of gene expression, where one gene is correlated and the
other gene is anti-correlated with some specific gene.
In fact the switching mechanism is an important regulation mechanism
over
three-way interaction in
expression, which naturally needs a considerable amount of computation
time when considering all possible combinations of three genes.
We have developed an efficient and robust approach for detecting
switching mechanisms from gene expression, based on pruning techniques
and newly developed hypothesis testing.
The following is a list of relevant publications by our group on
this topic:
Efficiently Finding Genome-wide Three-way Gene Interactions from Transcript- and Genotype-Data.
Kayano, M., Takigawa, I., Shiga, M., Tsuda, K. and Mamitsuka, H.
Glycans, as sequences, have branch-shaped tree structures, which have
made high-speed sequencing hard, resulting in generating rich
sequence databases very slow.
However, we have already more than 10,000 known glycans, awaiting for
computational analysis.
Tree-shaped glycans can be labeled ordered trees in a computer science
sense, and we have developed a variety of techniques for mining from
glycans, including generative models for labeled ordered trees and
efficient frequent subtree mining.
The following is a list of relevant publications by our group on
this topic:
Probabilistic Model for Mining Labeled Ordered Trees: Capturing
Patterns in Carbohydrate Sugar Chains.
Ueda, N., Aoki-Kinoshita, K. F., Yamaguchi, A., Akutsu, T. and Mamitsuka, H.
IEEE Transactions on Knowledge and Data Engineering, 17(8),
1051-1064, 2005.
ProfilePSTMM: Capturing Tree-structure Motifs in
Carbohydrate Sugar Chains.
Aoki, K. F., Ueda, N., Mamitsuka, H. and Kanehisa, M.
Proceedings of the Fourteenth
International Conference on Intelligent Systems for
Molecular Biology (ISMB 2006),
(Bioinformatics, 22(14), e25-e34),
Fortaleza, Brazil, August, 2006, Oxford University Press.
Mining Significant Tree Patterns in Carbohydrate Sugar Chains.
Hashimoto, K., Takigawa, I., Shiga, M., Kanehisa, M. and Mamitsuka, H.
Proceedings of the Seventh
European Conference on Computational Biology (ECCB 2008),
(Bioinformatics, 24(16), 2008),
i167-i173, Cagliari, Sardinia-Italy, September, 2008, Oxford University Press.
Immunology would be one of the most advanced fields in biology, and
the immunological mechanism is now well elucidated, resulting in a lot
of good amount of databases to be analyzed.
We have focused on MHC (Major Histocompatibility Complex) molecules
which work for the initial stage of the immune system, and considered the
problem of predicting peptides binding to MHC molecules.
A simple prediction is by using a single predictor trained by a sorted
set of peptides which bind to a certain MHC molecule.
However our strategy is more flexible by incorporating a lot of other
available information, such as auxiliary MHC molecules and ensemble
classifiers.
The following is a list of relevant publications by our group on
this topic:
Improving MHC Binding Peptide Prediction by
Incorporating Binding Data of Auxiliary MHC Molecules.
Zhu, S., Udaka, K., Sidney, J., Sette, A., Aoki-Kinoshita, K. F. and
Mamitsuka, H.
Clustering documents is a major issue in natural language processing,
because of a huge number of currently accumulated documents for which
labels cannot be assigned manually.
This can be absolutely true of biomedical documents.
Currently over 19 million articles from life science journals are
indexed in MEDLINE, a public database of National Library of Medicine
in the U.S., and clustering them into groups, which can be the most
appropriate medical subfields, would be a great benefit for
biological/medical scientists to find publications in which they might
be interested the most.
For clustering biomedical documents, we have conducted several
different strategies, such as spectral learning and generative models.
The following is a list of relevant publications by our group on
this topic:
Enhancing MEDLINE Document
Clustering by Incorporating MeSH Semantic Similarity.