Reference:
T. Ioerger, L. Rendell, and S. Subramaniam (1995). Searching for representations to improve protein sequence fold-class prediction. Machine Learning, 21:151-175.
Abstract:
Predicting the fold, or approximate 3D structure, of a protein from its amino
acid sequence is an important problem in biology. The homology modeling
approach uses a protein database to identify fold-class relationships by
sequence similarity. The main limitation of this method is that some proteins
with similar structures appear to have very different sequences, which we call
the "hidden-homology problem." As in other real-world domains for machine
learning, this difficulty may be caused by a low-level representation.
Learning in such domains can be improved by using domain knowledge to search
for representations that better match the inductive bias of a preferred
algorithm. In this domain, knowledge of amino acid properties can be used to
construct higher-level representations of protein sequences. In one
experiment using a 179-protein data set, the accuracy of fold-class prediction
was increased from 77.7% to 81.0%. The search results are analyzed to refine
the grouping of small residues suggested by Dayhoff. Finally, an extension
incorporates sequential context directly into the representation, which can
express finer relationships among the amino acids.
The methods developed in this domain are generalized into a framework that
suggests several systematic roles for domain knowledge in machine learning.
Knowledge may define both a space of alternative representations and a
strategy for searching this space. The search results may be summarized to
extract feedback for revising the domain knowledge.
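
To illustrate how knowledge of amino acid properties can yield a higher-level
representation, the following minimal sketch re-encodes sequences using a
reduced alphabet and compares them before and after. It is not the authors'
method: the six-class grouping shown is the commonly cited Dayhoff
classification, assumed here for illustration, and the names `DAYHOFF_GROUPS`,
`encode`, and `identity`, as well as the two toy sequences, are hypothetical.

```python
# Commonly cited six-class Dayhoff grouping of the 20 amino acids.
# Assumption for illustration only; the paper searches over alternative
# groupings and does not necessarily use this one.
DAYHOFF_GROUPS = {
    "a": "C",      # cysteine
    "b": "AGPST",  # small residues
    "c": "DENQ",   # acidic residues and their amides
    "d": "HKR",    # basic residues
    "e": "ILMV",   # aliphatic hydrophobic residues
    "f": "FWY",    # aromatic residues
}

# Invert the grouping into a residue -> class-label lookup table.
RESIDUE_TO_CLASS = {
    residue: label
    for label, residues in DAYHOFF_GROUPS.items()
    for residue in residues
}


def encode(sequence):
    """Rewrite a protein sequence in the reduced (class-label) alphabet."""
    return "".join(RESIDUE_TO_CLASS[r] for r in sequence.upper())


def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up, pre-aligned toy sequences (not real proteins).
seq1 = "AKLVDE"
seq2 = "GRIMEQ"

raw_similarity = identity(seq1, seq2)
grouped_similarity = identity(encode(seq1), encode(seq2))
print(raw_similarity, grouped_similarity)
```

In this toy case the two sequences share no residues position by position, yet
they become identical once each residue is replaced by its class label, which
is the kind of similarity a low-level representation can hide. A search over
representations, as described in the abstract, would compare many candidate
groupings rather than fix one in advance.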