The structural elements that are important in the formation of the threedimensional
structure of a protein, i.e. the secondary structure of a protein,
are segments of consecutive amino acids forming regular structural patterns
called -helices and -strands. An α-helix is a winding spiral that stabilizes by
forming hydrogen bonds between neighboring turns, and a β-strand is a straight
segment of the amino acids sequence that stabilizes by forming hydrogen bonds
between neighboring -strands that run in parallel forming β-sheets.
Predicting protein secondary structure is usually addressed by pattern recognition
methods such as neural networks [164, 168] or hiddenMarkov models .
The general idea when using hiddenMarkov models to predict protein secondary
structure is to construct a model in which the most likely sequence of states
that generates a given amino acid sequence can be interpreted as deciding for
each amino acid whether it is located on a α-helix or on a β-strand. Knowledge
about the secondary structure of a protein is useful, but knowing its tertiary
structure, i.e. the three-dimensional structure of the native state of the protein,
is much more useful. The tertiary structure contains the full description of
the secondary structure, and is an important step towards understanding the
quaternary structure, and hopefully the functionality of the protein.
Modeling Protein Structure
To computationally predict the structure of a protein using a free energy model,
we must decide how to model the protein, how to model the possible conformations
of the protein, and how to model the free energy of a conformation. These
decisions influence the level of detail of the predicted protein structure. Figure
4.2 shows three possible levels of detail. In the following we will consider
possible free energy models for protein structure prediction which gradually
decrease the level of detail of the predicted protein structure.
In chemistry a molecule is usually viewed as a collection of atoms connected
by bonds. Using this as the model for proteins, we can specify the tertiary struc-
Figure 4.2: The two leftmost gures shows the structure of a hemoglobin protein
from the same angle but with a dierent level of detail. In the gure to the
left all atoms except hydrogen atoms are present. In the middle gure only the
backbone of the protein is shown. The gure to the right is an example of a
structure in the two-dimensional hydrophobic-hydrophilic lattice model.
ture of a protein by stating the angle, the length, and the torsion of every bond
in its structure. This is a very complex description that involves information
about every atom in the protein. To reduce the complexity of the description
some atoms can be omitted, or some atoms can be grouped together into larger
units which are treated as single atoms in the model. Such reductions aect the
level of detail by reducing the visual equivalence between native conformations
in the model and native conformations in the real world.
A model with a detailed description of the protein structure which involves
information about individual atoms, is often referred to as an analytic model.
The free energy function in an analytic model is most often specied by terms
contributed by bonded and non-bonded atoms. For bonded atoms the terms
depend on the bond lengths, the bond angles, and the bond torsions. For nonbonded
atoms the terms depend on physical principles such as Coulombic and
van der Walls forces, or statistical information inferred from known structures,
such as mean force potentials . The detailed description of the protein
structure, and the many parameters of the free energy function, hint that the
structure prediction problem in an analytic model is computationally hard. It
is therefore not surprising that Ngo and Marks in  prove that the structure
prediction problem in a typical analytic model is NP hard.
One way to reduce the complexity of an analytic model is to limit the bond
lengths, angles and torsions to xed sets of legal values. This makes it possible
to enumerate all legal conformations of a protein, and, if time permits, to solve
the structure prediction problem by considering each of the nitely many legal
conformations of a protein. Pedersen and Moult in  suggest that he sets of
legal values should be compiled from known protein structures.
Another approach to protein structure prediction is the threading problem
suggested by Jones, Taylor and Thornton in . This approach can be
thought of as an analytic model in which bond lengths, angles and torsions
are limited to those occurring in a xed set of known structures. The general
idea of threading is to compare a protein with unknown structure against a set
of known structures, and to predict its structure as the known structure that
ts best according to some energy function. The legal conformations of a protein
is thus limited to a set of known structures. The name threading follows
from the mental picture of threading the amino acids of the protein with unknown
structure onto every one of the known structures in order to determine
the known structure that ts best. Threading is motivated by the belief that
the number of essential dierent protein structures is small compared to the
number of dierent proteins , which should make it possible to collect a representative
collection of known protein structures to compare against. Treading
methods using branch-and-bound techniques  and genetic algorithms 
have been used successfully. A polynomial time algorithm has been proposed for
a specic class of threading problems , but the general threading problem
has been proven NP complete by Lathrop in .
For computational and modeling purposes it might be desirable to limit the
legal bond lengths, angles and torsions to the point where legal conformations
of a protein are reduced to embeddings of its atoms into a lattice. A model with
this property is often referred to as a lattice model. A lattice model is most often
subject to further simplications than an analytic model with similar bounds
on legal bond lengths, angles and torsions. Typically the protein is modeled
only as a sequence of bonded amino acids, the legal conformations limited to
self-avoiding embeddings of the sequence of amino acids into a lattice, where
bonded amino acids are required to occupy neighboring lattice points, and the
free energy of a conformation specied only in terms of the pairs of non-bonded
amino acids that occupy neighboring lattice points. In short, lattice models
aim to simplify analytic models by avoiding the atomic details.
The most widely used type of lattice is the two- or three-dimensional square
lattice. The rightmost gure in Figure 4.2 shows a conformation of a protein
in a two-dimensional square lattice. It is obvious that there is little visual
equivalence between the native conformation in a square lattice model and
the native conformation in the real world. But Dill et al. in  describe
experiments that support some behavioral equivalence between the structure
formation process in square lattice models and the structure formation process
in the real world. The behavioral equivalence is also addressed by Sali et al.
in , who based on experiments using square lattice models suggest that
only proteins for which the structure of minimum free energy has signicantly
less free energy than other structures of the protein, can be expected to fold
into the structure of minimum free energy. Lattice models have become popular
because of their simplicity which allows the structure prediction problem to be
solved by considering each of the nitely many legal conformations of a protein.
This, of course, is only feasible for small proteins, but still useful for studying
the behavioral equivalences. The structure prediction problem has been proven
NP complete in several square lattice models, e.g. [57, 157, 190, 26, 40].
The hydrophobic-hydrophilic model proposed by Dill in  is one of the
simplest, but most popular square lattice models. It models the belief that a
major contribution to the free energy of the native conformation of a protein
is due to interactions between hydrophobic amino acids, which tend to form a
core in the spatial structure that is shielded from the surrounding solvent by
hydrophilic amino acids. The model is called the HP model, where H stands for
hydrophobic, and P stands for polar. In the HP model a protein is abstracted
as a sequence of amino acids where each amino acid is classied as being either
hydrophobic or hydrophilic. If we use \0″ to denote a hydrophilic amino acid,
and \1″ to denote a hydrophobic amino acid, then we can describe this abstraction
of a protein as a string over the alphabet f0; 1g. A legal conformation, or
folding, of a protein in the HP model is a self-avoiding embedding of its abstraction
as a string into the two- or three-dimensional square lattice such that
adjacent characters in the string, i.e. adjacent amino acids in the sequence, occupy
adjacent lattice points. The free energy of a conformation is proportional
to the number of non-local hydrophobic bonds. A non-local hydrophobic bond
is a pair of non-adjacent hydrophobic amino acids that occupy adjacent lattice
points. The HP model is described in more details in Section 10.2.
The rightmost gure in Figure 4.2 shows a conformation in the 2D HP model
with nine non-local hydrophobic bonds. Several heuristics have been applied to
predict the native conformation of a protein in the HP model, e.g. [191, 207],
but most interestingly from our point of view is that the HP model was the
rst reasonable model for protein structure formation in which approximation
algorithms for the protein structure prediction problem were formulated. This
was done by Hart and Istrail in . For a while it was even believed that
the protein structure prediction problem in the HP model would be solvable in
polynomial time, but recently it has been shown NP complete in [26, 40].
The hardness of the protein structure prediction problem, even in very simple
free energy models such as the two-dimensional HP model, makes it apparent
that in order to computationally investigate protein structure formation
one has to think about heuristics, approximation algorithms, or other modeling
approaches. If someone claims that an ecient method for predicting the
atomic details of a protein structure will be available soon, he or she certainly
risks being classied according to the quote beginning this chapter.