Homology modeling

Principles

Experimental techniques for determining the 3D structure of biological macromolecules have progressed significantly in recent years. As a consequence, the number of known 3D structures is increasing continuously, with about 35,000 structures currently available.

However, these structures still cover only a small fraction of the proteome. It is therefore of major interest to use in silico approaches to build theoretical models of protein structures, which can then be used to study a protein's structure/function relationships and to direct further experimental work.

One class of methods for generating an atom-based structural model of a protein from its amino acid sequence is called homology modeling. The technique is based on the observation that protein tertiary structure is better conserved than amino acid sequence. Consequently, proteins that share significant sequence similarity can also be expected to share significant structural similarity.

The homology modeling procedure can be broken down into several steps. First, template structures are selected. These templates are proteins that share significant sequence similarity with the target protein (ideally more than 30% sequence identity) and for which experimental 3D structures are available. Next, the sequences of the target protein and the templates are aligned. From the sequence alignments and the 3D structures of the templates, geometrical criteria are derived and used to build a 3D structural model of the target protein. Finally, this structural model is assessed using statistical potentials or physics-based energy calculations.
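The template selection step can be illustrated with a minimal Python sketch. It assumes that pairwise alignments are already available as gapped strings of equal length, with "-" as the gap character, and it computes percent identity over columns where both sequences have a residue. This is only one of several conventions for the denominator. The function names, the toy sequences, and the template labels are hypothetical and not taken from any particular modeling package.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption: both strings come from the same pairwise alignment,
    so they have equal length and use '-' for gaps.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned_columns = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        aligned_columns += 1
        if a.upper() == b.upper():
            identical += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * identical / aligned_columns


def select_templates(alignments, min_identity=30.0):
    """Keep templates with more than `min_identity` percent identity.

    `alignments` maps a template name to a tuple
    (aligned_target, aligned_template). The result is a list of
    (name, identity) pairs sorted from highest to lowest identity.
    """
    scored = [
        (name, percent_identity(target, template))
        for name, (target, template) in alignments.items()
    ]
    kept = [(name, ident) for name, ident in scored if ident > min_identity]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy alignments (hypothetical sequences, for illustration only).
toy_alignments = {
    "tmplA": ("ACDEFG", "ACDEFG"),
    "tmplB": ("ACDEFG", "TTTTFG"),
    "tmplC": ("ACDEFG", "TTTTTG"),
}
selected = select_templates(toy_alignments)
```

In this toy case, tmplA and tmplB pass the 30% cutoff and tmplC is discarded. In practice the alignments would come from a dedicated sequence alignment tool, and the cutoff is a guideline rather than a strict rule.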




Figure: Principles of homology modeling


Limitations

The accuracy that can be expected from homology modeling depends strongly on the sequence identity between the target and the templates. A sequence identity above 50% generally leads to reliable models, with only limited errors in side-chain and loop positioning. This typical error is comparable to the typical resolution of a structure solved by NMR. Larger errors can be expected when the identity is around 30 to 50%. Below 30% identity, serious errors can occur, which may lead to a misfolded protein model. Large errors can also occur in regions of the protein that share little sequence identity with the templates, even when the rest of the protein exhibits high sequence identity.
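As a rough illustration of these guidelines, the following Python sketch maps a global sequence identity to an expected reliability tier and flags alignment windows with low local identity. The sharp cutoffs at 30% and 50%, the tier labels, the window size, and the choice to count gapped columns as mismatches are all assumptions made for the example. The text above gives approximate ranges only, and the sequences are toy data.

```python
def reliability_tier(identity_percent: float) -> str:
    """Map global sequence identity to a coarse reliability label.

    Assumption: the approximate ranges are treated as sharp cutoffs.
    """
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("identity must be between 0 and 100")
    if identity_percent > 50.0:
        return "high"    # limited side-chain and loop errors
    if identity_percent >= 30.0:
        return "medium"  # larger errors expected
    return "low"         # serious errors, possible misfold


def low_identity_windows(aligned_target, aligned_template,
                         window=10, threshold=30.0):
    """Return (start, end, identity) for windows below `threshold`.

    Uses non-overlapping windows along the alignment. Columns that
    contain a gap ('-') are counted as mismatches (an assumption).
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    flagged = []
    for start in range(0, len(aligned_target), window):
        end = min(start + window, len(aligned_target))
        matches = sum(
            1
            for a, b in zip(aligned_target[start:end],
                            aligned_template[start:end])
            if a == b and a != "-"
        )
        identity = 100.0 * matches / (end - start)
        if identity < threshold:
            flagged.append((start, end, identity))
    return flagged


# Toy alignment: 12 of 20 positions identical (60% overall), but the
# second half of the alignment is poorly conserved.
target = "ACDEFGHIKLMNPQRSTVWY"
template = "ACDEFGHIKLTTTTTTTTTY"
tier = reliability_tier(60.0)
weak_regions = low_identity_windows(target, template)
```

Here the global identity falls in the "high" tier, yet the second window is flagged, which mirrors the point that poorly conserved regions can be modeled badly even when the overall identity is high.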