Estimation of Precision and Accuracy in Protein Structure Refinement from X-ray Data
A BBSRC Grant Application by
David S Moss, Ian J Tickle and Roman Laskowski
Birkbeck College, University of London
Objectives of the research
Biological relevance
The accuracy and precision required of an experimentally determined model of a macromolecule, such as a protein or DNA, depend on the biological questions being asked of the structure. Questions involving, say, the overall fold of a protein, or its topological similarity to other proteins, can be answered by structures of fairly low precision, such as those obtained from very low resolution X-ray crystal diffraction data. For example, three structures of the lactose operon repressor - one on its own, one complexed with its inducer, and one complexed with DNA - solved at 4.8Å resolution gave accurate positions for only the protein's alpha-carbon atoms (Lewis et al., 1996). However, despite the low resolution, these structures were able to show the overall conformation of the protein in both its induced and repressed states and to provide a framework for understanding the interactions it makes in performing its biological function.
Questions involving reaction mechanisms, on the other hand, require much greater accuracy and precision as obtained from well-refined, high-resolution X-ray structures, including proper statistical analyses of the standard uncertainties (s.u.'s) of atomic positions and bond lengths. The most accurate and precise structures are those solved by X-ray crystallography to atomic resolution (i.e. better than 1.2Å), and the number of such macromolecular structures is rapidly increasing (Dauter et al., 1997). Structures at this level of accuracy can begin to address detailed functional biological questions, as some of the following examples illustrate.
The protonation states of certain side-chains can often be determined from atomic resolution data. This is particularly significant for those side-chains that are part of a biochemical mechanism. One such example is the histidine in the Ser-His-Asp catalytic triad found in the serine proteases, triacylglycerol lipases and cutinase (Blow et al., 1969; Wright et al., 1969). In the atomic resolution structure of cutinase the hydrogen on the histidine is clearly visible in the electron density (Longhi et al., 1997). Such details provide a means of understanding the mechanism involved in the enzyme's activity. The protonation states of carboxylate groups have also been determined in an atomic resolution structure of triclinic lysozyme (Dauter et al., 1997). This was achieved by a combination of direct observation of the difference Fourier peaks and the C–O bond lengths, and again has functional implications.
The proteins rubredoxin and ferredoxin both contain iron-sulphur clusters to exploit their electron-transfer redox properties; rubredoxin has a single FeS4 cluster, while ferredoxin uses two Fe4S4 clusters. Measurement of the distortions of these clusters from ideal symmetry can be related to their redox potentials and help understand how these metallo-proteins achieve their biological functions (Dauter et al., 1997).
In other metallo-proteins it is often important to identify the metal bound in the active site. This may require very precise metal-ligand distances if metals with similar co-ordinating geometries, such as zinc and cadmium, are to be distinguished, or it may require other experimental techniques, such as atomic absorption spectroscopy, which was used to distinguish between iron and manganese in superoxide dismutase (Bunting et al., 1997). Alternatively, the question may be what a metal's ionisation state is rather than which metal is involved. In plastocyanin, for example, a protein involved in electron transfer in photosystem I, the geometry of the ligands to the copper changes upon oxidation. The changes involved are very small: 2-3 degrees in the metal-ligand bond angles and 0.04-0.21Å in the metal-ligand distances (Guss et al., 1986; Holm et al., 1996). Changes in the geometry of the copper site are also observed at different pH levels (Guss et al., 1986). Similarly, in haemoglobin the low oxygen affinity of the T state is manifested in tension in the iron-proximal histidine bond (Paoli et al., 1997).
Accurate atomic positions are also important for identifying hydrogen bonds and hydrogen-bonding networks, which are particularly important in enzyme active sites, and for confirming the presence of unusual hydrogen bonds. For example, unusually short hydrogen bonds (< 2.45Å), also known as low-barrier hydrogen bonds, had been postulated as having a major role in enzyme catalysis. The existence of such bonds was finally convincingly confirmed by two structures solved at atomic resolution (Wang et al., 1997).
The measurement of pore and channel sizes within protein structures also relies on accurate atomic positions and can be particularly important in distinguishing between competing theories of biological mechanism. For example, in cytochrome-c oxidase the locations of an O2 and a H2O channel have been proposed on the basis of a 2.8Å structure (Tsukihara et al., 1996), but their existence and functional significance have yet to be proved (Ferguson-Miller & Babcock, 1996).
Of course, not all macromolecular structures can be solved to atomic resolution, particularly where the structures or complexes are large. However, what is important for any experimentally determined structure is that an estimate of its accuracy and precision be given so that it is possible to decide what kind of biological conclusions can be justifiably drawn from it. Are differences/similarities in active site geometry significant? How accurate are metal-ligand or metal-metal bonds? Can protonation states be reliably determined? Do the given side-chain conformations have any meaning?
Earlier work on errors in protein structures
In assessing how reliable a structure may be, there are two broad categories of error to be considered: systematic errors, which affect how accurate the overall structure is, and random errors, which affect its precision. Systematic errors are not easy to detect, even in an apparently fully refined structure, particularly at lower resolution. The agreement between the model of the molecular structure and the X-ray diffraction data from which it has been derived is measured by the crystallographic R-factor, but it is well known that structures with acceptable values of this parameter can have significant errors (Brändén & Jones, 1990; Kleywegt & Jones, 1995). The R-factor can be manipulated by leaving out weak data or by overfitting the data with too many parameters, and so is not a completely reliable guide to accuracy. In small-molecule crystallography, where the number of X-ray intensity observations usually exceeds the number of parameters in the model by at least an order of magnitude, the R-factor is a surer guide to both accuracy and precision.
In 1992 Brünger introduced the idea of an Rfree (Brünger, 1992, 1993), based on the standard statistical modelling technique of jack-knifing or cross-validatory residuals (McCullagh & Nelder, 1983). Rfree is the same as the conventional R-factor, but based on a test set consisting of a small percentage (usually 5-10%) of reflections excluded from a structure refinement. The remaining reflections included in the refinement are known as the working set. The Rfree value, unlike the R-factor, cannot be driven down by refinement because the reflections on which it is based are excluded from this process. Consequently, a high value of this statistic and a concomitant low value of R may indicate an over-fitted or inaccurate model. The procedure assumes that the reflections removed for the cross-validation test have been randomly selected and have errors uncorrelated with those that remain in the set used in the refinement. This assumption may be partly invalidated by the presence of non-crystallographic symmetry. Ideally, the refinement should be repeated several times removing non-overlapping sets of reflections each time.
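The working-set/test-set split and the two statistics can be sketched in a few lines of Python. This is a minimal illustration with synthetic structure-factor amplitudes and an assumed 5% noise level; no real refinement is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_factor(f_obs, f_calc):
    """Conventional crystallographic R = sum|Fo - Fc| / sum|Fo|."""
    return np.sum(np.abs(f_obs - f_calc)) / np.sum(np.abs(f_obs))

# Synthetic structure-factor amplitudes (illustrative stand-ins only).
n_refl = 2000
f_obs = rng.uniform(10.0, 100.0, size=n_refl)
f_calc = f_obs * (1.0 + rng.normal(0.0, 0.05, size=n_refl))

# Reserve a random ~5% test set; refinement would use the working set only,
# so the test-set reflections can never be driven down by the minimiser.
test_mask = rng.random(n_refl) < 0.05
r_work = r_factor(f_obs[~test_mask], f_calc[~test_mask])
r_free = r_factor(f_obs[test_mask], f_calc[test_mask])
```

Because the test set here is chosen at random, repeating the split with non-overlapping test sets, as the text recommends, is a matter of cycling the mask over disjoint subsets.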
The Rfree is highly correlated with the phase accuracy of the atomic model (Brünger, 1992, 1993) and can detect various types of error in the structure, including phase errors and partial mistracing of the chain. It has also been used in evaluating different refinement protocols, such as the optimisation of the weights used during refinement. It is particularly useful in preventing the overfitting of data (Kleywegt & Brünger, 1996).
The use of Rfree is thus a valuable guide to the progress of refinement, particularly for low-resolution data, and its use and publication are widely encouraged. A recent review (Kleywegt & Brünger, 1996) indicated that use of the measure is becoming more widespread, being reported in 44% of articles describing macromolecular X-ray structures.
However, the usefulness of Rfree is limited by the fact that what is an "acceptable" value is often not evident. One would expect Rfree to always be higher than R even when there are no systematic errors in the model structure, but it is not clear how much higher it should be. At present we merely have a number of rules of thumb (Kleywegt & Brünger, 1996). Bacchi, Lamzin & Wilson (1996) use an extension of the self-validation Hamilton test to assess the significance of any observed drop in Rfree during refinement.
The random errors associated with the parameters of a refined crystal structure can be expressed in terms of standard uncertainties (s.u.'s). These can be calculated and quoted for the x-, y- and z-co-ordinates and root-mean-square displacements (U-values) of each atom in the model structure. In small-molecule refinement s.u.'s are routinely calculated and are indeed required by some journals as a precondition for publication. For large molecules such as proteins, however, the calculations have not been regularly performed to date, being considered too demanding of memory and computing time.
Most often s.u.'s are obtained during structure refinement by calculating the least-squares covariance matrix from the inverse of the normal equations matrix (Cruickshank, 1965), as implemented, for example, in the small-molecule refinement package SHELX (Sheldrick, 1985; Robinson & Sheldrick, 1988). For proteins the matrices involved are very large (typically several thousand rows and columns), and the accumulation of the normal matrix, which is an order n⁴ process, soon becomes unfeasible. Over the past ten years protein crystallographers have used various techniques for estimating the random errors in the co-ordinates of their model structures. These include the Luzzati plot (Luzzati, 1952) and the σA plot of Read (1986), both of which provide an estimate of the average positional error of a structure's co-ordinates. The Luzzati method assumes that the positional errors are normally distributed and that they alone account for the differences between Fo and Fc for all reflections (Cruickshank, 1996). Theoretical relationships between scattering angle and R-factor, assuming different values of the average error, are then used to estimate the average error from the observed data. More recent attempts to estimate errors have included: the 'residue R-factor' of Jones et al. (1991); the tabulated R indices of Elango & Parthasarathy (1990); the refinement protocol of Carson et al. (1994), which uses temperature factors, real-space fit residuals, geometric strains, dihedral angles and shifts from the previous refinement cycle; the discriminator of Sevcik et al. (1993), which assesses the likely error on each atom in terms of its temperature factor divided by its electron density in the final 2Fo - Fc map; an empirically derived six-parameter equation of Stroud and Fauman (1995); and the use of the diagonal elements of the inverse normal matrix in a final cycle of unrestrained least-squares refinement to give an estimate of the radial errors in atomic positions (Holland et al., 1990).
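The route from normal matrix to s.u.'s can be sketched on a toy linear least-squares problem. The design matrix, noise level and parameter count below are illustrative stand-ins for the linearised structure-factor equations, not a real refinement.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear model y = X p + noise, standing in for the linearised
# least-squares problem of refinement (illustrative dimensions only).
n_obs, n_par = 200, 5
X = rng.normal(size=(n_obs, n_par))
p_true = np.arange(1.0, n_par + 1.0)
sigma = 0.1
y = X @ p_true + rng.normal(0.0, sigma, size=n_obs)

# Normal matrix H = X^T X; its inverse, scaled by the residual variance,
# gives the least-squares variance-covariance matrix (Cruickshank, 1965).
H = X.T @ X
p_hat = np.linalg.solve(H, X.T @ y)
dof = n_obs - n_par
s2 = np.sum((y - X @ p_hat) ** 2) / dof
cov = s2 * np.linalg.inv(H)

# Standard uncertainties are the square roots of the diagonal elements.
su = np.sqrt(np.diag(cov))
```

For a protein, n_par runs into the tens of thousands, which is why accumulating and inverting H in full is the bottleneck the text describes.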
Various studies have been performed to assess the accuracy of these estimation procedures. Fields et al. (1994) performed two independent refinements, using different refinement programs, on a single set of synchrotron X-ray data at 1.6Å resolution. The r.m.s. differences between the final models were 0.08Å for all backbone atoms, while the estimated maximum average error from the Luzzati plot was 0.13Å. Daopin et al. (1994) compared four different estimation methods, two for calculating local errors and two (the Luzzati and σA plots) for overall errors, finding the methods to be in good agreement. On the other hand, Ohlendorf (1994) compared four independently refined X-ray crystal structures of human interleukin 1β, first re-refining them against a common data set to minimise the effects of different data sets and refinement protocols. He found that the final structures differed from one another by 0.84Å, which was roughly three times the error predicted by the Luzzati plots. Murshudov and Dodson (1997) have developed the theory of Cruickshank (1996) and used it to explore the relationship between temperature factors and positional errors estimated from a diagonal second-derivative matrix.
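The r.m.s. co-ordinate differences quoted in these comparisons are computed essentially as follows. The sketch uses hypothetical co-ordinate sets that are assumed to be already superposed and identically indexed.

```python
import numpy as np

def rms_difference(coords_a, coords_b):
    """Root-mean-square difference between two equally indexed
    co-ordinate sets (assumes the models are already superposed)."""
    d = coords_a - coords_b
    return np.sqrt(np.mean(np.sum(d * d, axis=1)))

# Two hypothetical backbone traces differing by small random shifts
# (0.05 Å per co-ordinate); values are illustrative only.
rng = np.random.default_rng(2)
model_a = rng.uniform(0.0, 50.0, size=(100, 3))
model_b = model_a + rng.normal(0.0, 0.05, size=(100, 3))
rmsd = rms_difference(model_a, model_b)
```

For real model pairs a superposition step (e.g. a least-squares fit of one model onto the other) would precede this calculation.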
Previous work of the applicants
Influence of errors on Rfree and the Rratio
Dodson, Kleywegt & Wilson (1996) have highlighted the need for more understanding of the behaviour of Rfree. In spite of the enthusiasm for its use, actual applications of Rfree have remained somewhat subjective without an understanding of its statistical basis. For example, if non-crystallographic symmetry constraints are relaxed during a structure refinement, how much should Rfree rise during subsequent refinement if the restrained model is correct? Without understanding how Rfree varies as a function of the number of restraints and/or number of parameters it is only possible to make rather subjective judgements. In order to use Rfree to discriminate against wrong structures, it is important to know how much larger than R it should be when the model is correct. Rfree is expected to be larger than the R-factor from the working set (Rinc) even when a correct model is fully refined. We have addressed some of these issues (Tickle, Laskowski & Moss, 1998b) and have shown that the expected value of the ratio of the R-factors calculated from the working set and the test set in an unrestrained refinement is given by

Rratio = Rfree/Rinc = sqrt[(f + m) / (f - m)]

and the corresponding quantity for restrained refinement is given by

Rratio = Rfree/Rinc = sqrt[(f + (m - r + Drest)) / (f - (m - r + Drest))]

where f is the number of X-ray observations, m is the number of parameters, r is the number of restraints and Drest is the contribution to the residual from the restraint terms. We analysed 725 structures in the Protein Data Bank (Bernstein et al., 1977) and showed that the observed Rratio values varied with the number of parameters and restraints in the way predicted by theory.
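Both expressions are straightforward to evaluate; the sketch below codes them directly, with an assumed, illustrative observation and parameter count.

```python
import math

def r_ratio_unrestrained(f, m):
    """Expected Rfree/Rinc for unrestrained refinement
    (Tickle, Laskowski & Moss, 1998b)."""
    return math.sqrt((f + m) / (f - m))

def r_ratio_restrained(f, m, r, d_rest):
    """Expected Rfree/Rinc when r restraints contribute d_rest
    to the residual; reduces to the unrestrained case when r = d_rest = 0."""
    m_eff = m - r + d_rest
    return math.sqrt((f + m_eff) / (f - m_eff))

# Illustrative counts: 40000 reflections, 8000 positional/thermal parameters.
ratio = r_ratio_unrestrained(40000, 8000)  # sqrt(48000/32000) ~ 1.22
```

Note that restraints act like a reduction in the effective number of parameters, so adding restraints (at fixed f and m) pulls the expected Rratio back towards 1.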
Random errors
Many crystallographers have used PROCHECK (Laskowski et al., 1993) to judge the quality of their protein structure models. PROCHECK allows the crystallographer to compare the variation in protein geometry with similar geometry observed in carefully refined small-molecule structures, but gives no guidance as to how this comparison should be made. In a recent paper (Tickle, Laskowski & Moss, 1998a) we derived the statistically expected values of geometric residuals arising from the least-squares refinement of macromolecular structures. We have shown that at the convergence of a refinement the expected value of a squared distance residual is given by
σ² - aᵀH⁻¹a

where σ is the observed standard deviation of the bond length or angle taken from small-molecule structures, H is the normal matrix and a is a row from the geometrical contribution to the normal matrix. In the same paper we calculated s.u.'s of positional parameters for two proteins, using the full matrix of the normal equations of least-squares refinement. We described how these calculations have been implemented in a least-squares refinement program and presented the results for two trial proteins, Gamma-B and Beta-B2 crystallin. In particular, we analysed the relationship between the positional s.u.'s, restraints and atomic B-values.
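Once H and a are available, the expected squared residual is a single quadratic form. The sketch below uses a small, randomly generated positive-definite H and an illustrative restraint row a, scaled so that the quadratic form stays below σ²; none of the values come from a real refinement.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy positive-definite normal matrix H and geometric restraint row a
# (illustrative values only, not from a real refinement).
n = 6
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)           # well conditioned, positive definite
sigma = 0.02                          # target s.d. of a bond-length restraint
a = 0.3 * sigma * rng.normal(size=n)  # scaled so a^T H^-1 a < sigma^2

# Expected squared geometric residual at convergence:
#   E[d^2] = sigma^2 - a^T H^{-1} a
# Solving H x = a avoids forming H^-1 explicitly.
quad = a @ np.linalg.solve(H, a)
expected_sq = sigma ** 2 - quad
```

Since H is positive definite, the quadratic form is positive, so the expected residual at convergence is always smaller than σ²: the restraints tighten the geometry below the small-molecule spread.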
Proposed Investigation
In order that protein crystallography should come of age, there is a clear need for quantitative estimates of all model errors. We need to be able to check that model structures are free from gross errors, particularly at lower resolution. We also need to know the precision of individual atomic co-ordinates at the end of a restrained refinement. These two needs correspond to the two parts of our work programme.
Systematic errors in X-ray structure determination
The first part of our investigation would explore the use of the Rratio for detecting problems during structure refinement. We would investigate:
Our work would use structures determined in house and would also draw on our experience of incorrect protein models that have occurred during structure determinations.
Estimation of random errors in protein co-ordinates
In the second part of our investigation we would investigate methods of determining errors in individual protein atomic co-ordinates that do not require full-matrix methods. Full-matrix calculations may not be routinely feasible for medium or large protein structures for a number of years: many hours of computing time on a fast workstation are needed to calculate the variance-covariance matrix for such structures. We shall therefore investigate approximate methods of co-ordinate error determination and check them against full-matrix estimates. We shall also look into statistics such as Rfree which may warn of overfitting. The following would be the objectives of our work.
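The comparison we propose, full-matrix s.u.'s against a cheap diagonal approximation, can be illustrated on a toy normal matrix (illustrative values only; a real block-matrix scheme would invert diagonal blocks rather than single elements).

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy normal matrix: diagonally dominant and positive definite, as
# restrained refinement matrices tend to be (illustrative values only).
n = 8
B = 0.1 * rng.normal(size=(n, n))
H = B @ B.T + np.eye(n)

# Full-matrix s.u. estimates: square roots of the diagonal of H^-1.
su_full = np.sqrt(np.diag(np.linalg.inv(H)))

# Cheap diagonal approximation, ignoring inter-parameter coupling;
# for positive-definite H, (H^-1)_ii >= 1/H_ii, so this approximation
# never overestimates the full-matrix s.u.
su_diag = np.sqrt(1.0 / np.diag(H))

# Relative discrepancy shows how much the approximation underestimates.
rel_err = (su_full - su_diag) / su_full
```

Checking such discrepancies against full-matrix results on small and medium proteins is exactly the kind of validation the proposed work would require.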
All software developed under this project would be made available to the crystallographic community through CCP4.
Resources requested
We are requesting funds for a post-doctoral research assistant who would be able to undertake code development for the above work. This work requires a person with a good background in protein refinement, knowledge of statistics and significant programming skills. The project is intensely computational and we therefore request small contributions to the salaries of our computer and network support staff. To date our CPU-intensive work has taken place on a Silicon Graphics Power Challenge at the University of Southampton, courtesy of Dr Steve Wood. This arrangement is not satisfactory and we are therefore requesting funds for a DEC workstation with 1.5 gigabytes of central memory. This would allow us to undertake the full-matrix error calculations on small and medium-sized proteins that would be necessary to help validate the s.u. estimates from block-matrix methods. We will need to back up large matrices for further analysis and are therefore requesting 40 DAT tapes.
References