Estimation of Precision and Accuracy in Protein Structure Refinement from X-ray Data
A BBSRC Grant Application by
David S Moss, Ian J Tickle and Roman Laskowski
Birkbeck College, University of London
Objectives of the research
Biological relevance
The accuracy and precision required of an experimentally determined model of a macromolecule, such as a protein or DNA, depend on the biological questions being asked of the structure. Questions involving, say, the overall fold of a protein, or its topological similarity to other proteins, can be answered by structures of fairly low precision, such as those obtained from very low resolution X-ray crystal diffraction data. For example, three structures of the lactose operon repressor - one on its own, one complexed with its inducer, and one complexed with DNA - solved at 4.8Å resolution gave accurate positions for only the protein's alpha-carbon atoms (Lewis et al., 1996). However, despite the low resolution, these structures were able to show the overall conformation of the protein in both its induced and repressed states and to provide a framework for understanding the interactions it makes in performing its biological function.
Questions involving reaction mechanisms, on the other hand, require much greater accuracy and precision as obtained from well-refined, high-resolution X-ray structures, including proper statistical analyses of the standard uncertainties (s.u.'s) of atomic positions and bond lengths. The most accurate and precise structures are those solved by X-ray crystallography to atomic resolution (i.e. better than 1.2Å), and the number of such macromolecular structures is rapidly increasing (Dauter et al., 1997). Structures at this level of accuracy can begin to address detailed functional biological questions, as some of the following examples illustrate.
The protonation states of certain side-chains can often be determined from atomic resolution data. This is particularly significant for those side-chains that are part of a biochemical mechanism. One such example is the histidine in the Ser-His-Asp catalytic triad found in the serine proteases, triacylglycerol lipases and cutinase (Blow et al., 1969; Wright et al., 1969). In the atomic resolution structure of cutinase the hydrogen on the histidine is clearly visible in the electron density (Longhi et al., 1997). Such details provide a means of understanding the mechanism involved in the enzyme's activity. The protonation states of carboxylate groups have also been determined in an atomic resolution structure of triclinic lysozyme (Dauter et al., 1997). This was achieved by a combination of direct observation of the difference Fourier peaks and the C–O bond lengths, and again has functional implications.
The proteins rubredoxin and ferredoxin both contain iron-sulphur clusters to exploit their electron-transfer redox properties; rubredoxin has a single FeS4 cluster, while ferredoxin uses two Fe4S4 clusters. Measurement of the distortions of these clusters from ideal symmetry can be related to their redox potentials and help understand how these metallo-proteins achieve their biological functions (Dauter et al., 1997).
In other metallo-proteins it is often important to identify the metal bound in the active site. This may require very precise metal-ligand distances if metals with similar co-ordinating geometries, such as zinc and cadmium, are to be distinguished, or it may require other experimental techniques, such as atomic absorption spectroscopy, which was used to distinguish between iron and manganese in superoxide dismutase (Bunting et al., 1997). Alternatively, the question may be what a metal's ionisation state is rather than which metal is involved. In plastocyanin, for example, a protein involved in electron transfer in photosystem I, the geometry of the ligands to the copper changes upon oxidation. The changes involved are very small: 2-3 degrees in the metal-ligand bond angles and 0.04-0.21Å in the metal-ligand distances (Guss et al., 1986; Holm et al., 1996). Changes in the geometry of the copper site are also observed at different pH levels (Guss et al., 1986). Similarly, in haemoglobin the low oxygen affinity of the T state is manifested in tension in the iron-proximal histidine bond (Paoli et al., 1997).
Accurate atomic positions are also important for identifying hydrogen bonds and hydrogen-bonding networks, which are particularly important in enzyme active sites, and for confirming the presence of unusual hydrogen bonds. For example, unusually short hydrogen bonds (< 2.45Å), also known as low-barrier hydrogen bonds, had been postulated as having a major role in enzyme catalysis. The existence of such bonds was finally convincingly confirmed by two structures solved at atomic resolution (Wang et al., 1997).
The measurement of pore and channel sizes within protein structures also relies on accurate atomic positions and can be particularly important in distinguishing between competing theories of biological mechanism. For example, in cytochrome-c oxidase the locations of an O2 and a H2O channel have been proposed on the basis of a 2.8Å structure (Tsukihara et al., 1996), but their existence and functional significance have yet to be proved (Ferguson-Miller & Babcock, 1996).
Of course, not all macromolecular structures can be solved to atomic resolution, particularly where the structures or complexes are large. However, what is important for any experimentally determined structure is that an estimate of its accuracy and precision be given so that it is possible to decide what kind of biological conclusions can be justifiably drawn from it. Are differences/similarities in active site geometry significant? How accurate are metal-ligand or metal-metal bonds? Can protonation states be reliably determined? Do the given side-chain conformations have any meaning?
Earlier work on errors in protein structures
In assessing how reliable a structure may be, there are two broad categories of error to be considered: systematic errors, which affect how accurate the overall structure is, and random errors, which affect its precision. Systematic errors are not easy to detect, even in an apparently fully refined structure, particularly at lower resolution. The agreement between the model of the molecular structure and the X-ray diffraction data from which it has been derived is measured by the crystallographic R-factor, but it is well known that structures with acceptable values of this parameter can have significant errors (Brändén & Jones, 1990; Kleywegt & Jones, 1995). The R-factor can be manipulated by leaving out weak data or by overfitting the data with too many parameters, and so is not a completely reliable guide to accuracy. In small-molecule crystallography, where the number of X-ray intensity observations usually exceeds the number of parameters in the model by at least an order of magnitude, the R-factor is a surer guide to both accuracy and precision.
In 1992 Brünger introduced the idea of an Rfree (Brünger, 1992, 1993), based on the standard statistical modelling technique of jack-knifing or cross-validatory residuals (McCullagh & Nelder, 1983). Rfree is the same as the conventional R-factor, but based on a test set consisting of a small percentage (usually 5-10%) of reflections excluded from a structure refinement. The remaining reflections included in the refinement are known as the working set. The Rfree value, unlike the R-factor, cannot be driven down by refinement because the reflections on which it is based are excluded from this process. Consequently, a high value of this statistic and a concomitant low value of R may indicate an over-fitted or inaccurate model. The procedure assumes that the reflections removed for the cross-validation test have been randomly selected and have errors uncorrelated with those that remain in the set used in the refinement. This assumption may be partly invalidated by the presence of non-crystallographic symmetry. Ideally, the refinement should be repeated several times removing non-overlapping sets of reflections each time.
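The working-set/test-set split and the two statistics can be sketched in a few lines of Python. This is a minimal illustration with synthetic structure-factor amplitudes and an assumed 5% noise level; no real refinement is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_factor(f_obs, f_calc):
    """Conventional crystallographic R = sum|Fo - Fc| / sum|Fo|."""
    return np.sum(np.abs(f_obs - f_calc)) / np.sum(np.abs(f_obs))

# Synthetic structure-factor amplitudes (illustrative stand-ins only).
n_refl = 2000
f_obs = rng.uniform(10.0, 100.0, size=n_refl)
f_calc = f_obs * (1.0 + rng.normal(0.0, 0.05, size=n_refl))

# Reserve a random ~5% test set; refinement would use the working set only,
# so the test-set reflections can never be driven down by the minimiser.
test_mask = rng.random(n_refl) < 0.05
r_work = r_factor(f_obs[~test_mask], f_calc[~test_mask])
r_free = r_factor(f_obs[test_mask], f_calc[test_mask])
```

Because the test set here is chosen at random, repeating the split with non-overlapping test sets, as the text recommends, is a matter of cycling the mask over disjoint subsets.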
The Rfree is highly correlated with the phase accuracy of the atomic model (Brünger, 1992, 1993) and can detect various types of error in the structure, including phase errors and partial mistracing of the chain. It has also been used in evaluating different refinement protocols, such as the optimisation of the weights used during refinement. It is particularly useful in preventing the overfitting of data (Kleywegt & Brünger, 1996).
The use of Rfree is thus a valuable guide to the progress of refinement, particularly for low-resolution data, and its use and publication are widely encouraged. A recent review (Kleywegt & Brünger, 1996) indicated that use of the measure is becoming more widespread, being reported in 44% of articles describing macromolecular X-ray structures.
However, the usefulness of Rfree is limited by the fact that what is an "acceptable" value is often not evident. One would expect Rfree to always be higher than R even when there are no systematic errors in the model structure, but it is not clear how much higher it should be. At present we merely have a number of rules of thumb (Kleywegt & Brünger, 1996). Bacchi, Lamzin & Wilson (1996) use an extension of the self-validation Hamilton test to assess the significance of any observed drop in Rfree during refinement.
The random errors associated with the parameters of a refined crystal structure can be expressed in terms of standard uncertainties (s.u.'s). These can be calculated and quoted for the x-, y- and z-co-ordinates and root-mean-square displacements (U-values) of each atom in the model structure. In small-molecule refinement s.u.'s are routinely calculated and are indeed required by some journals as a precondition for publication. For large molecules such as proteins, however, the calculations have not been regularly performed to date, being considered too demanding of memory and computing time.
Most often s.u.'s are obtained during structure refinement by calculating the least-squares covariance matrix from the inverse of the normal equations matrix (Cruickshank, 1965), as implemented, for example, in the small-molecule refinement package SHELX (Sheldrick, 1985; Robinson & Sheldrick, 1988). For proteins the matrices involved are very large (typically several thousand rows and columns), and the accumulation of the normal matrix, which is an order n⁴ process, soon becomes unfeasible. Over the past ten years protein crystallographers have used various techniques for estimating the random errors in the co-ordinates of their model structures. These include the Luzzati plot (Luzzati, 1952) and the σA plot of Read (1986), both of which provide an estimate of the average positional error of a structure's co-ordinates. The Luzzati method assumes that the positional errors are normally distributed and that they alone account for the differences between Fo and Fc for all reflections (Cruickshank, 1996). Theoretical relationships between scattering angle and R-factor, assuming different values of the average error, are then used to estimate the average error from the observed data. More recent attempts to estimate errors have included: the 'residue R-factor' of Jones et al. (1991); the tabulated R indices of Elango & Parthasarathy (1990); the refinement protocol of Carson et al. (1994), which uses temperature factors, real-space fit residuals, geometric strains, dihedral angles and shifts from the previous refinement cycle; the discriminator of Sevcik et al. (1993), which assesses the likely error on each atom in terms of its temperature factor divided by its electron density in the final 2Fo - Fc map; an empirically derived six-parameter equation of Stroud and Fauman (1995); and the use of the diagonal elements of the inverse normal matrix in a final cycle of unrestrained least-squares refinement to give an estimate of the radial errors in atomic positions (Holland et al., 1990).
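The route from normal matrix to s.u.'s can be sketched on a toy linear least-squares problem. The design matrix, noise level and parameter count below are illustrative stand-ins for the linearised structure-factor equations, not a real refinement.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear model y = X p + noise, standing in for the linearised
# least-squares problem of refinement (illustrative dimensions only).
n_obs, n_par = 200, 5
X = rng.normal(size=(n_obs, n_par))
p_true = np.arange(1.0, n_par + 1.0)
sigma = 0.1
y = X @ p_true + rng.normal(0.0, sigma, size=n_obs)

# Normal matrix H = X^T X; its inverse, scaled by the residual variance,
# gives the least-squares variance-covariance matrix (Cruickshank, 1965).
H = X.T @ X
p_hat = np.linalg.solve(H, X.T @ y)
dof = n_obs - n_par
s2 = np.sum((y - X @ p_hat) ** 2) / dof
cov = s2 * np.linalg.inv(H)

# Standard uncertainties are the square roots of the diagonal elements.
su = np.sqrt(np.diag(cov))
```

For a protein, n_par runs into the tens of thousands, which is why accumulating and inverting H in full is the bottleneck the text describes.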
Various studies have been performed to assess the accuracy of these estimation procedures. Fields et al. (1994) performed two independent refinements, using different refinement programs, on a single set of synchrotron X-ray data at 1.6Å resolution. The r.m.s. differences between the final models were 0.08Å for all backbone atoms, while the estimated maximum average error from the Luzzati plot was 0.13Å. Daopin et al. (1994) compared four different estimation methods, two for calculating local errors and two (the Luzzati and σA plots) for overall errors, finding the methods to be in good agreement. On the other hand, Ohlendorf (1994) compared four independently refined X-ray crystal structures of human interleukin 1β, first re-refining them against a common data set to minimise the effects of different data sets and refinement protocols. He found that the final structures differed from one another by 0.84Å, which was roughly three times the error predicted by the Luzzati plots. Murshudov and Dodson (1997) have developed the theory of Cruickshank (1996) and used it to explore the relationship between temperature factors and positional errors estimated from a diagonal second-derivative matrix.
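The r.m.s. co-ordinate differences quoted in these comparisons are computed essentially as follows. The sketch uses hypothetical co-ordinate sets that are assumed to be already superposed and identically indexed.

```python
import numpy as np

def rms_difference(coords_a, coords_b):
    """Root-mean-square difference between two equally indexed
    co-ordinate sets (assumes the models are already superposed)."""
    d = coords_a - coords_b
    return np.sqrt(np.mean(np.sum(d * d, axis=1)))

# Two hypothetical backbone traces differing by small random shifts
# (0.05 Å per co-ordinate); values are illustrative only.
rng = np.random.default_rng(2)
model_a = rng.uniform(0.0, 50.0, size=(100, 3))
model_b = model_a + rng.normal(0.0, 0.05, size=(100, 3))
rmsd = rms_difference(model_a, model_b)
```

For real model pairs a superposition step (e.g. a least-squares fit of one model onto the other) would precede this calculation.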
Previous work of the applicants
Influence of errors on Rfree and the Rratio
Dodson, Kleywegt & Wilson (1996) have highlighted the need for more understanding of the behaviour of Rfree. In spite of the enthusiasm for its use, actual applications of Rfree have remained somewhat subjective without an understanding of its statistical basis. For example, if non-crystallographic symmetry constraints are relaxed during a structure refinement, how much should Rfree rise during subsequent refinement if the restrained model is correct? Without understanding how Rfree varies as a function of the number of restraints and/or number of parameters it is only possible to make rather subjective judgements. In order to use Rfree to discriminate against wrong structures, it is important to know how much larger than R it should be when the model is correct. Rfree is expected to be larger than the R-factor from the working set (Rinc) even when a correct model is fully refined. We have addressed some of these issues (Tickle, Laskowski & Moss, 1998b) and have shown that the expected value of the ratio of the R-factors calculated from the working set and the test set in an unrestrained refinement is given by

Rratio = Rfree/Rinc = sqrt[(f + m) / (f - m)]

and the corresponding quantity for restrained refinement is given by

Rratio = Rfree/Rinc = sqrt[(f + (m - r + Drest)) / (f - (m - r + Drest))]

where f is the number of X-ray observations, m is the number of parameters, r is the number of restraints and Drest is the contribution to the residual from the restraint terms. We analysed 725 structures in the Protein Data Bank (Bernstein et al., 1977) and showed that the observed Rratio values varied with the number of parameters and restraints in the way predicted by theory.
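Both expressions are straightforward to evaluate; the sketch below codes them directly, with an assumed, illustrative observation and parameter count.

```python
import math

def r_ratio_unrestrained(f, m):
    """Expected Rfree/Rinc for unrestrained refinement
    (Tickle, Laskowski & Moss, 1998b)."""
    return math.sqrt((f + m) / (f - m))

def r_ratio_restrained(f, m, r, d_rest):
    """Expected Rfree/Rinc when r restraints contribute d_rest
    to the residual; reduces to the unrestrained case when r = d_rest = 0."""
    m_eff = m - r + d_rest
    return math.sqrt((f + m_eff) / (f - m_eff))

# Illustrative counts: 40000 reflections, 8000 positional/thermal parameters.
ratio = r_ratio_unrestrained(40000, 8000)  # sqrt(48000/32000) ~ 1.22
```

Note that restraints act like a reduction in the effective number of parameters, so adding restraints (at fixed f and m) pulls the expected Rratio back towards 1.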
Random errors
Many crystallographers have used PROCHECK (Laskowski et al., 1993) to judge the quality of their protein structure models. PROCHECK allows the crystallographer to compare the variation in protein geometry with similar geometry observed in carefully refined small-molecule structures, but gives no guidance as to how this comparison should be made. In a recent paper (Tickle, Laskowski & Moss, 1998a) we derived the statistically expected values of geometric residuals arising from the least-squares refinement of macromolecular structures. We have shown that at the convergence of a refinement the expected value of a squared distance residual is given by
σ² - aᵀH⁻¹a

where σ is the observed standard deviation of the bond length or angle taken from small-molecule structures, H is the normal matrix and a is a row from the geometrical contribution to the normal matrix. In the same paper we calculated s.u.'s of positional parameters for two proteins, using the full matrix of the normal equations of least-squares refinement. We described how these calculations have been implemented in a least-squares refinement program and presented the results for two trial proteins, Gamma-B and Beta-B2 crystallin. In particular, we analysed the relationship between the positional s.u.'s, restraints and atomic B-values.
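Once H and a are available, the expected squared residual is a single quadratic form. The sketch below uses a small, randomly generated positive-definite H and an illustrative restraint row a, scaled so that the quadratic form stays below σ²; none of the values come from a real refinement.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy positive-definite normal matrix H and geometric restraint row a
# (illustrative values only, not from a real refinement).
n = 6
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)           # well conditioned, positive definite
sigma = 0.02                          # target s.d. of a bond-length restraint
a = 0.3 * sigma * rng.normal(size=n)  # scaled so a^T H^-1 a < sigma^2

# Expected squared geometric residual at convergence:
#   E[d^2] = sigma^2 - a^T H^{-1} a
# Solving H x = a avoids forming H^-1 explicitly.
quad = a @ np.linalg.solve(H, a)
expected_sq = sigma ** 2 - quad
```

Since H is positive definite, the quadratic form is positive, so the expected residual at convergence is always smaller than σ²: the restraints tighten the geometry below the small-molecule spread.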
Proposed Investigation
In order that protein crystallography should come of age, there is a clear need for quantitative estimates of all model errors. We need to be able to check that model structures are free from gross errors, particularly at lower resolution. We also need to know the precision of individual atomic co-ordinates at the end of a restrained refinement. These two needs correspond to the two parts of our work programme.
Systematic errors in X-ray structure determination
The first part of our investigation would explore the use of the Rratio for detecting problems during structure refinement. We would investigate:
Our work would use structures determined in house and would also draw on our experience of incorrect protein models that have occurred during structure determinations.
Estimation of random errors in protein co-ordinates
In the second part of our investigation we would investigate methods of determining errors in individual protein atomic co-ordinates that do not require full-matrix methods. Full-matrix calculations may not be routinely feasible for medium or large protein structures for a number of years: many hours of computing time on a fast workstation are needed to calculate the variance-covariance matrix for such structures. We shall therefore investigate approximate methods of co-ordinate error determination and check them against full-matrix estimates. We shall also look into statistics such as Rfree which may warn of overfitting. The following would be the objectives of our work.
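The comparison we propose, full-matrix s.u.'s against a cheap diagonal approximation, can be illustrated on a toy normal matrix (illustrative values only; a real block-matrix scheme would invert diagonal blocks rather than single elements).

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy normal matrix: diagonally dominant and positive definite, as
# restrained refinement matrices tend to be (illustrative values only).
n = 8
B = 0.1 * rng.normal(size=(n, n))
H = B @ B.T + np.eye(n)

# Full-matrix s.u. estimates: square roots of the diagonal of H^-1.
su_full = np.sqrt(np.diag(np.linalg.inv(H)))

# Cheap diagonal approximation, ignoring inter-parameter coupling;
# for positive-definite H, (H^-1)_ii >= 1/H_ii, so this approximation
# never overestimates the full-matrix s.u.
su_diag = np.sqrt(1.0 / np.diag(H))

# Relative discrepancy shows how much the approximation underestimates.
rel_err = (su_full - su_diag) / su_full
```

Checking such discrepancies against full-matrix results on small and medium proteins is exactly the kind of validation the proposed work would require.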
All software developed under this project would be made available to the crystallographic community through CCP4.
Resources requested
We are requesting funds for a post-doctoral research assistant who would be able to undertake code development for the above work. This work requires a person with a good background in protein refinement, knowledge of statistics and significant programming skills. The project is intensely computational and we therefore request small contributions to the salaries of our computer and network support staff. To date our CPU-intensive work has taken place on a Silicon Graphics Power Challenge at the University of Southampton, courtesy of Dr Steve Wood. This arrangement is not satisfactory and we are therefore requesting funds for a DEC workstation with 1.5 gigabytes of central memory. This would allow us to undertake the full-matrix error calculations on small and medium-sized proteins that would be necessary to help validate the s.u. estimates from block-matrix methods. We will need to back up large matrices for further analysis and are therefore requesting 40 DAT tapes.
References