Acta Cryst., Vol. D54, (1998), 547-557.

Rfree and the Rfree ratio. Part I: derivation of expected values of cross-validation residuals used in macromolecular least-squares refinement

Ian J Tickle, Roman A Laskowski & David S Moss

Department of Crystallography, Birkbeck College, Malet Street, London  WC1E 7HX, UK.

Abstract

The last five years have seen a large increase in the use of cross validation in the refinement of macromolecular structures using X-ray data. In this technique a test set of reflections is set aside from the working set and the progress of the refinement is monitored by the calculation of a free R-factor which is based only on the excluded reflections. This paper gives estimates for the ratio of the free R-factor to the R-factor calculated from the working set for both unrestrained and restrained refinement. It is assumed that both the X-ray and restraint observations have been correctly weighted and that there is no correlation of errors between the test and working sets. It is also shown that the least-squares weights that minimise the variances of the refined parameters, also approximately minimise the free R-factor. The estimated free R-factor ratios are compared with those reported for structures in the Protein Data Bank.

Introduction

One of the problems in macromolecular crystallography is that the crystallographer cannot always be sure that an apparently fully refined structure is free from large systematic errors. The agreement between the model of the molecular structure and the X-ray diffraction data from which it has been derived is measured by the crystallographic R factor, but it is well known that structures with acceptable values of this parameter can have significant errors (Brändén & Jones, 1990; Kleywegt & Jones, 1995a). The R-factor is susceptible to manipulation by leaving out weak data or by overfitting the data with too many parameters and so is not a completely reliable guide to accuracy. In small-molecule crystallography, where the number of X-ray intensity observations usually exceeds the number of parameters in the model by at least an order of magnitude, the R-factor is a more sure guide to both accuracy and precision.

In 1992 Brünger introduced the idea of an tex2html_wrap_inline1138 (Brünger, 1992, 1993), based on the standard statistical modelling technique of jack-knifing or cross-validatory residuals (McCullagh & Nelder, 1983). The tex2html_wrap_inline1138 is the same as the conventional R-factor, but based on a test set consisting of a small percentage (usually ~5-10%) of reflections excluded from a structure refinement. The remaining reflections included in the refinement are known as the working set. The tex2html_wrap_inline1138 value, unlike the R-factor, cannot be driven down by refining a false model because the reflections on which it is based are excluded from this process. tex2html_wrap_inline1138 is only expected to decrease during the course of a successful refinement. Consequently, a high value of this statistic and a concomitant low value of R may indicate an inaccurate model. The procedure assumes that the reflections removed for the cross-validation test have been randomly selected and have errors uncorrelated with those that remain in the set used in the refinement. This assumption may be partly invalidated by the presence of non-crystallographic symmetry. Ideally, the refinement should be repeated several times removing non-overlapping sets of reflections each time.

The tex2html_wrap_inline1138 is highly correlated with the phase accuracy of the atomic model (Brünger, 1992, 1993) and can detect various types of errors in the structure including phase errors and partial mistracing of the structure. It has also be used in evaluating different refinement protocols, such as the optimization of the weights used during refinement. It is particularly useful in preventing the overfitting of data (Kleywegt & Brünger, 1996).

Kleywegt & Jones (1995a, b) have shown that with low resolution data it is possible to completely mistrace a structure, deliberately tracing it backwards through the density, and still achieve an acceptable R factor. The tex2html_wrap_inline1138 , on the other hand, could not be duped so easily, and remained at a high value, close to that expected for a random set of scatterers, throughout the refinement.

The use of tex2html_wrap_inline1138 is thus a valuable guide to the progress of refinement, particularly for low-resolution data, and its use and publication are widely encouraged. A recent review (Kleywegt & Brünger, 1996) indicated that the use of the measure is becoming more widespread with it being reported in 44% of articles describing macromolecular X-ray structures.

However, the usefulness of tex2html_wrap_inline1138 is limited by the fact that what is an ``acceptable'' value is often not evident. One would expect tex2html_wrap_inline1138 to always be higher than R even when there are no systematic errors in the model structure, but it is not clear how much higher it should be. At present we merely have a number of rules of thumb (Kleywegt & Brünger, 1996).

Cruickshank has estimated that the expected value of the free R-factor (EFRF) is given by

displaymath72

where tex2html_wrap_inline1196 is the number of observations, tex2html_wrap_inline1198 is the number of parameters, and R is the conventional R-factor (Dodson, Kleywegt & Wilson, 1996). Bacchi, Lamzin & Wilson (1996) use this expression in an extension of the self-validation Hamilton test to assess the significance of any observed drop in tex2html_wrap_inline1138 during refinement.

The need for more understanding of the behaviour of tex2html_wrap_inline1138 was highlighted by Dodson, Kleywegt & Wilson (1996). In spite of the enthusiasm for its use, actual applications of tex2html_wrap_inline1138 have remained somewhat subjective without an understanding of its statistical basis. For example, if non-crystallographic symmetry (NCS) constraints are relaxed during a structure refinement, how much should tex2html_wrap_inline1138 rise during subsequent refinement if the restrained model is correct? Without understanding how tex2html_wrap_inline1138 varies as a function of the number of restraints and/or number of parameters it is only possible to make rather subjective judgements.

This paper begins to answer these questions by deriving the expected value of the free residual from which estimates of both tex2html_wrap_inline1138 and the ratio of tex2html_wrap_inline1138 to R are calculated.

Theory

Error assumptions

The expected or estimated values of residuals and R-factors will be derived on the assumption that the weights used in the structure refinement correctly reflect the errors which include not only experimental errors in measuring X-ray intensities but also errors in the functional form of the structure factor model which produce random and uncorrelated perturbations in the residuals. These model errors, which may arise from complicated atomic disorder, are an important source of random error in protein structures and are the reason why R-factors of refined macromolecular structures are usually higher than their small molecule counterparts. It is assumed in the derivation of the statistics in this paper that these random errors have been correctly accounted for in the weighting of the X-ray data and of any restraints in the refinement process.

Some model errors, such as the absence of a bulk solvent correction, lead to correlated model errors in reciprocal space. In the theory to be presented in this paper, such correlation could be accommodated if refinement took place with a weight matrix with off-diagonal terms but in practice computational difficulties preclude the use of such matrices in macromolecular refinement. All expressions derived in this paper that use a diagonal weight matrix assume that correlated errors are absent. Model errors such as missing or misplaced atoms are similarly assumed to be absent. On the other hand, no assumptions are made about the completeness or otherwise of the reflection dataset.

In order that the model errors in the free reflections are uncorrelated with those reflections used in refinement, the reflections in one set must not be related to those in the other set by crystallographic and non-crystallographic symmetry. No reflection in the free set must be related to one in the main set by pseudosymmetry. Care is needed when selecting reflections from datasets where Bijvoet pairs have been kept separate. Another case arises when there are domains or molecules in the asymmetric unit related by a non-crystallographic axis which is along a rational direction in the crystal lattice (e.g. the pseudo-dyads in rhombohedral insulin).

Definitions

For convenience the definitions of symbols commonly used in this paper and its Appendices are grouped together below. A superscript T denotes a matrix transpose.

Scalars

f
the number of structure amplitude observations included in the refinement (the working set)
m
the number of parameters being refined
n
the number of observations, including any restraints, in the refinement
p
the number of observations excluded from refinement (the test set)
r
the number of restraints included in the refinement (r=n-f)
tex2html_wrap_inline1236
the weight of the ith observation
tex2html_wrap_inline1240
the standard deviation of the ith observation
tex2html_wrap_inline1244
an X-ray residual
D
the weighted refinement residual
tex2html_wrap_inline1248
the contribution to D from the working set of f reflections
tex2html_wrap_inline1254
D based on a test set of p excluded reflections
tex2html_wrap_inline1260
the contribution to D from the r restraints ( tex2html_wrap_inline1266 )
|F|
the structure amplitude
G
the least-squares X-ray scale factor
tex2html_wrap_inline1272
the number of atoms being refined
tex2html_wrap_inline1274
the standard R-factor
tex2html_wrap_inline1278
the generalised R-factor
tex2html_wrap_inline1282 & tex2html_wrap_inline1284
R-factors based on all reflections in the working set
tex2html_wrap_inline1138 & tex2html_wrap_inline1290
R-factors based on a test set of p excluded reflections

Column matrices

tex2html_wrap_inline1296
the ith row of tex2html_wrap_inline1300
tex2html_wrap_inline1302
the ith row of tex2html_wrap_inline1306
f
the n observations employed in the refinement (structure amplitudes and restraints)
f
the least-squares estimate of f calculated at the convergence of the refinement
g
the p excluded observations
g
the least-squares estimate of g calculated from tex2html_wrap_inline1312 at the convergence of the refinement
x
the least-squares estimate of the m parameters
tex2html_wrap_inline1316
the least-squares residual associated with the excluded observations

Rectangular matrices

A
the least-squares design matrix of derivatives of order n x m
B
the p x m matrix analogous to A but involving the excluded observations
tex2html_wrap_inline1326
the p x p variance-covariance matrix of the excluded residuals ( tex2html_wrap_inline1332 )
H
the m x m normal matrix given by tex2html_wrap_inline1338
S
the p x n matrix given by tex2html_wrap_inline1344
W
the n x n symmetric weight matrix and tex2html_wrap_inline1350 is the VCM (variance-covariance matrix) of the included observations. This matrix reflects the random experimental and model errors.
tex2html_wrap_inline1352
the p x p symmetric weight matrix of tex2html_wrap_inline1358 and tex2html_wrap_inline1360 is the VCM of the excluded observations

The expected value of the free residual

The algebra for the derivation of the expected value of the free residual is presented in Appendix I. There it is first shown that the second moment matrix of residuals corresponding to a test set of observations excluded from least-squares refinement is equal to the sum of the variance-covariance matrix (VCM) of the omitted observations and the VCM of the corresponding quantities calculated from the parameter estimates at the convergence of a refinement. The expected value of the sum of squared residuals associated with the excluded observations is then obtained by taking matrix traces.

The theory then draws on results from an earlier paper (Tickle, Laskowski & Moss, 1997) where it is shown that when the weighting is on an absolute scale, the expected value of the sum of a subset of s weighted residuals at the convergence of a least-squares refinement is

  equation169

where the angle brackets denote statistical expectation.

When the above summation is over all n observations (reflections and restraints)

  equation180

and hence

displaymath189

In Appendix I equation (21) shows that the expected value of the residual associated with p excluded observations in the test set, is given by

  equation194

It should be noted that the derivation of this equation does not assume that the test set has been randomly selected from reciprocal space.

We now consider the f structure amplitude observations included in the refinement (the working set). From equation (1) the expected value of the residual associated with these observations at convergence is given by

  equation215

The similarities between equations (3) and (4) will be noted. In the linear approximation, both expressions for the expected value are independent of the observations (structure amplitudes and restraints).

Using these results it is possible to obtain estimates of the ratio of tex2html_wrap_inline1138 to R for models with only random uncorrelated errors. This tex2html_wrap_inline1138 ratio is estimated first for unrestrained refinement and then for refinements with geometrical restraints. These tex2html_wrap_inline1138 ratios are the starting point for understanding tex2html_wrap_inline1138 ratios where systematic errors are present.

Random exclusion of observations in unrestrained refinement

The calculation of the expected values of the residuals using equations (3) and (4) is a computationally intensive task in macromolecular refinement, involving the inversion of the normal matrix H. In general we need statistics which are more readily available during the refinement process.

We first consider the case of unrestrained refinement. In this case f=n and from equations (2) and (4)

  equation246

and

  equation255

Equation (3) gives the expected value of the free residual based on any given subset of p observations. In order to derive simpler expressions, we must now assume that the test reflections have been randomly chosen from reciprocal space. In this case when p is large we assume that the sums of the quadratic forms are approximately proportional to the number of observations involved. Thus

  equation260

Hence from equations (3), (5) and (7)

  equation278

From equations (6) and (8) the estimated ratio of the residuals for the case of random uncorrelated errors is

  equation284

Unlike equations (3) and (4), the above equation is independent of the scale of the weights.

The expected value of RGfree in unrestrained refinement

The expected values of the residuals derived in the previous section may be used to give estimated values of the generalised R-factor, tex2html_wrap_inline1392 , defined by

displaymath294

We define tex2html_wrap_inline1394 as tex2html_wrap_inline1396 based on p excluded reflections. If there are only random and uncorrelated errors then the numerator of tex2html_wrap_inline1394 may be approximated by tex2html_wrap_inline1402 and using equation (8) we can write

  equation307

We define tex2html_wrap_inline1404 as tex2html_wrap_inline1396 based on all f included reflections and if the weights have been scaled so that the residual in the numerator is equal to its expected value then

  equation316

The ratio of RGfree to RG in unrestrained refinement

From equations (9), (10) and (11) we have

displaymath328

Thus we may write the ratio of the generalised R-factors for the case of random uncorrelated errors as

  equation336

These results give the expected ratio of the generalised R-factor of the test set to that of the working set at the convergence of an unrestrained refinement in the case where there are only random uncorrelated errors which are correctly reflected in the weights employed. The results depend only on the number of reflections and the number of parameters.

RGfree and the RGfree ratio in restrained refinement

In a restrained refinement, such as is typical in macromolecular crystallography, the ratios derived in the previous two sections would only be applicable if tex2html_wrap_inline1290 were calculated from a random selection of residuals including both structure amplitude observations and restraints. Since R-factors are traditionally only based on structure amplitudes, the estimation of tex2html_wrap_inline1290 ratios for restrained refinement requires further analysis.

The number of observations this time is n and f of these are structure amplitudes, the balance consisting of r geometrical, thermal or other restraints which make a contribution tex2html_wrap_inline1260 to the minimised residual at convergence. From equation (1) we have

equation349

where the summation is taken over the restraint observations. Using this equation and equation (2) the summation over the structure amplitude observations can be written

  eqnarray359

This result and equations (4) and (7) give the following approximation for tex2html_wrap_inline1402

  displaymath378

Similarly an approximation to tex2html_wrap_inline1438 can be derived from equations (4), (7) and (14).

eqnarray387

Hence an estimate of tex2html_wrap_inline1290 at the convergence of a correctly weighted restrained refinement with only random uncorrelated errors is

displaymath398

The estimated ratio of the free residual to the included residual is given by

displaymath405

and hence

  equation411

Minimum variance weights minimise Rfree

The expected values derived above are only applicable for correctly weighted least-squares refinements. However tex2html_wrap_inline1138 has also been used to optimise the weighting of geometrical or temperature factor terms in refinement by adjusting the weights so as to minimise tex2html_wrap_inline1138 (Brünger, 1992, 1993). It is therefore of interest to enquire how tex2html_wrap_inline1138 responds to variations in weighting.

Appendix III shows that the weights tex2html_wrap_inline1450 which correctly reflect experimental and model errors, minimise the variance of both the refined parameters tex2html_wrap_inline1312 , and also the expected value of the sum of the squares of the unweighted residuals in the test set. Hence the choice of these weights approximately minimises tex2html_wrap_inline1138 . One method of estimating such weights has been described by Tickle, Laskowski & Moss (1998).

The use of standard R-factors

The R-factor expressions used so far in this paper have been based on the generalised R-factor, tex2html_wrap_inline1392 . However identical expressions can be derived for the estimated values of the standard free R-factor, tex2html_wrap_inline1138 , and the standard tex2html_wrap_inline1138 ratio. For example, in Appendix II the following estimate of the tex2html_wrap_inline1138 ratio in unrestrained refinement is derived which is based on standard R-factors.

displaymath429

The numerator and denominator in the above equation are first order binomial approximations to square roots and therefore another estimate of the standard tex2html_wrap_inline1138 ratio is

displaymath436

The right hand side is now the same as for the tex2html_wrap_inline1290 ratio (equation 12). These expressions may be compared with the estimate used by Bacchi, Lamzin & Wilson (1996) which in our notation is

displaymath444

The derivation of simple estimates for the standard tex2html_wrap_inline1138 and its ratio requires the assumption that the variances of all structure amplitudes are equal. The use of generalised R-factors, whose estimates do not require this assumption, is to be preferred.

Discussion

Restrained and unrestrained RGfree ratios

tex2html_wrap_inline1138 has been employed (Kleywegt & Brünger, 1996) to detect situations where addition of extra parameters to a refinement brings no significant improvement in the model. For example the extra parameters may concern the relaxation of non-crystallographic symmetry or the employment of individual atomic temperature factors. Relevant estimates of tex2html_wrap_inline1290 are summarised in Table 1 and help to make these discussions more quantitative. The Table gives the more important expressions derived earlier for a refinement at convergence with only random uncorrelated errors and weighting which correctly reflects the model and experimental errors.

The constrained and unrestrained statistics in the Table show what happens when constraints are imposed, assuming that the reduction in parameters does not introduce model error. It is seen that a reduction in the expected values of tex2html_wrap_inline1290 and the tex2html_wrap_inline1290 ratio takes place. A reduction also takes place when restraints are imposed, assuming again that the above conditions are not violated.

The meaning of the Rfree ratio

The estimated tex2html_wrap_inline1138 ratios derived in this paper are the values that should be achievable at the end of a structure refinement when only random uncorrelated errors exist in data and model provided that the observations have been properly weighted (see below).

A larger tex2html_wrap_inline1290 ratio than that predicted by these formulae may indicate that parameter shifts have taken place which have minimised the residual without significantly improving the model. This may arise when errors in the model are sufficiently large for the refinement to descend into a false minimum.

A smaller tex2html_wrap_inline1290 ratio than that predicted by these formulae may indicate that the refinement has not reached convergence since the initial value of the ratio immediately after the division of the data into a working set and a test set will be approximately unity. Interpretation requires care since a wrong model which has not been fully minimised against the data may produce the same ratio as a fully minimised correct model.

In macromolecular refinement, model error is usually the major contributor to R and tex2html_wrap_inline1138 at the end of the refinement. Paucity of diffraction data means that thermal and static disorder in the more mobile parts of the molecule cannot be accurately modelled. These model errors may cause random perturbations in the structure amplitude residuals which are indistinguishable from random experimental errors. The tex2html_wrap_inline1290 ratio will not be affected by the magnitude of these errors provided that the latter have independent effects on the included and excluded residuals.

Observed Rfree ratios from the Protein Data Bank

We have examined tex2html_wrap_inline1138 ratios for crystal structures in the Protein Data Bank (Bernstein et al, 1977) as at 1 June 1997. Figure 1 shows a plot of the tex2html_wrap_inline1138 ratio as a function of tex2html_wrap_inline1512 , where tex2html_wrap_inline1272 is the number of atoms included in the refinement and f is the number of reflections used, for 725 macromolecular structures for which all these values are reported. The points are colour coded according to resolution range.

We define the tex2html_wrap_inline1138 ratio as tex2html_wrap_inline1520 and tex2html_wrap_inline1522 . Values of y range from about 0.8 to 1.8. By substituting tex2html_wrap_inline1526 into equation (15), we have

  equation472

Figure 1 shows the curves corresponding to equation (16) for different values of a.

In order to make the comparison between experimental and theoretical values easier, a function of y was sought which is a linear function of x. By squaring and rearranging the terms in equation (16) we arrive at

displaymath480

where

displaymath482

Figure 2 shows a plot of z against x where the points are colour coded as in Figure 1. The coloured straight lines in Figure 2 are least-squares lines fitted to the data points in the particular resolution range represented by points of the same colour. For example, the pink triangular points represent data between 3 and 4Å\ resolution and the pink line is the least-squares line through the pink triangular points. The pecked black lines emanating from the origin in Figure 2 are plots of z=ax for the same values of a as shown in Figure 1.

We requested information from some of the authors whose structures were outliers in Figure 2. It became apparent that very unusual tex2html_wrap_inline1138 ratios are normally not the result of careful refinement protocols. The coloured lines were therefore plotted ignoring the points in the darker regions outside the sector bounded by the black lines of slopes 0.5 and 10. The choice of these slopes as cutoffs was somewhat arbitrary but the removal of these outliers caused the coloured lines to pass nearer to the origin.

The plots of z=ax represent refinement regimes with different numbers of parameters per atom. The gradients of the lines (a) increase with the number of parameters per atom. In the absence of relevant information in the Data Bank, it was assumed that in restrained refinements, tex2html_wrap_inline1548 and tex2html_wrap_inline1550 . These estimates ignore temperature factor restraints (if any) because our survey of the latter revealed widely different restraint protocols. Using these values, for restrained refinements

displaymath491

It can be seen that the z=2x line (isotropic temperature factors) passes through the constellations of orange crosses and pink triangular points, representing structures between 2.5 and 1.5Å\ resolution and is close to the green pecked line (2.5-2.0Å\ data). Similarly the z=x line (overall temperature factor) lies close to the pink line which is fitted to the 4 to 3Å data. Even in the absence of details of restraint procedures, the z=ax lines can be seen to pass through areas of the plot where the particular refinement regime is most relevant. The large spread of values about the straight lines is unlikely to be solely a statistical effect and may well say something about the quality of the refinements.

Comparison of the lines z=2x and z=4x, which differ only in respect of restraints, shows how restraints lower the tex2html_wrap_inline1138 ratio. Non-crystallographic symmetry (NCS, see Introduction) might give rise to lower than predicted tex2html_wrap_inline1138 ratios. However, a check on structures in our plots which exhibit NCS, did not reveal any obvious systematic effects.

Conclusion

Values of tex2html_wrap_inline1138 are affected by all types of error in the model and the data. The tex2html_wrap_inline1138 ratio, however, is independent of random errors and provides a statistic which can be compared with its theoretically estimated value and used to detect systematic model or weighting errors at the convergence of least-squares refinement. However, achievement of a theoretical value of the ratio is not by itself proof of the correctness of the model or of the quality of the refinement. Nevertheless it would still be helpful if refinement programs printed out the calculated and estimated value of the tex2html_wrap_inline1138 ratio using the expressions shown in Table 1. This would encourage a better understanding of tex2html_wrap_inline1138 than exists at present. Calculation of the observed and theoretical values of these ratios has already been implemented in the refinement program RESTRAIN (Driessen et al., 1989).

At low resolution the number of data excluded for cross-validation may be small and in these circumstances the precision of free residuals is important. This will be the subject of part II of this work.

The authors would like to thank Professor D W J Cruickshank FRS for his helpful comments on this paper.


Appendix I

Statistical properties of free residuals

Derivation of the expected value of the free residual

Here we derive the statistical expectation of the free residual tex2html_wrap_inline1402 , at the convergence of a least-squares refinement.

The normal equations of least-squares refinement at convergence can be written

If the errors in the observations and the model are not too large then a truncated Taylor expansion may be written about the expected values of the parameter vector tex2html_wrap_inline1576 and the observation vector tex2html_wrap_inline1578 .

displaymath571

The structure amplitudes and target distances corresponding to the excluded observations in a test set can be expressed in terms of the parameter estimates by the truncated Taylor expansion

eqnarray581

Thus the column of residuals of the excluded observations is given by

eqnarray597

Assuming that the errors in tex2html_wrap_inline1580 and tex2html_wrap_inline1358 are uncorrelated, the VCM of the residuals associated with the excluded observations tex2html_wrap_inline1332 is given by

  eqnarray617

Thus tex2html_wrap_inline1326 is equal to the sum of the VCM of the excluded observations and the VCM of the corresponding quantities calculated from the refined parameters. From equation (18) it follows that

  equation643

where tex2html_wrap_inline1588 is a unit matrix of order p. Using the fact that

displaymath657

and noting that the trace of a product of matrices is invariant under a cyclic permutation of the order of matrix multiplication, we take the trace of both sides of equation (19) to give

  eqnarray664

If the p excluded observations are structure amplitudes and we assume that the weight matrix is diagonal, then equation (20) can be written as

  equation697

where the angle brackets denote statistical expectation, tex2html_wrap_inline1402 is the expected value of the residual associated with the given p excluded observations and tex2html_wrap_inline1302 is the i th row of B. The expected value of a single weighted excluded residual in the summation is obtained by taking a single diagonal term from equation (19) which gives

displaymath721

Two notable assumptions have been made in the above analysis. First, it has been assumed that the test set residuals are uncorrelated with those in the working set. Equation (18) is invalid if this assumption is not true.

Second, it is assumed that the refinement has used a weight matrix tex2html_wrap_inline1450 which correctly reflects the experimental and model errors. Equation (21) is invalid if this assumption is not true. When a diagonal weight matrix is used, as is almost always the case in practice, correlated errors in reciprocal space will not be correctly represented. In Appendix III it is shown that the use of weight matrices which do not correctly account for the errors will lead to test set residuals with larger variance-covariance matrices.


Appendix II

The ratio of Rfree to R in unrestrained refinement

The derivations in the body of this paper have been given in terms of the generalised R-factor. In this Appendix we derive the expected values of residuals expressed as the sum of unweighted absolute differences. Hence we obtain an estimate of the tex2html_wrap_inline1138 ratio expressed in terms of the standard R-factor which is defined as

displaymath739

The variance of an included residual at the convergence of a refinement (see Appendix II in Tickle, Laskowski and Moss, 1997) can be written as

displaymath751

To make further progress we need to assume a distribution for each residual tex2html_wrap_inline1614 . If a normal distribution is assumed, we can write

  equation766

and hence

displaymath779

The square root in the above summation may be expanded binomially to the first order giving

displaymath793

We now assume that tex2html_wrap_inline1240 is the same for all observations so that we can write

displaymath807

Using equation (5) to simplify the right-hand side, we can write

  equation822

By analogous reasoning the sum over the p excluded reflections can be approximated as

  equation833

We have now derived approximations for the numerators of tex2html_wrap_inline1282 and tex2html_wrap_inline1138 . Their denominators are, on average, proportional to the number of reflections in the respective summations. Hence dividing equation (24) by equation (23) and multiplying by f/p we have

displaymath429


Appendix III

Minimum variance weights minimise the expected value of Rfree

First we show that the least-squares refinement using a weight matrix tex2html_wrap_inline1450 which is the inverse of the VCM of the observations, minimises the VCMs of both the refined parameters tex2html_wrap_inline1312 and the free residuals tex2html_wrap_inline1636 .

Consider a function tex2html_wrap_inline1638 which is a function of the refined least-squares parameters which is linear to within a first order Taylor approximation.

displaymath865

Now consider two column matrices tex2html_wrap_inline1640 and tex2html_wrap_inline1642 , defined below, which are both unbiased estimates of tex2html_wrap_inline1638 .

   eqnarray872

where

   eqnarray886

tex2html_wrap_inline1646 is any weight matrix and tex2html_wrap_inline1648 . From equations (27) and (28) and from the definitions of tex2html_wrap_inline1650 and tex2html_wrap_inline1652

  equation904

We wish to show that the VCM of tex2html_wrap_inline1640 is smaller than that of tex2html_wrap_inline1642 . Because tex2html_wrap_inline1640 and tex2html_wrap_inline1642 are unbiased estimators, tex2html_wrap_inline1664 and thus the VCM of tex2html_wrap_inline1642 can be expressed as

  eqnarray911

The last two terms of the above equation are the transpose of each other and are each zero matrices as shown by the following analysis which uses equations (25), (26), (27) and (29).

eqnarray946

Hence from equation (30)

displaymath973

Since the VCMs are positive definite

displaymath987

Thus the VCM of tex2html_wrap_inline1640 which is calculated with tex2html_wrap_inline1450 is less than the VCM of tex2html_wrap_inline1640 which is calculated with another weight matrix tex2html_wrap_inline1646 . Making the substitution tex2html_wrap_inline1684 and setting tex2html_wrap_inline1686 to a unit matrix, this analysis shows that by using the weight matrix tex2html_wrap_inline1450 , we minimise the variance of tex2html_wrap_inline1312 . Substituting tex2html_wrap_inline1692 and tex2html_wrap_inline1694 , the same analysis shows that tex2html_wrap_inline1450 also minimises the VCM of tex2html_wrap_inline1700 . From equation (18) the VCM of the residuals associated with the excluded observations tex2html_wrap_inline1326 is the sum of the constant matrix tex2html_wrap_inline1360 and the VCM of tex2html_wrap_inline1700 . Hence tex2html_wrap_inline1326 and its trace are also minimised by choosing tex2html_wrap_inline1450 as the weight matrix.

The trace of tex2html_wrap_inline1326 is the expected value of the unweighted sum of squared residuals

displaymath1007

where the summation is taken over the p reflections in the test set. By using the normal approximation in equation (22) we can say that the sum of absolute differences

displaymath1017

and hence tex2html_wrap_inline1138 are approximately minimised by choosing tex2html_wrap_inline1450 as the least-squares weight matrix.


References

Bacchi A, Lamzin V S & Wilson K S (1996). A self-validation technique for protein-structure refinement: the extended Hamilton test. Acta Cryst., D52, 641-646.

Bernstein F C, Koetzle T F, Williams G J B, Meyer E F Jr, Brice M D, Rodgers J R, Kennard O, Shimanouchi T & Tasumi M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 112: 535-542.

Brändén C-I & Jones T A (1990). Between objectivity and subjectivity. Nature, 343, 687-689.

Brünger A T (1992). Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature, 355, 472-475.

Brünger A T (1993). Assessment of phase accuracy by cross validation: the free R value. Methods and applications. Acta Cryst., D49, 24-36.

Dodson E, Kleywegt G J & Wilson K (1996). Report of a workshop on the use of statistical validators in protein X-ray crystallography. Acta Cryst., D52, 228-234.

Driessen H, Haneef M I J, Harris G W, Howlin B, Khan G & Moss D S (1989). RESTRAIN: restrained structure-factor least-squares refinement program for macromolecular structures. J. Appl. Cryst., 22, 510-516.

Kleywegt G J & Brünger A T (1996). Checking your imagination: applications of the free R value. Structure, 4, 897-904.

Kleywegt G J & Jones T A (1995a). Where freedom is given, liberties are taken. Structure, 3, 535-540.

Kleywegt G J & Jones T A (1995b). Braille for pugilists. In Making the Most of Your Model, eds. Hunter W N, Thornton J M & Bailey S, SERC Daresbury Laboratory, Warrington, UK, pp. 11-24.

McCullagh P & Nelder J A (1983). Generalised Linear Models. pub. Chapman & Hall, London, UK, pp. 209-221.

Tickle I J, Laskowski R A & Moss D S (1998). Error estimates of protein structure coordinates and deviations from standard geometry by full matrix refinement of tex2html_wrap_inline1730 B- and tex2html_wrap_inline1732 B2-crystallin. Acta Cryst., D54, 243-252.


Figures

Figure 1

Figure 1 PostScript

Figure 1. Plot of the Rfree/R ratio as a function of the ratio for 725 macromolecular structures in the Protein Data Bank, where Na/f is the number of atoms included in the refinement and f the number of reflections used. The data points are colour-coded according to their resolution range, as shown in the key on the bottom right of the graph. The high resolution data points tend to be close to the vertical axis with the other points tending to spread further to the right the lower their resolution, as might be expected. Also shown are four dotted curves corresponding to different values of the variable a. The curves shown are for: a = 1 which corresponds to 3 parameters per atom (ie restrained refinement of atomic coordinates only, plus an overall temperature factor); a = 2 is for 4 parameters per atom (restrained refinement of coordinates plus individual isotropic temperature factors); a = 4 represents unrestrained refinement with 4 parameters per atom; and a = 7 represents 9 parameters per atom (restrained anisotropic refinement).

Figure 2

Figure 2 PostScript

Figure 2. Plot of the same data as in Fig.1 but with the vertical axis representing z which is a linear function of Na/f, where z defntn and y defntn . Here the curves from Fig.1, which corresponded to different values of a, are now straight lines. Again, the data points are colour-coded according to resolution range, as shown in the key on the right. Also shown are 5 coloured lines representing least-squares lines fitted to the data points in each of these five different resolution ranges. The lines are identified by the appropriate point-markers in square brackets outside the graph border.The grey regions show the regions from which the data points have been excluded in the least-squares lines calculations, namely points outside the sector bounded by the lines corresponding to a = 10 and a = 0.5.


Table 1

  table1080



Document created using latex2html