Chapter 5 - Estimates, Confidence Intervals, and Hypothesis Tests
In this chapter, we discuss the fitting criterion that NONMEM uses, parameter estimates, and standard error estimates. We then discuss how to form confidence intervals for parameters and do hypothesis tests with NONMEM.
In principle, all fitting procedures attempt to adjust the values of the parameters of the model to give a "best fit" of the predictions to the actual observations. The parameter values that accomplish this are called the parameter estimates, and are denoted here as θ̂, Ω̂, and Σ̂. Methods differ in how they define "best". The criterion that NONMEM uses is a Least Squares (LS) type criterion. The form of this criterion varies as the error model varies, and as population models with multiple random effects must be considered. We briefly discuss these various criteria next, to give the reader a feel for what NONMEM is doing. A detailed knowledge of the statistical basis for the choice of fitting criterion is not necessary either to use or interpret NONMEM fits. In this chapter, a fixed effects parameter will be denoted by θ; the distinction between individual fixed effects parameters and population fixed effects parameters will not be important here.
For the Additive error model (3.4), the Ordinary Least Squares (OLS) criterion chooses the estimate θ̂ so as to make the sum of squared (estimated) errors as small as possible. These estimates cause the prediction, here denoted ŷ, to be an estimate of the mean value of y, which is intuitively appealing. The prediction is obtained by computing the value for y under the model with the parameters set to their estimated values and ε set to zero†.
----------
† ε, not η, since we follow the NONMEM convention and, when discussing individual type data as here, use ε to denote the random effects of a single level that appear in the model, those in the residual error model.
----------
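For concreteness, the OLS criterion described above can be written as the following minimization (a sketch; the notation yᵢ for the ith of n observations, xᵢ for its design variables, and f(xᵢ, θ) for the corresponding model prediction is introduced here only for illustration):

    \[
      \hat{\theta}_{\mathrm{OLS}} \;=\; \arg\min_{\theta} \sum_{i=1}^{n} \bigl( y_i - f(x_i,\theta) \bigr)^{2} .
    \]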
The simple OLS criterion just defined becomes inefficient and is no longer the "best" one to use when the error model is other than the Additive error model. It treats all estimated errors as equally important (i.e., a reduction in the magnitude of either of two estimated errors of the same size is equally valuable, in that either reduction decreases the sum of squared errors by the same amount), and this results in parameter estimates that cause all errors to have about the same typical magnitude, as assumed under the Additive model. The CCV error model, though, says that the typical magnitude of an error varies monotonically with the magnitude of the (true) prediction of y. In principle, Weighted Least Squares (WLS) gives a fit more commensurate with the CCV or other non-Additive error model. WLS chooses θ̂ as that value of θ minimizing

    Σᵢ Wᵢ (yᵢ − ŷᵢ)²        (5.1)

where ŷᵢ, the prediction of the ith observation yᵢ, is computed at the candidate value of θ, so that yᵢ − ŷᵢ is the ith estimated error.
Each Wᵢ is a weight which, ideally, is set proportional to the inverse of the variance of yᵢ. In the CCV model this variance is proportional to the square of the prediction (evaluated at the true value of θ). Use of such weights will down-weight the importance of estimated squared errors associated with large values of the prediction and promote the relative contribution of those associated with small values of the prediction.
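As an illustration of how CCV-type weighting in (5.1) works in practice, here is a minimal Python sketch (not NONMEM code); the mono-exponential prediction function, the parameter values, and the data are hypothetical:

    import numpy as np

    def predict(theta, t):
        # Hypothetical structural model: mono-exponential decline;
        # theta = (initial concentration, elimination rate constant)
        a, k = theta
        return a * np.exp(-k * t)

    def wls_objective(theta, t, y):
        # WLS objective of (5.1) with CCV-type weights, W_i proportional to
        # 1 / prediction^2, so errors at large predictions are down-weighted.
        yhat = predict(theta, t)
        weights = 1.0 / yhat**2
        return np.sum(weights * (y - yhat)**2)

    # Hypothetical observed concentrations at several sampling times
    t = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
    y = np.array([9.1, 8.0, 6.3, 4.1, 1.7])
    print(wls_objective((10.0, 0.2), t, y))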
In many cases, users can supply approximate weights, and the WLS objective function can be used as stated in (5.1). When, as with the CCV model for example, the ideal weights depend on the true values of parameters, these true values can be replaced by initial estimates, and then the WLS objective function as given in (5.1) can be minimized. Alternatively, instead of viewing (5.1) as a function of θ only through the estimated errors' dependence on θ, it can be viewed as a function of θ through both that dependence and also through the ideal weights' dependence on θ. The entire function can then be minimized with respect to θ. That this creates a problem is most easily seen when the error model contains a parameter which is not itself a parameter of the structural model, but which, nonetheless, must be regarded as an element of θ. Such an error model is the Power Function model of (3.7), and the "extra" parameter is the power, denoted here by p. The WLS objective function with the reciprocal of the error variance substituted for Wᵢ is†

    Σᵢ (yᵢ − ŷᵢ)² / (σ² ŷᵢ^(2p))        (5.2)
----------
† Again, we call attention to the symbol used for the random effects parameter: the term σ² appears in the objective function, (5.2), not ω², because we are discussing individual type data, not population type data.
----------
In this case, if p were set to a very large number while the other parameters in θ were only such as to make all the predictions ŷᵢ greater than 1, then all the terms ŷᵢ^(2p) would be very large, and the summation would attain a very small value. (The value of σ² is irrelevant to the minimization with respect to θ.) Thus, all elements in θ other than p would be indeterminate (as long as they were such that all ŷᵢ were greater than 1); a most unsatisfactory state of affairs.
There is a way to deal with this problem that preserves the spirit of least-squares fitting, and NONMEM uses it. In essence, it adds to the WLS objective function a term proportional to the sum of the logarithms of the error variances. Thus a penalty is paid for increasing the error variances without a concomitant decrease in the estimated errors themselves. This modified objective function is called the Extended Least Squares (ELS) objective function. It is minimized with respect to all parameters of the structural and error models simultaneously (in the current example, θ and σ², as p can be considered an element of θ). Note that the objective function may be negative. This has no particular significance.
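To make the penalty concrete, one common way of writing the ELS objective function for this example is sketched below (the notation follows (5.2); additive constants that do not affect the minimization are omitted):

    \[
      O_{\mathrm{ELS}}(\theta,\sigma^{2})
      \;=\; \sum_{i}\left[
         \frac{(y_i-\hat{y}_i)^{2}}{\sigma^{2}\,\hat{y}_i^{\,2p}}
         \;+\; \ln\!\bigl(\sigma^{2}\,\hat{y}_i^{\,2p}\bigr)
      \right].
    \]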
The complications arising from a population model are due entirely to the random interindividual effects occurring in the parameter model. To deal with this, NONMEM uses an approximation to the true model. The approximate model is linear in all the random effects. For this linearized model, the vector of mean values for the observations from an individual is the vector of true predictions for these observations. These predictions are obtained from the model by setting the parameters to their true values and by setting all the elements of both η and ε to zero. In other words, these are the true predictions for the mean individual with fixed effects equal to those of the individual. For this linearized model it is also possible to write a formula for the variance-covariance matrix of the observations from the individual. This matrix is a function of the individual's fixed effects and the population parameters θ, Ω, and Σ. Finally, the ELS objective function discussed above is generalized to be a sum over individuals, rather than observations, where each individual's term of the sum involves a squared error between the vector of that individual's observations and the associated vector of predictions, weighted by the inverse of the associated variance-covariance matrix for the individual.
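Schematically (a sketch of the form just described, not NONMEM's exact internal expression), with y_j denoting the vector of observations from the jth individual, f_j the corresponding vector of predictions for the mean individual with that individual's fixed effects, and C_j(θ, Ω, Σ) the model variance-covariance matrix of y_j, the population objective function is a sum over individuals:

    \[
      O(\theta,\Omega,\Sigma)
      \;=\; \sum_{j}\left[
         (y_j - f_j)^{\mathsf T}\, C_j^{-1}\, (y_j - f_j)
         \;+\; \ln\det C_j
      \right].
    \]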
It is useful to consider how to estimate parameters that do not appear in the model we use to fit the data, but are, instead, functions of the parameters that do (e.g., the half-life t½, when the rate constant of elimination, k, is the model parameter).
It is always possible, of course, to parameterize the model in terms of the function of interest. For example, we have already seen (Chapters 2 & 3) that we may use the function (parameter) clearance in the one-compartment model instead of k. However, we may be interested in the values of several alternative parameterizations (e.g., we may want to know k, clearance, and half-life). Rather than rerun the same problem with several alternative parameterizations, we can use the fact that the LS estimate of a function of the parameters is given by the same function of the LS parameter estimates. Formally, if h(θ) is the function of interest, then its LS estimate is h(θ̂). E.g., letting the model parameter be k, the function of interest be the half-life h(k) = ln(2)/k, and k̂ be the LS estimate of k, then the LS estimate of the half-life is ln(2)/k̂.
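As a small worked illustration (the numerical value of k̂ here is hypothetical):

    \[
      \hat{k} = 0.10\ \mathrm{hr}^{-1}
      \quad\Longrightarrow\quad
      \widehat{t_{1/2}} \;=\; \frac{\ln 2}{\hat{k}} \;=\; \frac{0.693}{0.10} \;\approx\; 6.9\ \mathrm{hr}.
    \]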
Clearly, it is almost impossible for the estimates to actually be the true values. The question is: how far are the true values from the estimates? To discuss this question, imagine replicating the entire experiment (gathering new data, but keeping the design, x, fixed) multiple times. Also, for simplicity, imagine that the model has only one scalar parameter, θ, and that its true value is known. If, after each replication, the estimate of θ is recorded, and a histogram of these values is plotted, one might see something like figure 5.1A or 5.1B.
Figure 5.1. Two hypothetical histograms of estimates of a single parameter upon replication of a given experiment. Left panel (A): the estimates have small variance (spread) but are biased (the mean of the estimates differs from the true value, θ); right panel (B): the estimates have large variance but are relatively unbiased.
The difference between the estimate and the true value, θ̂ − θ, obviously differs from replication to replication. Let this difference be called the estimation error. We cannot know the estimation error of any particular estimate (if we could, we would know the true value itself, by subtraction), but we can hope to estimate the mean error magnitude. Since errors can be positive or negative, a measure of magnitude that is unaffected by sign is desirable. This is traditionally the Mean Squared Error (MSE). The MSE can be factored into two parts:

    MSE = bias² + SE²

where the bias is the mean (signed) difference between the estimates and the true value, and SE is the standard error of the estimates (SE² is the variance of the estimates). As illustrated in figure 5.1, the MSE can be about the same for two types of estimates while both their bias and SE differ. It is very hard to estimate the bias of an estimator unless the true parameter value is, in fact, known. This is not true of the SE: the standard deviation of the distribution of estimates of a parameter on replication is the SE; no knowledge of the true value of the parameter is required. In many situations, LS estimators of fixed effects parameters are unbiased, although in nonlinear problems, such as most pharmacokinetic ones, this cannot be assured.
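The replication idea behind figure 5.1 is easy to mimic by simulation. The following Python sketch (the model, parameter values, and design are all hypothetical) repeatedly simulates an experiment, records the least-squares estimate, and then summarizes the resulting distribution of estimates by its bias, SE, and MSE:

    import numpy as np

    rng = np.random.default_rng(0)

    k_true = 0.2                           # hypothetical true rate constant
    t = np.array([1.0, 2.0, 4.0, 8.0])     # hypothetical sampling design
    sigma = 0.5                            # hypothetical additive error SD

    def simulate_and_estimate():
        # Simulate one replication of the experiment and return the OLS
        # estimate of k for the model y = 10*exp(-k*t) + error.
        y = 10.0 * np.exp(-k_true * t) + rng.normal(0.0, sigma, size=t.size)
        grid = np.linspace(0.01, 1.0, 1000)            # crude grid search
        sse = [np.sum((y - 10.0 * np.exp(-k * t))**2) for k in grid]
        return grid[int(np.argmin(sse))]

    estimates = np.array([simulate_and_estimate() for _ in range(2000)])
    bias = estimates.mean() - k_true
    se = estimates.std()
    print("bias:", bias, "SE:", se, "MSE:", bias**2 + se**2)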
Figure 5.1 illustrates the distribution of parameter estimates that might result if an experiment were replicated. The bias and spread depend on the estimation method, the design of the experiment (the x's, which implicitly include the number of observations, n), and on the true parameter values, including the variances (and covariances) of the random effects influencing y. If, for example, more observations were obtained in each experiment (more individuals in a population study), the spread of the estimates (one from each experiment) would decrease until, in the limit, if an infinite number of observations were obtained in each experiment, every estimate would be the same (equal to the true value plus the bias of the estimator). Thus, the distribution of the estimate tells us nothing about biology or measurement error, but only about the precision (SE) of the estimate itself.
In contrast, Ω and Σ tell us about unexplained (or random) interindividual variability (biology) and residual error magnitude (biology plus measurement error), respectively, not about how precisely we know these things. No matter how many observations we make, interindividual variability will remain the same size (but the variability of our estimate of its size will decrease), as will the measurement error variability of a particular instrument.
It is very important not to confuse variability (e.g., between individuals) in a model parameter with variability in the estimate of that parameter, despite the fact that the terms we use to describe both variabilities are similar. Thus an element of η, say ηⱼ, has a variance, ωⱼ², while the estimate of ωⱼ² also has a variance, given by the square of the standard error for that estimate. Indeed, the use of the term "standard error" rather than "standard deviation" to name a measure of the spread in the distribution of the parameter estimate, rather than in the parameter itself, helps call attention to the distinction between these two types of things.
Acknowledging that any particular estimate from any particular experiment is unlikely to equal the true parameter value implies that we should be interested in "interval" estimates of parameters as well as (instead of?) point estimates. An interval estimate of a parameter is usually a range of values for the parameter, often centered at the point estimate, such that the range contains the true parameter value with a specified probability. The probability chosen is often 95%, in which case the interval estimate is called the 95% Confidence Interval (CI).
A CI is often based only on the parameter estimate and its SE. In the next sections we discuss three questions about such CIs a little further. (i) How to estimate the SE from a single set of data (we cannot replicate our experiment many times and construct a histogram as in figure 5.1). (ii) Given an estimate of the SE, how to use that number to compute a (95% confidence) interval with a 95% chance of containing the true parameter value. (iii) Given an estimate of the SE, how to compute a confidence interval for a function of the parameter.
Remarkably, the SE of a parameter estimate can be estimated using only the data from a single experiment. The idea is that the data give us estimates of the variances of all random effects in our model, from which we can estimate the variability in future data (if we were to replicate the experiment). That is, the SE of the estimates on replication depends only on quantities we either know or have estimates of: the design (the x's), the number of observations (n), and the variances of all random effects.
It is a little beyond the scope of this discussion to give the method by which NONMEM actually estimates the SE of a parameter estimate; however, examples of how this can be done are found in any statistical textbook on regression. NONMEM presents the estimated standard error for each parameter of the model (including the random effects parameters, the elements of Ω and Σ) as part of its output.
Statistical theory tells us not only how to compute the SE of a parameter estimate, but also that for a LS estimate (and many other kinds of estimates), the shape of the distribution of the estimates is approximately Normal if the data set is large enough. This means that we may use percentiles of the Normal distribution to compute confidence interval bounds (when n is small, the t distribution is often used instead; this is discussed in statistics texts). In general, a 100(1−α)% confidence interval for a single parameter, θ say, is computed as

    θ̂ ± z(1−α/2) · SE(θ̂)

where z(1−α/2) denotes the 100(1−α/2) percentile of the standard Normal distribution. As previously noted, α is often chosen to be .05, in which case z(1−α/2) is approximately 2.
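A minimal Python sketch of this computation (the estimate and standard error used here are hypothetical numbers, not NONMEM output):

    from scipy.stats import norm

    theta_hat = 5.2    # hypothetical parameter estimate
    se = 0.6           # hypothetical standard error
    alpha = 0.05

    z = norm.ppf(1 - alpha / 2)          # about 1.96 for a 95% interval
    lower, upper = theta_hat - z * se, theta_hat + z * se
    print(f"{100*(1-alpha):.0f}% CI: ({lower:.2f}, {upper:.2f})")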
As discussed above, one can reparameterize the model in terms of the new parameter, and NONMEM will then estimate its standard error. If re-running the fit presents a problem, there is a simple way to compute a confidence interval for a function h of a single parameter θ. If the lower and upper bounds of a 100(1−α)% confidence interval for θ are denoted θ_L and θ_U, respectively, then a 100(1−α)% confidence interval for a monotonically increasing function h(θ) has lower and upper bounds h(θ_L) and h(θ_U), respectively (for a monotonically decreasing function, such as the half-life as a function of k, the two bounds exchange roles). This confidence interval, however, is not necessarily centered at h(θ̂).
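For example (hypothetical numbers), if a 95% confidence interval for the elimination rate constant k is (0.08, 0.12) hr⁻¹, the corresponding interval for the half-life ln(2)/k is obtained from the interval endpoints, with the bounds swapped because half-life decreases as k increases:

    \[
      \left( \frac{\ln 2}{0.12},\ \frac{\ln 2}{0.08} \right)
      \;\approx\; (5.8,\ 8.7)\ \mathrm{hr}.
    \]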
A truly new feature introduced by multiple parameters is the phenomenon of correlation among parameter estimates. NONMEM outputs a correlation matrix for the parameter estimates. The (i,j)th element of the matrix is the correlation between the ith and jth parameter estimates. A large correlation (e.g., one near ±1) means that the conditional distribution of the ith estimate, given a fixed value of the jth estimate, depends considerably on this fixed value. Sometimes each parameter estimate of a pair that is highly correlated has a large standard error, meaning that neither parameter can be well estimated. This problem may be caused by data that do not distinguish among the parameters very well; a simpler model, a different design, or more data might permit more precise estimates of each.
As a simple example, imagine a straight line fit to just two points, both at the same value of x: neither slope nor intercept can be estimated at all as long as the other is unknown, but fixing either one (simplifying the model) determines the other. Both parameters could be estimated by observing another point at some other value of x (more data), or, still using just two points, by placing these points at two different values of x (modifying the design). Thus, when standard errors are large, it is useful to examine the correlation matrix of parameter estimates to see, in particular, if some simplification of the model may help.
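The straight-line example can be checked numerically; in the sketch below (hypothetical numbers), the design matrix for two points at the same x is singular, while moving one point to a different x makes both parameters estimable:

    import numpy as np

    def design_matrix(x):
        # Columns correspond to the intercept and the slope of a straight line.
        return np.column_stack([np.ones_like(x), x])

    same_x = design_matrix(np.array([2.0, 2.0]))   # two points, same x
    diff_x = design_matrix(np.array([2.0, 5.0]))   # two points, different x

    print(np.linalg.matrix_rank(same_x))   # 1: slope and intercept not separable
    print(np.linalg.matrix_rank(diff_x))   # 2: both parameters estimable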
There is an approximate formula for computing a standard error, and hence a confidence interval, for a function of several parameters (e.g., a confidence interval for half-life when the estimated parameters are clearance and volume of distribution). It uses the standard errors of the parameter estimates and the correlations between the parameter estimates. However, discussion of this formula is beyond the scope of this introduction. If a confidence interval for a function of several parameters is desired, it is often more convenient to re-fit the data with the model reparameterized so that the function appears as an explicit parameter.
Before going into details, a note of caution is in order about hypothesis testing in general. The logic of hypothesis testing is that one sets up a hypothesis about a parameter’s value, called the null hypothesis, and asks if the data are sufficiently in conflict with it to call it into question. If they are, one rejects the null hypothesis. However, logically, if they are not, one has simply failed to reject the null hypothesis; one does not necessarily have sufficient data to accept it. An extreme example will make this clear. Let the null hypothesis be any assertion at all about the state of nature. Gather no evidence bearing on the question. Clearly, the evidence (which is void) is insufficient to reject the null hypothesis, but just as clearly, in this case the evidence provides no support for it either.
In less extreme cases there is a way to view failure to reject as lending some support to the null hypothesis, but the matter is problematic. It is somewhat less problematic to use a confidence interval to quantify support for a null hypothesis. A null hypothesis is an assertion that a parameter's true value is found among a set of null values. A confidence interval puts reasonable bounds on the possible values of a parameter. One can then say that the evidence (the data from which the parameter estimate is derived) does support a null hypothesis (about the value of the parameter) if the null values are included in the interval, and that the evidence fully supports the null hypothesis if all nonnull values lie outside it. An example will help make this clear.
Consider the assertion that mean drug clearance in adults varies linearly with the weight of the individual to a clinically significant degree. Formally, this can be put as a statement about a regression coefficient in a model such as

    Cl = θ₁ + θ₂ (WT − 70)

where Cl is the typical value of clearance (ml/min) and WT is weight (kg).
The null hypothesis might be that θ₂ is close to zero, i.e. that it is not so different from zero as to be clinically significant. To make this precise, suppose that we know that mean clearance for a 70 kg person (i.e., θ₁) is about 100 ml/min. If θ₂ were .20 ml/min/kg or less, a 50 kg increment (decrement) in weight from 70 kg would be associated with at most a 10% change in total clearance. This is clinically insignificant, so the appropriate null values for θ₂ might be 0.0 to .20, assuming, of course, that a physical lower bound for the parameter is zero. (More usually in statistical discussions a set of null values consists of a single number, e.g. 0.) If the confidence interval for θ₂ only includes null values (e.g., it is .10 to .15), one might then safely conclude that weight, if it has any effect at all, has no clinically significant effect, and that the data fully support the null hypothesis. If the confidence interval includes null values and others (e.g., it is 0.0 to .60), one would conclude that there is some support for the null hypothesis, but that there is also some support for rejecting it. In this case the data are insufficient to allow outright acceptance or rejection. If the confidence interval includes no null values (e.g., it is .80 to 1.3), one would reject the null hypothesis and conclude that weight has a clinically significant (linear) effect on clearance.
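The arithmetic behind choosing .20 ml/min/kg as the upper null value is simply:

    \[
      0.20\ \tfrac{\mathrm{ml/min}}{\mathrm{kg}} \times 50\ \mathrm{kg}
      \;=\; 10\ \mathrm{ml/min}
      \;=\; 10\%\ \text{of}\ 100\ \mathrm{ml/min}.
    \]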
For these reasons, we urge caution when performing hypothesis tests and suggest that confidence intervals are often more useful. Nonetheless, the popularity of hypothesis tests requires that they be performed, and we now describe two methods for doing so, the first somewhat more approximate and less general than the second, but easier to carry out.
A straightforward way to test a null hypothesis about the value of a parameter is to use a confidence interval for this purpose. In other words, if the confidence interval excludes all null values, then the null hypothesis is rejected. As described in Section 4.2.2, such a confidence interval is based on the estimated standard error. This method generalizes to a hypothesis about the values of several parameters simultaneously, but this is beyond the scope of this introduction.
An approach that involves the extra effort of re-fitting the data has the advantage of being less approximate than the one that uses a confidence interval based on the SE. This method is the so-called Likelihood Ratio Test.
The basic idea is to compare directly the goodness of fit (as indicated by the minimum value of the extended least squares objective function) obtained using a model in which the parameter is fixed to the hypothesized value (the reduced model) with that obtained using a model in which the parameter is estimated (the full model).
A model is a reduced model of a full model if it is identical to the full model except that one or more parameters of the latter have been fixed to hypothesized values (usually 0). Consider the examples:
E.g. #1. Valid Full/Reduced Pair:
Full model: Cl = θ₁ + θ₂ (WT − 70)
Reduced model: Cl = θ₁
E.g. #2. Invalid Full/Reduced Pair:
Full model: Cl = θ₁ + θ₂ (WT − 70)
Reduced model: Cl = θ₁ · WT^θ₂
In example #1, fixing θ₂ to 0 produces the reduced model, while in example #2, no parameter of the full model can be fixed to a particular value to yield the "reduced" model. It will always be true that if the models are set up correctly, the number of parameters that must be estimated will be greater in the full model than in the reduced model. Note that this is not so for example #2.
The reduced model expresses the null hypothesis; the full model expresses an alternative hypothesis. In example #1 above, the null hypothesis is "typical value of clearance is independent of weight", and the alternative is "typical value of clearance depends linearly on weight."
Note an important point here: the alternative hypothesis represents a particular alternative, and the likelihood ratio test using it will most sensitively reject the null hypothesis only when this particular alternative holds. If the truth were instead that "the typical value of clearance is inversely proportional to weight" (so that as weight increases, the typical value of clearance decreases, a situation which rarely holds), the likelihood ratio test using the alternative we have stated would not be particularly sensitive for rejecting the null hypothesis, and we might fail to do so. In contrast, we might succeed in rejecting the null hypothesis if we used some other alternative (full) model closer to the truth.
Part of the NONMEM output is the "Minimum Value of the Objective Function" (see Chapter 2). Denote this by O. If NONMEM's approximate model were the true model, then O would be minus twice the maximum logarithm of the likelihood of the data (for those readers unfamiliar with likelihoods, and curious as to what they are, we suggest consulting a statistics textbook). Statistical theory tells us that the difference in minus twice the maximum log likelihoods between a full and a reduced model can be referenced to a known distribution. Thus, to perform the Likelihood Ratio Test, one proceeds as follows.
Let O_full be the minimum value of the objective function from the fit to the full model, and let O_reduced be the corresponding quantity from the fit to the reduced model. Fit both models separately, yielding O_full and O_reduced, and form the statistic

    Δ = O_reduced − O_full

This statistic is approximately distributed chi-square (χ²) with q degrees of freedom, where q is the number of parameters whose values are fixed in the reduced model. For an α-level test, compare Δ to the 100(1−α) percentile of the χ² distribution with q degrees of freedom.

In particular, when exactly one parameter of the full model is fixed in the reduced model, a decrease of 3.84 in the minimum value of the objective function is significant at α = .05.
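A minimal Python sketch of the comparison (the two objective function values are hypothetical, standing in for NONMEM's reported minimum values for the reduced and full models):

    from scipy.stats import chi2

    o_reduced = 1258.4   # hypothetical minimum objective value, reduced model
    o_full = 1251.1      # hypothetical minimum objective value, full model
    q = 1                # number of parameters fixed in the reduced model
    alpha = 0.05

    delta = o_reduced - o_full
    critical = chi2.ppf(1 - alpha, df=q)       # 3.84 when q = 1
    print(delta, critical, delta > critical)   # True: reject the null hypothesis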
If NONMEM's approximate model (linear in the random effects) were the true model and, in addition, the model were linear in the fixed effects, then Δ/q would be (approximately) distributed according to the F distribution with q and N − p degrees of freedom, where N is the total number of observations and p is the number of parameters estimated in the full model. Since the critical value obtained from this F distribution equals the corresponding χ² critical value only when N is "large", and is greater otherwise, it is more conservative to reference Δ to the F distribution in all instances, even when the model is nonlinear in the fixed effects.
An idea related to hypothesis testing is this: when faced with alternative explanations (models) for some data, how does one use the data to determine which model(s) is (are) most plausible? When one of the models is a reduced sub-model of the other, and there is some reason to prefer the reduced model to the alternative, then the Likelihood Ratio Test can be used to test whether this a priori preference must be rejected (at the α level). However, when one gives the matter some thought, there is usually little objective reason to prefer one model over another on a priori grounds. For example, although possibly more convenient, a monoexponential model is, if anything, less likely on biological grounds than a biexponential one.
Not only may there be no clear a priori preference favoring one contending model over another, but the two models may not form a full and reduced model pair. In such circumstances, one must rely on some goodness-of-fit criterion to distinguish between the models. Consider choosing between just two models (the ideas to be discussed readily generalize to more than two), denoted model A and model B. If the number of free parameters in model A (p_A) is the same as that of model B (p_B), then here is a reasonable criterion: favor the model with the better fit, that is, the one with the smaller minimum value of the objective function. Note that there is no p value associated with this statement; no hypothesis is being tested.
Unfortunately, if p_A differs from p_B, one cannot simply compare the minimum values of the objective function, O_A and O_B, and choose the model with the smaller value. The reason is best understood when A and B are a full and reduced model pair. The full model will always fit the data better (i.e., have a smaller minimum objective function value), as it has more free parameters with which to adjust its shape to the data. While the same is not always true for an arbitrary pair of nonlinear models with different numbers of parameters, it is often true: the model with the greater number of parameters will fit a given data set better than the model with fewer parameters. Yet the larger (more parameters) model may not really be better; it may, in fact, fit an entirely new data set worse than the simpler model if its better fit to the original data came about simply because it exploited the flexibility of its extra parameter(s) to fit some random aspect of those data.
Based on the above intuitive argument, it seems that one should penalize the larger model in some way before comparing the likelihoods. This intuition is formally realized in the Akaike Information Criterion (AIC), which says that one should compute

    (O_A − O_B) + 2(p_A − p_B)

and choose model B if this quantity is greater than 0, and model A if it is less than 0. The second term penalizes model A if p_A is greater than p_B, and vice versa. When p_A equals p_B, the criterion reduces to the comparison of O_A and O_B described previously.
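The comparison can be sketched in a few lines of Python (the objective function values and parameter counts here are hypothetical):

    def aic_preference(o_a, p_a, o_b, p_b):
        # Positive values favor model B, negative values favor model A,
        # following the AIC comparison described above.
        return (o_a - o_b) + 2 * (p_a - p_b)

    # Hypothetical example: model A (biexponential, 4 parameters) versus
    # model B (monoexponential, 2 parameters).
    diff = aic_preference(o_a=1250.3, p_a=4, o_b=1252.9, p_b=2)
    print(diff, "-> favor", "B" if diff > 0 else "A")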