
Famous citation References


k-fold cross-validation Mosteller F. and Tukey J.W. Data analysis, including statistics. In Handbook of Social Psychology. Addison-Wesley, Reading, MA, 1968.

leave-one-out cross-validation P. A. Lachenbruch and M. R. Mickey, “Estimation of error rates in discriminant analysis,” Technometrics, vol. 10, no. 1, pp. 1–12, Feb. 1968.

Bootstrapping CV Bradley Efron. Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, 78:316-331, 1983.

--------------------------not formal---------------------------------------


Quenouille (1949) [300] An early application of data splitting (according to Zhang (1992) [395]).

Stone (1974) [355] Use of cross-validation for the choice and assessment of a prediction function. Concentrates on leave-one-out cross-validation. Source of early references.

Geisser (1975) [151] Similar in spirit to Stone (1974) [355]. Focuses strongly on finding a prediction (as opposed to using prediction to assess model), with no reference to parametric models. Allows more general splits than just leave-one-out.

Cox (1975) [89] Discusses data-splitting in the context of significance testing. In order to avoid problems with significance levels when a hypothesis is formulated and tested using the same data, data are split at random into two parts; the first is used to formulate the hypothesis, the second to test it. Optimal proportions of the split and the efficiency of the procedure are examined in a simple example.

McCarthy (1976) [263] Shows that for leave-m-out cross-validation, choosing the sets to leave out using a suitable balanced design can be more efficient than random splits, and nearly as good as evaluating all splits.

Mosteller and Tukey (1977) [278] Notes on cross-validation in a textbook. General remarks on need for and methods of cross-val. (Ch. 2), and comparison between cross-val. and the jackknife (Ch. 8).

Stone (1977) [356] Asymptotic equivalence between AIC and model choice criterion implied by cross-validation as in Stone (1974) [355]. Asymptotics lead to the Network Information Criterion (NIC) and with a further assumption (the true model is in the parametric class considered) to AIC.

Stone (1977) [357] Some results concerning the consistency of leave-one-out cross-validation criteria.

Snee (1977) [344] Review of methods for validating regression models: (1) compare predictions and coefficients to results from theory; (2) new data; (3) data splitting. DUPLEX algorithm for data splitting (systematically selects the two sets to be fairly similar in terms of range and variation).

Geisser and Eddy (1979) [152] `Predictive Sample Reuse': model selection by maximizing $f_{k}=\prod_{j} f_{j}(y_{j}|y_{-j};M_{k})$ where $f_{j}$ is 
(i) $\int f(y_{j}|y_{-j},\theta_{k}; M_{k})f(\theta_{k}|y_{-j}; M_{k}) \, d\theta_{k}$ or 
(ii) $f(y_{j}|y_{-j},\hat{\theta}_{k}^{(-j)}; M_{k})$. (i) and (ii) are asymptotically equivalent to each other and to AIC (see Stone (1977) [356]).
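A minimal sketch of version (ii), the plug-in form, may make the criterion concrete. It assumes i.i.d. normal observations (so conditioning on $y_{-j}$ enters only through the leave-one-out estimate) and compares two hypothetical candidate models, a zero-mean normal and a free-mean normal; the function names and the data are illustrative and do not come from the paper.

```python
import math

def normal_logpdf(y, mu, var):
    """Log density of N(mu, var) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mu) ** 2 / (2.0 * var)

def fit_zero_mean(ys):
    """Hypothetical model M_1: mean fixed at 0, ML estimate of the variance."""
    return 0.0, sum(y * y for y in ys) / len(ys)

def fit_free_mean(ys):
    """Hypothetical model M_2: ML estimates of both mean and variance."""
    mu = sum(ys) / len(ys)
    return mu, sum((y - mu) ** 2 for y in ys) / len(ys)

def log_psr(ys, fit):
    """log f_k = sum_j log f(y_j | theta_hat^(-j)), the plug-in version (ii)."""
    total = 0.0
    for j, y_j in enumerate(ys):
        rest = ys[:j] + ys[j + 1:]   # y_{-j}: all observations except the j-th
        mu, var = fit(rest)          # theta_hat^(-j), estimated without y_j
        total += normal_logpdf(y_j, mu, var)
    return total

# Illustrative data, clearly not centred at zero.
ys = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7]
scores = {
    "zero_mean": log_psr(ys, fit_zero_mean),
    "free_mean": log_psr(ys, fit_free_mean),
}
best = max(scores, key=scores.get)   # model with the largest predictive product
```

Working on the log scale avoids underflow in the product over $j$; version (i) would replace the plug-in density with a posterior predictive density, which is not sketched here.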

Chow et al. (1983) [80] Cross-validation estimates for a smoothing parameter in nonparametric density estimation. Consistency of cross-validated kernels and histograms.

Copas (1983) [84] Prediction (concentrating on MSEP) for multiple linear model, binary models also discussed. Fit to future data is always worse than for current data: `shrinkage'. `Preshrunk' estimators, where parameter estimates are shrunk to improve MSEP; shrinkage factor is estimated from the data. MSEP of a model selected using subset selection is typically worse than for the full model. In discussion, comments on, for example, similarity between distributions of current and future predictors. A very nice paper.

Cudeck and Browne (1983) [102] Model comparison for covariance structure models. Discussion of large-sample problems with GOF tests. Argues that there is no true model, only approximations, one reason to emphasise out-of-sample performance as a criterion. Cross-validation: random split into two, cross-validation index (discrepancy function), both ways to get double c-v. Compared to (rescaled) AIC and BIC as related one-sample methods. In an example, (1) larger models favoured (even by c-v) in large samples, (2) double c-v agreed both ways in only 2/6 cases, but differences small, (3) for number of parameters in selected model, AIC $>$ CV $>$ BIC. Nice paper.

Efron (1983) [120] Estimating the error rate of prediction (classification). Comparing cross-validation, jackknife, bootstrap and improvements of basic bs. Theoretical considerations and simulations, showing that c-v has higher variability than bs, and both are beaten by improved bs estimates. [The differences may be smaller for continuous responses.]
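As a rough illustration of the basic bootstrap estimate discussed in this entry, the sketch below corrects the apparent error rate by the average bootstrap `optimism'. It is a simplified version under stated assumptions: a hypothetical one-dimensional nearest-class-mean rule, made-up data, and resamples that miss a class are simply redrawn. It does not implement the improved bootstrap estimators of the paper.

```python
import random

def fit_nearest_mean(xs, labels):
    """Fit a toy classification rule: the two class means (labels are 0 or 1)."""
    x0 = [x for x, y in zip(xs, labels) if y == 0]
    x1 = [x for x, y in zip(xs, labels) if y == 1]
    return sum(x0) / len(x0), sum(x1) / len(x1)

def predict(rule, x):
    """Assign x to the class whose mean is nearest."""
    m0, m1 = rule
    return 0 if abs(x - m0) <= abs(x - m1) else 1

def error_rate(rule, xs, labels):
    """Fraction of points misclassified by the rule."""
    return sum(predict(rule, x) != y for x, y in zip(xs, labels)) / len(xs)

def bootstrap_error(xs, labels, n_boot=200, seed=0):
    """Return (apparent error, apparent error + average bootstrap optimism)."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = error_rate(fit_nearest_mean(xs, labels), xs, labels)
    optimism, done = 0.0, 0
    while done < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        bx = [xs[i] for i in idx]
        bl = [labels[i] for i in idx]
        if len(set(bl)) < 2:
            continue   # assumption: redraw resamples that contain only one class
        rule = fit_nearest_mean(bx, bl)
        # Optimism: error on the original data minus error on the resample itself.
        optimism += error_rate(rule, xs, labels) - error_rate(rule, bx, bl)
        done += 1
    return apparent, apparent + optimism / n_boot

# Illustrative overlapping classes.
xs = [0.2, 0.9, 1.1, 1.8, 2.4, 1.5, 2.1, 2.9, 3.3, 3.8]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
apparent, corrected = bootstrap_error(xs, labels)
```

The same `error_rate` helper could be reused for a leave-one-out estimate, which would allow the kind of comparison between c-v and bs that the paper reports.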

Efron and Gong (1983) [122] An expository paper on the bootstrap, the jackknife and cross-validation. Examines the connections between c-v and the others as estimators of prediction error.

Picard and Cook (1984) [295] Subset selection procedures for linear models. Assessment of predictive ability (MSE of prediction), noting optimism for models assessed using the same data as used for selection; need for validation data. Properties of assessment using data splitting.

Bunke and Droge (1984) [68] Estimates of MSEP for (a rather specialized) normal linear model, comparing $C_{p}$, cross-val. and bootstrap. Exact means and MSEs of these estimators given and compared. $C_{p}$ (which is the limit of the bootstrap estimate) does better than cross-validation.

Kuk (1984) [228] All subsets regression for proportional hazards models. Shows how this can be made computationally easy as follows: (1) use cross-validatory criterion where each uncensored observation in turn is changed to censored, (2) this is asymptotically equivalent to an AIC-like statistic involving the partial likelihood test statistic, (3) use Wald statistic instead, (4) this is formally equivalent to $C_{p}$ for a linear model.

Gong (1986) [162] Estimating prediction error (misclassification error), especially for a binary outcome. Comparing cross-val., bootstrap and jackknife estimates. Example from logistic regression where the `prediction rule' includes the (forward) model selection procedure. In simulations, c-v and jk do not improve on apparent error rate, but bs does.

Hastie (1987) [179] Discusses the links between the deviance and Kullback-Leibler divergence. Estimation and prediction. Notes on overfitting.

Li (1987) [241] Considers $C_{L}$ (generalization of $C_{p}$), (leave-one-out) cross-val. and generalized cross-val. for models of the type $Y_{i}=\mu_{i}+\epsilon_{i}$, where the model choice is for form of $\mu_{i}$. Shows that these criteria are asymptotically optimal in the sense that the selected model achieves the lower bound of

Browne and Cudeck (1989) [64] Covariance structure models. Estimating a cross-val. index using a calibration sample only. For a normal ML discrepancy function, this is equivalent to a rescaled AIC.

Burman (1989) [69] Considers cross-validation (`v-fold c-v') where data are split into v (rather than n) equal sets and each set is used in turn as validation sample. Also `repeated learning-testing methods' where data are repeatedly split at random into (uneven) calibration and validation samples. (Both of these are motivated by cases where ordinary c-v is computationally demanding.) Shows that results are biased when v is small and derives a correction term.
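The v-fold scheme described in this entry can be sketched as follows. This is a minimal illustration, assuming squared-error loss and a simple least-squares line as the prediction rule; the helper names and the simulated data are hypothetical, and the sketch returns the uncorrected estimate, without the correction term derived in the paper.

```python
import random

def v_fold_indices(n, v, seed=0):
    """Randomly partition range(n) into v folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::v] for i in range(v)]

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def v_fold_cv_mse(xs, ys, v, seed=0):
    """Uncorrected v-fold c-v estimate of mean squared prediction error."""
    n = len(xs)
    sq_err = 0.0
    for fold in v_fold_indices(n, v, seed):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Each set is used once as the validation sample.
        sq_err += sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold)
    return sq_err / n

# Illustrative data: a noisy line.
rng = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]
cv5 = v_fold_cv_mse(xs, ys, v=5)
loo = v_fold_cv_mse(xs, ys, v=len(xs))   # v = n gives ordinary leave-one-out
```

Setting `v` equal to the sample size recovers ordinary c-v, which is the computationally demanding case that motivates the smaller values of v.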

Camstra and Boomsma (1992) [72] A review of cross-validation methods, with emphasis on model selection for covariance structure analysis. Description vs. prediction. Measures of predictive validity (squared correlation and MSEP). Measures of goodness-of-fit and their properties. Cross-val. with validation data, and approximate methods when no validation data used. Relation of AIC, BIC and their modifications to cross-validation measures. Review of simulation studies in the literature. Good source of references.

Gelfand et al. (1992) [154] Model adequacy vs. model choice, predictive criteria for both. Leave-one-out cross-validation with Bayesian predictive densities. Criteria for assessment. Sampling-based methods for the computations. In discussion, see especially Raftery (e.g. on relationship of AIC and BIC to predictive criteria).

Shao (1993) [335] Shows that model selection (for linear models) based on leave-one-out cross-validation is inconsistent, but `leave-$n_{v}$-out' (i.e. averaging over all possible splits with $n_{v}$ observations in the validation data) is consistent if $n_{v}/n\rightarrow 1$. Instead of all splits, can also consider a `balanced' subset of them, or use Monte Carlo or an approximation.
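The Monte Carlo variant mentioned at the end of this entry might look like the sketch below: instead of averaging over all splits with $n_{v}$ validation observations, it averages over a fixed number of random splits. The two candidate models (a constant and a straight line), the function names and the data are illustrative assumptions, not taken from the paper.

```python
import random

def fit_constant(xs, ys):
    """Candidate model 1: predict the training mean everywhere."""
    c = sum(ys) / len(ys)
    return lambda x: c

def fit_line(xs, ys):
    """Candidate model 2: least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def mc_leave_nv_out(xs, ys, fit, n_v, n_splits=200, seed=0):
    """Average validation MSE over random splits with n_v validation points."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(n_splits):
        val = set(rng.sample(range(n), n_v))          # validation indices
        train = [i for i in range(n) if i not in val]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - model(xs[i])) ** 2 for i in val) / n_v
    return total / n_splits

# Illustrative data: a noisy line; most observations go to validation.
rng = random.Random(2)
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
scores = {
    "constant": mc_leave_nv_out(xs, ys, fit_constant, n_v=15),
    "line": mc_leave_nv_out(xs, ys, fit_line, n_v=15),
}
selected = min(scores, key=scores.get)
```

Using the same seed for both candidates means they are scored on identical splits, so the comparison is paired; that is a design choice of this sketch rather than something the entry prescribes.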

Zhang (1993) [396] Considers `multifold' cross-validation (i.e. `leave-$n_{v}$-out' with possibly $n_{v}>1$) for linear models. Shows the asymptotic equivalence of this to the FPE (i.e. generalised AIC) criterion when $n_{v}/n\rightarrow \lambda<1$ (cf. Shao (1993) [335]). Also discusses two simplified versions, one which avoids the need to consider all splits, and one based on the bootstrap.

Efron and Tibshirani (1993) [124] Chapter on cross-validation for estimation of prediction error (or, equivalently, computing a bias-correction for apparent error rate). Also bootstrap methods for computing the same.

Schumacher et al. (1997) [330] Prop. hazards model, selecting the `optimal' cutpoint for dichotomizing a covariate (leads to overestimation of the effect). Adjusted estimates based on leave-one-out cross-validation and bootstrap.

Ronchetti et al. (1997) [319] Robust cross-validation (using the approach of Shao (1993) [335]) for linear models. Robust criteria are used both for estimating the parameters in the calibration sample and for assessing performance in the validation sample. A simulation study to compare to standard approach, showing considerable superiority of the robust method.

Lin and Pourahmadi (1998) [243] Considers parametric and non-parametric non-linear time series models, applied to the Canadian lynx data. Considers the in-sample (using estimates of effective degrees of freedom) and out-of-sample (using last 14 observations as validation sample) predictive performance of various models, illustrating the influence of overfitting, especially on the most complicated models. A nice paper.

Sullivan et al. (1998) [360] Considers a rather extreme example of data mining: search for calendar effects in stock returns. Constructs a bootstrap procedure for estimating the p-value of selected models, explicitly accounting for the selection process by a succession of researchers. Also considers out-of-sample performance of selected models.