cv.nfeaturesLDA {animation} | R Documentation |
This function illustrates the process of finding the optimum number of variables using k-fold cross-validation in a linear discriminant analysis (LDA).
cv.nfeaturesLDA(data = matrix(rnorm(600), 60), cl = gl(3, 20), k = 5, cex.rg = c(0.5, 3), col.av = c("blue", "red"), ...)
data |
a data matrix containing the predictors in columns |
cl |
a factor indicating the classification of the rows of |
k |
the number of folds |
cex.rg |
the range of magnification applied to the points in the plot |
col.av |
the two colors used to denote, respectively, the rates of correct predictions in the i-th fold and the average rates over all k folds |
... |
arguments passed to |
For a classification problem, we usually wish to use as few variables as possible because of the difficulties brought by high dimensionality.
The selection procedure is as follows:
1. Split the whole data set randomly into k folds.
2. For each number of features g = 1, 2, ..., gmax, choose the g features that have the largest discriminatory power (measured by the F-statistic in ANOVA):
   a. For each fold i (i = 1, 2, ..., k), train an LDA model without the i-th fold data, then predict on the i-th fold to obtain the proportion of correct predictions p[gi];
   b. Average the k proportions to get the correct rate p[g].
3. Take as the optimum number of features the g with the largest p[g].
Note that gmax is set by ani.options('nmax') (i.e. the maximum number of features we want to choose).
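The procedure above can be sketched in plain R without any animation. This is a minimal illustration, not the package's implementation: it assumes MASS::lda() as the LDA fitter, uses a hypothetical helper f_stat() for the one-way ANOVA F-statistic, sets gmax by hand instead of reading ani.options('nmax'), and ranks the features on the training rows of each fold (the description does not say which rows the ranking uses).

```r
library(MASS)  # provides lda(); assumed here as the LDA fitter

set.seed(1)
x    <- matrix(rnorm(600), 60)  # same shape as the default 'data' argument
cl   <- gl(3, 20)               # same as the default 'cl' argument
k    <- 5                       # number of folds
gmax <- 5                       # set by hand; the function reads ani.options('nmax')

# Hypothetical helper: one-way ANOVA F-statistic of one predictor vs. the classes
f_stat <- function(v, cl) anova(lm(v ~ cl))[1, "F value"]

# Step 1: randomly assign every row to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = nrow(x)))

# acc[i, g]: proportion of correct predictions in fold i using g features
acc <- matrix(NA_real_, nrow = k, ncol = gmax)

for (g in seq_len(gmax)) {
  for (i in seq_len(k)) {
    train <- fold != i
    # Step 2: rank features by F-statistic (training rows only; an assumption)
    fs  <- apply(x[train, , drop = FALSE], 2, f_stat, cl = cl[train])
    top <- order(fs, decreasing = TRUE)[seq_len(g)]
    # Step 2a: fit without fold i, then predict fold i
    fit  <- lda(x[train, top, drop = FALSE], cl[train])
    pred <- predict(fit, x[!train, top, drop = FALSE])$class
    acc[i, g] <- mean(pred == cl[!train])
  }
}

p       <- colMeans(acc)   # Step 2b: average rate p[g] over the k folds
optimum <- which.max(p)    # Step 3: the g with the largest p[g]
```

Because the default data are pure noise, the rates should hover around chance level (about 1/3 for three balanced classes), and the chosen optimum carries no real meaning here.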
A list containing
accuracy |
a matrix in which the element in the i-th row and j-th column is the rate of correct predictions based on LDA, i.e. an LDA model is built with j variables and used to predict the data in the i-th fold (the test set) |
optimum |
the optimum number of features based on the cross-validation |
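A short sketch of how the returned list might be inspected, assuming the animation package is installed. The values passed to ani.options() are illustrative choices (nmax = 5 features, interval = 0 to avoid pauses between frames), and the variable name res is arbitrary.

```r
if (requireNamespace("animation", quietly = TRUE)) {
  library(animation)
  set.seed(1)
  # Illustrative settings; ani.options() returns the previous values
  oopt <- ani.options(nmax = 5, interval = 0)
  res <- cv.nfeaturesLDA(data = matrix(rnorm(600), 60), cl = gl(3, 20), k = 5)
  ani.options(oopt)            # restore the previous options

  res$optimum                  # number of features with the best average rate
  colMeans(res$accuracy)       # average rate of correct predictions per column
}
```

Averaging the columns of accuracy gives the per-feature-count rates that the optimum is based on.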
Yihui Xie <http://yihui.name>
Examples at https://yihui.name/animation/example/cv-nfeatureslda/
Maindonald J, Braun J (2007). Data Analysis and Graphics Using R - An Example-Based Approach. Cambridge University Press, 2nd edition. pp. 400