Statistics Seminar (35th Session)

Date/Time   Tuesday, December 2, 2008, 15:00-15:50
Venue       Lecture Room 3, 3rd Floor, Faculty of Economics New Building
Speaker     Muni S. Srivastava (University of Toronto)
Title       Comparison of Discrimination Methods for High Dimensional Data

Abstract

Dudoit, Fridlyand and Speed (2002) compare several discrimination methods 
for the classification of tumors using gene expression data. The comparison
includes Fisher's (1936) linear discriminant analysis (FLDA), the 
classification and regression tree (CART) method of Breiman, Friedman, 
Olshen and Stone (1984), the aggregating classifiers of Breiman (1996, 1998), 
which include “bagging”, the “boosting” method of Freund and 
Schapire (1997), and the nearest neighbour (NN) method of Fix and Hodges 
(1951). The comparison also included two further methods, diagonal quadratic 
discriminant analysis (DQDA) and diagonal linear discriminant analysis 
(DLDA). In the DQDA method, 
it is assumed that the population covariance matrices are diagonal but 
unequal across groups. The likelihood ratio rule is derived as if the 
parameters were known, and sample estimates are then substituted into it. 
In the DLDA method, by contrast, the population covariance matrices are 
assumed to be not only diagonal but also equal across groups, and the rule 
is obtained in the same manner as in DQDA. Among all the methods considered 
by Dudoit et al. (2002), only two, DLDA and NN, performed well. The NN 
method, however, is computationally intensive and performs no better than 
DLDA, especially when classifying into only two populations, so it is not 
included in our study. 
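To make the diagonal plug-in rules concrete, here is a minimal sketch in Python with NumPy. The function and variable names are our own illustration, not taken from the papers, and the discriminant scores are the standard Gaussian plug-in forms under the stated diagonal-covariance assumptions:

```python
import numpy as np

def fit_diagonal_rules(X, y):
    """Estimate per-class means and diagonal variances.

    X: (n, p) training matrix, y: (n,) integer class labels.
    """
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # per-class diagonal variances (used by DQDA)
    var_k = np.array([X[y == k].var(axis=0, ddof=1) for k in classes])
    # pooled diagonal variances (used by DLDA)
    n, g = len(y), len(classes)
    pooled = sum((np.sum(y == k) - 1) * v
                 for k, v in zip(classes, var_k)) / (n - g)
    return classes, means, var_k, pooled

def dqda_predict(x, classes, means, var_k):
    # plug-in likelihood-ratio rule with unequal diagonal covariances:
    # minimize sum_j (x_j - mu_kj)^2 / s_kj^2 + log s_kj^2 over classes k
    scores = [np.sum((x - m) ** 2 / v + np.log(v))
              for m, v in zip(means, var_k)]
    return classes[int(np.argmin(scores))]

def dlda_predict(x, classes, means, pooled):
    # equal diagonal covariances: the log-variance term is the same for
    # every class, so only the variance-weighted distance to each mean matters
    scores = [np.sum((x - m) ** 2 / pooled) for m in means]
    return classes[int(np.argmin(scores))]
```

Because both rules estimate only p variances per group rather than a full p x p covariance matrix, they remain usable when p is much larger than n.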
While it is not possible to pinpoint why the other methods performed 
poorly, the poor performance of FLDA may be due to the large dimension p 
of the data, even when the degrees of freedom n associated with the sample 
covariance exceed p. In large dimensions the sample covariance matrix may 
become nearly singular, with very small eigenvalues. For this reason it may 
be reasonable to consider a version of the principal component method that 
is applicable even when p ≥ n. Using the Moore-Penrose inverse, a general 
method based on the 
minimum distance rule is proposed. A second method, based on an empirical 
Bayes estimate of the inverse covariance matrix, is also proposed, together 
with a variation of it. We compare these three new methods with the DLDA 
method of Dudoit et al. (2002).
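As an illustration of the Moore-Penrose idea, the following sketch builds a minimum distance rule from the pooled sample covariance. This is our own simplified rendering under generic assumptions, not necessarily the exact rule proposed in the talk:

```python
import numpy as np

def minimum_distance_rule(X, y):
    """Minimum distance classifier using the Moore-Penrose inverse of the
    pooled sample covariance, usable even when p >= n.
    """
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # pooled within-class covariance; singular whenever p exceeds the
    # within-class degrees of freedom n - g
    centered = np.vstack([X[y == k] - X[y == k].mean(axis=0)
                          for k in classes])
    S = centered.T @ centered / (len(y) - len(classes))
    S_pinv = np.linalg.pinv(S)  # Moore-Penrose generalized inverse

    def predict(x):
        # assign x to the class whose mean is nearest in the metric S^+
        d = [(x - m) @ S_pinv @ (x - m) for m in means]
        return classes[int(np.argmin(d))]

    return predict
```

Where the ordinary inverse of S does not exist for p ≥ n, the Moore-Penrose inverse measures distance within the subspace spanned by the data, which is what makes the rule well defined in high dimensions.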