The “Peeking” Effect in Supervised Feature Selection on Diffusion Tensor Imaging Data

Published online before print July 18, 2013, doi: 10.3174/ajnr.A3685
AJNR 2013 34: E107

S. Diciotti,^a S. Ciulli^a and M. Mascalchi^a
^aDepartment of Clinical and Experimental Biomedical Sciences
University of Florence
Florence, Italy

M. Giannelli^b
^bUnit of Medical Physics
Pisa University Hospital
Azienda Ospedaliero-Universitaria Pisana
Pisa, Italy

N. Toschi^c
^cMedical Physics Section, Department of Biomedicine and Prevention
Faculty of Medicine, University of Rome Tor Vergata
Rome, Italy

We read with great interest the article by Haller et al1 in the February 2013 issue of the American Journal of Neuroradiology. The authors used whole-brain diffusion tensor imaging–derived fractional anisotropy (FA) data, skeletonized through use of the standard tract-based spatial statistics (TBSS) pipeline, to achieve the following: 1) report significant group differences in FA among mild cognitive impairment (MCI) subtypes, and 2) perform individual classification of MCI subtypes by using a supervised feature selection procedure combined with a support vector machine (SVM) classifier. The study reports extremely high classification performances (100% sensitivity and 94%–100% specificity), which the authors describe as perhaps “too optimistic” and partially ascribe to “some degree of overfitting,” possibly also due to the use of feature selection.

The above-mentioned study presents a questionable use of supervised feature selection, which was performed on the entire dataset (ie, on both training and test data) instead of only on the training set of each partition generated during the cross-validation procedure. It is well known that using test set labels to perform inference on a feature subset during the learning process can cause an overestimation of the generalization capabilities of the classifier (sometimes called the “peeking” effect) and that this effect is particularly severe when a large number of features are removed (as in this whole-brain DTI study, in which approximately 150,000 features were reduced to 1000).2,3 In other words, evaluating the classifier on instances (ie, data “points”) that were already used for feature selection corresponds to providing it with “hints” about the solution of the classification problem, and Haller et al1 recognized this circumstance as a “limitation” of their study. However, this methodologic mistake3 (which unfortunately appears in several recent studies in the MR imaging literature) does not constitute a mere theoretic concern but rather can have important consequences for the final results.3

To clarify and exemplify our point, we analyzed DTI data from a patient cohort presented in a previous MCI-Alzheimer disease (AD) classification study.4 Specifically, we attempted to discriminate between 30 patients with amnestic MCI and 21 with mild AD by using the same processing pipeline (a Relief-F feature selection of the top 1000 features followed by an SVM classifier and 10 repetitions of a 10-fold cross-validation) and the same type of data (skeletonized whole-brain FA data) used by Haller et al.1 We performed the analysis with either incorrect cross-validation (ie, feature selection on the entire dataset followed by classification in cross-validation, as carried out by Haller et al1) or correct cross-validation (feature selection within each training set of the cross-validation).

In the former analysis, patients with mild AD were classified with 80.0% sensitivity and 96.7% specificity, while in the latter analysis, results dropped to 45.3% sensitivity and 67.3% specificity. These data demonstrate how strongly the “peeking” effect can inflate the estimated generalization capabilities in a cross-validation study that uses whole-brain TBSS data, and we speculate that the sensitivity/specificity values reported by Haller et al1 would be substantially lower if an orthodox feature-selection procedure were applied to their data.
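The contrast between the two cross-validation schemes can be sketched on synthetic data. The following is a minimal illustration under stated assumptions, not the pipeline used in either study: the data are pure Gaussian noise (so the true accuracy is 50%), a simple class-mean-gap filter stands in for Relief-F, a nearest-centroid rule stands in for the SVM, and a single 10-fold split replaces the 10 repetitions. All function names and parameter values are hypothetical.

```python
import random

def make_noise_data(n_per_class=25, n_features=2000, seed=0):
    """Pure-noise data: no feature carries information about the labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(2 * n_per_class)]
    y = [0] * n_per_class + [1] * n_per_class
    return X, y

def class_means(X, y, rows, cols):
    """Per-class mean of each column in `cols`, using only samples in `rows`."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members)
                        for j in cols]
    return means

def select_features(X, y, rows, k):
    """Supervised filter (stand-in for Relief-F): keep the k features with
    the largest absolute gap between class means."""
    cols = range(len(X[0]))
    m = class_means(X, y, rows, cols)
    gap = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(cols, key=lambda j: gap[j], reverse=True)[:k]

def predict(X, means, cols, i):
    """Nearest-centroid rule (stand-in for the SVM) on the selected columns."""
    dist = {label: sum((X[i][j] - mu) ** 2
                       for j, mu in zip(cols, means[label]))
            for label in (0, 1)}
    return min(dist, key=dist.get)

def cv_accuracy(X, y, k=20, n_folds=10, peek=False, seed=1):
    """k-fold cross-validated accuracy.

    peek=True  selects features once on ALL samples (test labels leak in).
    peek=False selects features on the training part of each fold only.
    """
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    global_cols = select_features(X, y, idx, k) if peek else None
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        cols = global_cols if peek else select_features(X, y, train, k)
        means = class_means(X, y, train, cols)
        correct += sum(predict(X, means, cols, i) == y[i] for i in test)
    return correct / len(y)

X, y = make_noise_data()
acc_peek = cv_accuracy(X, y, peek=True)
acc_nested = cv_accuracy(X, y, peek=False)
print(f"feature selection on the entire dataset: {acc_peek:.2f}")
print(f"feature selection within each fold:      {acc_nested:.2f}")
```

Because the features are noise by construction, any accuracy well above 50% in the first line of output can only come from the selection step having seen the test labels; the second estimate should stay near chance, up to sampling variability.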

In conclusion, given the relevance and potential of MCI subtype discrimination through MR imaging feature extraction and selection, full consideration of the methodologic pitfalls of combining supervised feature selection procedures with SVM in whole-brain imaging data analysis is highly recommended.

References

1. Haller S, Missonnier P, Herrmann FR, et al. Individual classification of mild cognitive impairment subtypes by support vector machine analysis of white matter DTI. AJNR Am J Neuroradiol 2013;34:283–91
2. Pereira F, Mitchell T, Botvinick M. Machine learning classifiers and fMRI: a tutorial overview. Neuroimage 2009;45(1 suppl):S199–209
3. Smialowski P, Frishman D, Kramer S. Pitfalls of supervised feature selection. Bioinformatics 2010;26:440–43
4. Diciotti S, Ginestroni A, Bessi V, et al. Identification of mild Alzheimer’s disease through automated classification of structural MRI features. Conf Proc IEEE Eng Med Biol Soc 2012;2012:428–31

Reply

Published online before print July 18, 2013, doi: 10.3174/ajnr.A3699
AJNR 2013 34: E108-E109

S. Haller^a and K.-O. Lovblad^a
^aDepartment of Imaging and Medical Informatics

P. Giannakopoulos^b
^bDepartment of Mental Health and Psychiatry
University Hospitals of Geneva and Faculty of Medicine of the University of Geneva
Geneva, Switzerland

D. Van De Ville^c,d
^cDepartment of Radiology and Medical Informatics
University of Geneva
Geneva, Switzerland
^dInstitute of Bioengineering
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Lausanne, Switzerland

We thank Diciotti et al for their interest in our study and their thoughtful comments regarding the methodology that was used to obtain part of the results reported in our manuscript. We acknowledge that feature selection is an important part of learning algorithms and is still the subject of research in high-dimensional (and structured) datasets. We had already mentioned the “peeking” issue raised by Diciotti et al as a limitation in our manuscript.

Motivated by the comment by Diciotti et al, we repeated our data analysis by using feature selection within the cross-validation folds. In particular, we reanalyzed the data by using Relief-F feature selection of the top 1000 features followed by a support vector machine (SVM) classifier within a leave-one-out cross-validation scheme. We did not optimize the number of features or the SVM classifier parameters, which we kept identical to the best setting in the original manuscript. We obtained results that are significantly above chance for discrimination, for example, between single-domain frontal mild cognitive impairment (SD-fMCI) and single-domain amnestic mild cognitive impairment (SD-aMCI). The accuracy was 84.6%, with a false-positive rate ranging from 6.7% (SD-fMCI) to 27.3% (SD-aMCI). As can be expected, these results are less optimistic than the values reported in the manuscript. This can be explained by a number of factors. For example, we note that the data analysis pipeline is not yet fully optimized because the time to respond to the letter by Diciotti et al was limited. We reason that this result nevertheless underlines the potential and feasibility of SVM for individual classification of MCI subtypes.

We would like to use this opportunity to further elaborate a number of current limitations of classification analysis at the individual level. Neuroimaging has been dominated by group-level statistical analyses to identify brain regions involved in certain diseases; however, such analyses do not necessarily reflect predictive value for the diagnosis of individual cases in clinical neuroradiology. Recent trends in neuroimaging data analysis are increasingly adopting tools from pattern recognition to evaluate results and develop potentially new imaging markers. This trend represents a fundamental shift in paradigm. Diciotti et al highlighted, in their letter, the issue of proper feature selection, and we would like to add a few other considerations for future development. In addition to these methodologic challenges, there are open medicolegal issues such as approval by the FDA or European Union.

A typical feature set extracted from MR imaging data can easily contain more than 100,000 features. In most cases, the features are related (similarity of adjacent or homologous voxels), and only a limited number will carry discriminative information. The selection of the best features is a long-standing problem in machine learning, which can be dealt with either explicitly (by a separate feature-selection step) or implicitly (by regularization in the classification method). In any case, increasing the dimensionality by adding more nondiscriminatory features increases computational demands and decreases performance. Identifying the optimal number of features (or tuning of regularization parameters) is nontrivial in practice.
Structural MR imaging data typically have several hundred thousand voxels, while typical single-center studies have around 20–50 individuals per group. Cross-validation is frequently used in such cases, in which the number of participants is small relative to the size of the data (it was also used by Diciotti et al), yet it has its own limitations. Ideally, the classifier should be trained on one dataset and tested on another independent dataset to estimate the “real world” performance in clinical neuroradiology. Evidently, the available sample size for single-center studies is, in most cases, insufficient for this.
Support vector machines have been widely applied to neuroimaging data, probably because of their robustness against outliers, yet they were not specifically developed for neuroimaging data. SVM does not exploit spatial structure (ie, features can be randomly permuted without modifying the results). However, the brain has specific spatial structure so that adjacent or homologous voxels are more likely to have similar features compared with distant voxels. Introducing prior information about spatial structure in classification algorithms is an important research topic (eg, some algorithms recently proposed hierarchical clustering to regroup similar voxels and reinforce the robustness).1
There is substantial normal interindividual variation in brain morphometry, even in healthy volunteers (eg, up to 15% variation in cortical thickness).2 Correspondingly, we showed in a previous study that within-subject cortical asymmetry varies less: discrimination between at-risk mental state and volunteers was possible only on the basis of within-subject cortical asymmetry and was impossible on the basis of direct assessment of cortical thickness between subjects.3 While this study was performed in the domain of psychosis, the principles are also applicable to neurodegenerative diseases including dementia. While most classification techniques can exploit multivariate information and thus, theoretically, can reveal discriminative information by “clever” combinations of features, the high-dimensional nature and variability of the data mean that classification could benefit from incorporating domain-specific knowledge.
There is interindividual variation in the neurocognitive reserve, which was described already in 1968.4 The same degree of clinical neurocognitive impairment can be caused by different levels of brain pathology—or from the other perspective, the same degree of brain pathology can evoke variable degrees of clinical neurocognitive impairment. This is due to individual factors such as education and social integration, which represent an important confound for classification analyses. Taking into account these factors is not obvious and should ideally be done within the classification algorithm and not as a separate preprocessing step (on the training data within the cross-validation fold).
MR imaging usually includes multiple pulse sequences. To increase the accuracy and, in particular, the robustness of individual classification analyses, it is probably beneficial to combine the information of multiple pulse sequences, ideally in combination with nonimaging parameters such as neuropsychologic tests, blood or CSF samples, and so forth. Determining the optimal combination of multiple domains in practice is, however, not trivial.5
Additional potentially confounding factors include noise in the data, between-scanner variability, variability in data preprocessing, and patient selection, among others.
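
The remark above that SVM features "can be randomly permuted without modifying the results" can be checked with a small sketch. It relies on the fact that a kernel SVM sees the data only through pairwise kernel values; for a linear kernel these are inner products, which do not change when the same reordering of features is applied to every subject. The data below are synthetic, and the sizes and names are illustrative assumptions only.

```python
import math
import random

rng = random.Random(0)
n_samples, n_features = 6, 50

# Synthetic "subjects x voxels" matrix (random values, illustration only).
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
     for _ in range(n_samples)]

# One fixed shuffle of the voxel order, applied identically to every subject.
perm = list(range(n_features))
rng.shuffle(perm)
X_shuffled = [[row[j] for j in perm] for row in X]

def linear_gram(data):
    """Pairwise inner products: all a linear-kernel SVM uses from the data."""
    return [[sum(a * b for a, b in zip(u, v)) for v in data] for u in data]

K = linear_gram(X)
K_shuffled = linear_gram(X_shuffled)

# Compare with a tolerance, since the summation order differs.
same = all(
    math.isclose(K[i][j], K_shuffled[i][j], rel_tol=1e-9, abs_tol=1e-9)
    for i in range(n_samples)
    for j in range(n_samples)
)
print("Gram matrix unchanged by voxel shuffling:", same)
```

Since the Gram matrix is identical up to rounding, the fitted decision function is too, which is why spatial neighborhood information has to be introduced explicitly if the classifier is to use it.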

In summary, individual-level classification of neuroimaging data is an emerging field and is still hindered by fundamental limitations of the methodology, including optimal feature selection, incorporating domain knowledge into the classification, and integration of multiparametric measurements.

In addition, we believe that further methodologic developments should be based on larger datasets and multicentric studies to increase both reproducibility and predictability. Recent data-sharing initiatives such as the Alzheimer’s Disease Neuroimaging Initiative,6 in combination with cloud-computing power, will provide the necessary prerequisites for these developments. In the near future, we will, hopefully, see new advances to bring individual-level classification analysis to the next level to provide earlier and more accurate diagnosis and to eventually improve patient care.

References

1. Jenatton R, Gramfort A, Michel V, et al. Multiscale mining of fMRI data with hierarchical structured sparsity. SIAM J Imaging Sci 2012;5:835–56
2. Haug H. Brain sizes, surfaces, and neuronal sizes of the cortex cerebri: a stereological investigation of man and his variability and a comparison with some mammals (primates, whales, marsupials, insectivores, and one elephant). Am J Anat 1987;180:126–42
3. Haller S, Borgwardt SJ, Schindler C, et al. Can cortical thickness asymmetry analysis contribute to detection of at-risk mental state and first-episode psychosis? A pilot study. Radiology 2009;250:212–21
4. Tomlinson BE, Blessed G, Roth M. Observations on the brains of non-demented old people. J Neurol Sci 1968;7:331–56
5. Haller S, Lovblad KO, Giannakopoulos P. Principles of classification analyses in mild cognitive impairment (MCI) and Alzheimer disease. J Alzheimers Dis 2011;26(suppl 3):389–94
6. Mueller SG, Weiner MW, Thal LJ, et al. Ways toward an early diagnosis in Alzheimer’s disease: the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimers Dement 2005;1:55–66
