 
NOTE Communicated by Thomas Dietterich
Combined 5 × 2 cv F Test for Comparing Supervised
Classification Learning Algorithms
Ethem Alpaydin
IDIAP, CP 592 CH1920 Martigny, Switzerland
and
Department of Computer Engineering, Boğaziçi University, TR80815 Istanbul, Turkey
Dietterich (1998) reviews five statistical tests and proposes the 5 × 2 cv t
test for determining whether there is a significant difference between the
error rates of two classifiers. In our experiments, we noticed that the 5 × 2
cv t test result may vary depending on factors that should not affect the
test, and we propose a variant, the combined 5 × 2 cv F test, that combines
multiple statistics to get a more robust test. Simulation results show that
this combined version of the test has lower type I error and higher power
than the original 5 × 2 cv t test.
1 Introduction
Given two learning algorithms and a training set, we want to test whether the
two algorithms construct classifiers that have the same error rate on a test
example. The way we proceed is as follows: Given a labeled sample, we
divide it into a training set and a test set (or many such pairs), train the
