 
Summary: Data Complexity in Machine Learning
Ling Li and Yaser S. Abu-Mostafa
Learning Systems Group, California Institute of Technology
Abstract. We investigate the role of data complexity in the context of binary classification problems.
The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping
enforced by the data set. It is closely related to several existing principles used in machine learning such
as Occam's razor, the minimum description length, and the Bayesian approach. The data complexity
can also be defined based on a learning model, which is more realistic for applications. We demonstrate
the application of data complexity to two learning problems, data decomposition and data pruning.
In data decomposition, we illustrate that a data set is best approximated by its principal subsets,
which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that
outliers usually have high complexity contributions, and propose methods for estimating the complexity
contribution. Since in practice we have to approximate the ideal data complexity measures, we also
discuss the impact of such approximations.
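The notion of principal subsets as Pareto-optimal trade-offs between complexity and set size can be sketched concretely. The snippet below is an illustrative example only, not from the paper: it assumes each candidate subset has already been assigned a size and a (numeric) complexity value, and simply filters out the dominated ones.

```python
def pareto_optimal(subsets):
    """Return the Pareto-optimal (size, complexity) pairs.

    subsets: list of (size, complexity) tuples, one per candidate subset.
    A pair is dominated if some other pair is at least as large with
    strictly lower complexity, or strictly larger with no higher complexity.
    """
    optimal = []
    for i, (n_i, c_i) in enumerate(subsets):
        dominated = any(
            (n_j >= n_i and c_j < c_i) or (n_j > n_i and c_j <= c_i)
            for j, (n_j, c_j) in enumerate(subsets)
            if j != i
        )
        if not dominated:
            optimal.append((n_i, c_i))
    return sorted(set(optimal))

# Hypothetical candidates: (size, complexity) pairs.
candidates = [(10, 5.0), (8, 3.0), (10, 4.0), (6, 3.5), (4, 2.0)]
print(pareto_optimal(candidates))  # → [(4, 2.0), (8, 3.0), (10, 4.0)]
```

Here (10, 5.0) is dominated by (10, 4.0), and (6, 3.5) by (8, 3.0); the survivors form the complexity-size frontier that the principal subsets occupy.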
1 Introduction
Machine learning is about pattern extraction. A typical example is an image classifier that auto-
matically tells whether some specific object category, say cars, appears in an image. The classifier
would be constructed based on a training set of labeled image examples. It is relatively easy for
computers to "memorize" all the examples, but in order for the classifier to also be able to correctly
label images that have not been seen so far, meaningful patterns about images in general and the
