Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining
When can reliable inference be drawn in the "Big Data" context? This article presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and ecoinformatics, the data set is often variable rich but sample starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity, however, has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; and 3) the purely high-dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exascale data dimension. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
 Authors:
 Hero, Alfred O. [1]; Rajaratnam, Bala [2]
 [1] Univ. of Michigan, Ann Arbor, MI (United States)
 [2] Stanford Univ., CA (United States)
 Publication Date:
 Grant/Contract Number:
 NA0002534; FA95501310043; W911NF1110391; W911NF1210443; 2P01CA08763406A2; DMS0906392; DMSCMG1025465; AGS1003823; DMS1106642; DMSCAREER1352656; DARPAYFAN660011114131
 Type:
 Accepted Manuscript
 Journal Name:
 Proceedings of the IEEE
 Additional Journal Information:
 Journal Volume: 104; Journal Issue: 1; Journal ID: ISSN 0018-9219
 Publisher:
 Institute of Electrical and Electronics Engineers
 Research Org:
 Univ. of Michigan, Ann Arbor, MI (United States)
 Sponsoring Org:
 USDOE National Nuclear Security Administration (NNSA); National Institutes of Health (NIH); National Science Foundation (NSF); US Army Research Office (ARO)
 Country of Publication:
 United States
 Language:
 English
 Subject:
 97 MATHEMATICS AND COMPUTING; Asymptotic regimes; big data; correlation estimation; correlation mining; correlation screening; correlation selection; graphical models; large-scale inference; purely high dimensional; sample complexity; triple asymptotic framework; unifying learning theory
 OSTI Identifier:
 1367662
Hero, Alfred O., and Rajaratnam, Bala. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining. United States: N. p., 2015. Web. doi:10.1109/JPROC.2015.2494178.
Hero, Alfred O., & Rajaratnam, Bala. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining. United States. doi:10.1109/JPROC.2015.2494178.
Hero, Alfred O., and Rajaratnam, Bala. 2015. "Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining". United States. doi:10.1109/JPROC.2015.2494178. https://www.osti.gov/servlets/purl/1367662.
@article{osti_1367662,
title = {Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining},
author = {Hero, Alfred O. and Rajaratnam, Bala},
abstractNote = {When can reliable inference be drawn in the ``Big Data'' context? This article presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and ecoinformatics, the data set is often variable rich but sample starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for ``Big Data.'' Sample complexity, however, has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; and 3) the purely high-dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exascale data dimension. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.},
doi = {10.1109/JPROC.2015.2494178},
journal = {Proceedings of the IEEE},
number = 1,
volume = 104,
place = {United States},
year = {2015},
month = {12}
}