skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Expectation-maximization algorithms for learning a finite mixture of univariate survival time distributions from partially specified class values

Thesis/Dissertation ·
DOI:https://doi.org/10.2172/1116720· OSTI ID:1116720
 [1]
  1. Iowa State Univ., Ames, IA (United States)

Heterogeneity exists on a data set when samples from di erent classes are merged into the data set. Finite mixture models can be used to represent a survival time distribution on heterogeneous patient group by the proportions of each class and by the survival time distribution within each class as well. The heterogeneous data set cannot be explicitly decomposed to homogeneous subgroups unless all the samples are precisely labeled by their origin classes; such impossibility of decomposition is a barrier to overcome for estimating nite mixture models. The expectation-maximization (EM) algorithm has been used to obtain maximum likelihood estimates of nite mixture models by soft-decomposition of heterogeneous samples without labels for a subset or the entire set of data. In medical surveillance databases we can find partially labeled data, that is, while not completely unlabeled there is only imprecise information about class values. In this study we propose new EM algorithms that take advantages of using such partial labels, and thus incorporate more information than traditional EM algorithms. We particularly propose four variants of the EM algorithm named EM-OCML, EM-PCML, EM-HCML and EM-CPCML, each of which assumes a specific mechanism of missing class values. We conducted a simulation study on exponential survival trees with five classes and showed that the advantages of incorporating substantial amount of partially labeled data can be highly signi cant. We also showed model selection based on AIC values fairly works to select the best proposed algorithm on each specific data set. A case study on a real-world data set of gastric cancer provided by Surveillance, Epidemiology and End Results (SEER) program showed a superiority of EM-CPCML to not only the other proposed EM algorithms but also conventional supervised, unsupervised and semi-supervised learning algorithms.

Research Organization:
Ames Lab., Ames, IA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
AC02-07CH11358
OSTI ID:
1116720
Report Number(s):
IS-T 3103
Country of Publication:
United States
Language:
English