OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Prototype Vector Machine for Large Scale Semi-Supervised Learning

Abstract

Practical data mining rarely falls exactly into the supervised learning scenario. Rather, the growing amount of unlabeled data poses a big challenge to large-scale semi-supervised learning (SSL). We note that the computational intensiveness of graph-based SSL arises largely from the manifold or graph regularization, which in turn leads to large models that are difficult to handle. To alleviate this, we propose the prototype vector machine (PVM), a highly scalable, graph-based algorithm for large-scale SSL. Our key innovation is the use of "prototype vectors" for efficient approximation of both the graph-based regularizer and the model representation. The choice of prototypes is grounded upon two important criteria: they not only perform effective low-rank approximation of the kernel matrix, but also span a model suffering the minimum information loss compared with the complete model. We demonstrate encouraging performance and appealing scaling properties of the PVM on a number of machine learning benchmark data sets.

Authors:
Zhang, Kai; Kwok, James T.; Parvin, Bahram
Publication Date:
2009-04-29
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
Life Sciences Division
OSTI Identifier:
960433
Report Number(s):
LBNL-1953E
TRN: US200923%%438
DOE Contract Number:
DE-AC02-05CH11231
Resource Type:
Conference
Resource Relation:
Conference: 26th International Conference on Machine Learning, Montreal, Canada, June 14-18, 2009
Country of Publication:
United States
Language:
English
Subject:
97; ALGORITHMS; APPROXIMATIONS; BENCHMARKS; KERNELS; LEARNING; MINING; PERFORMANCE; VECTORS

Citation Formats

Zhang, Kai, Kwok, James T., and Parvin, Bahram. Prototype Vector Machine for Large Scale Semi-Supervised Learning. United States: N. p., 2009. Web.
Zhang, Kai, Kwok, James T., & Parvin, Bahram. Prototype Vector Machine for Large Scale Semi-Supervised Learning. United States.
Zhang, Kai, Kwok, James T., and Parvin, Bahram. 2009. "Prototype Vector Machine for Large Scale Semi-Supervised Learning". United States. https://www.osti.gov/servlets/purl/960433.
@inproceedings{osti_960433,
title = {Prototype Vector Machine for Large Scale Semi-Supervised Learning},
author = {Zhang, Kai and Kwok, James T. and Parvin, Bahram},
abstractNote = {Practical data mining rarely falls exactly into the supervised learning scenario. Rather, the growing amount of unlabeled data poses a big challenge to large-scale semi-supervised learning (SSL). We note that the computational intensiveness of graph-based SSL arises largely from the manifold or graph regularization, which in turn leads to large models that are difficult to handle. To alleviate this, we propose the prototype vector machine (PVM), a highly scalable, graph-based algorithm for large-scale SSL. Our key innovation is the use of "prototype vectors" for efficient approximation of both the graph-based regularizer and the model representation. The choice of prototypes is grounded upon two important criteria: they not only perform effective low-rank approximation of the kernel matrix, but also span a model suffering the minimum information loss compared with the complete model. We demonstrate encouraging performance and appealing scaling properties of the PVM on a number of machine learning benchmark data sets.},
booktitle = {26th International Conference on Machine Learning, Montreal, Canada, June 14-18, 2009},
url = {https://www.osti.gov/servlets/purl/960433},
place = {United States},
year = {2009},
month = apr
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

  • Machine learning is used in many applications, from machine vision to speech recognition to decision support systems, and is used to test applications. However, though much has been done to evaluate the performance of machine learning algorithms, little has been done to verify the algorithms or examine their failure modes. Moreover, complex learning frameworks often require stepping beyond black box evaluation to distinguish between errors based on natural limits on learning and errors that arise from mistakes in implementation. We present a conceptual architecture, failure model and taxonomy, and failure modes and effects analysis (FMEA) of a semi-supervised, multi-modal learning system, and provide specific examples from its use in a radiological analysis assistant system. The goal of the research described in this paper is to provide a foundation from which dependability analysis of systems using semi-supervised, multi-modal learning can be conducted. The methods presented provide a first step towards that overall goal.
  • Motivation: As the amount of biological sequence data continues to grow exponentially we face the increasing challenge of assigning function to this enormous molecular ‘parts list’. The most popular approaches to this challenge make use of the simplifying assumption that similar functional molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence) which often are a significant fraction of the total homolog collection for a given sequence. We introduce a Support Vector Machine (SVM)-based tool to detect Homology Using Semisupervised iTerative LEarning (SVM-HUSTLE) that detects significantly more remote homologs than current state-of-the-art sequence or cluster-based methods. As opposed to building profiles or position specific scoring matrices, SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets. SVM-HUSTLE combines principles of semi-supervised learning theory with statistical sampling to create many concurrent classifiers to iteratively detect and refine on-the-fly patterns indicating homology. Results: When compared against existing methods for identifying protein homologs (BLASTp, PSI-BLAST, RANKPROP, MOTIFPROP and their variants) on the SCOP 1.59 benchmark dataset consisting of 7329 protein sequences, SVM-HUSTLE significantly outperforms each of the above methods using the most stringent ROC1 statistic with p-values less than 1e-20.
  • Identification of the original groundwater types present in geochemical mixtures observed in an aquifer is a challenging but very important task. Frequently, some of the groundwater types are related to different infiltration and/or contamination sources associated with various geochemical signatures and origins. The characterization of groundwater mixing processes typically requires solving complex inverse models representing groundwater flow and geochemical transport in the aquifer, where the inverse analysis accounts for available site data. Usually, the model is calibrated against the available data characterizing the spatial and temporal distribution of the observed geochemical types. Numerous different geochemical constituents and processes may need to be simulated in these models which further complicates the analyses. In this paper, we propose a new contaminant source identification approach that performs decomposition of the observation mixtures based on Non-negative Matrix Factorization (NMF) method for Blind Source Separation (BSS), coupled with a custom semi-supervised clustering algorithm. Our methodology, called NMFk, is capable of identifying (a) the unknown number of groundwater types and (b) the original geochemical concentration of the contaminant sources from measured geochemical mixtures with unknown mixing ratios without any additional site information. NMFk is tested on synthetic and real-world site data. Finally, the NMFk algorithm works with geochemical data represented in the form of concentrations, ratios (of two constituents; for example, isotope ratios), and delta notations (standard normalized stable isotope ratios).
  • A classification system is developed to identify driving situations from labeled examples of previous occurrences. The purpose of the classifier is to provide physical context to a separate system that mitigates unnecessary distractions, allowing the driver to maintain focus during periods of high difficulty. While watching videos of driving, we asked different users to indicate their perceptions of the current situation. We then trained a classifier to emulate the human recognition of driving situations using the Sandia Cognitive Framework. In unstructured conditions, such as driving in urban areas and the German autobahn, the classifier was able to correctly predict human perceptions of driving situations over 95% of the time. This paper focuses on the learning algorithms used to train the driving-situation classifier. Future work will reduce the human efforts needed to train the system.
  • In many practical situations thematic classes cannot be discriminated by spectral measurements alone. Often one needs additional features such as population density, road density, wetlands, elevation, soil types, etc., which are discrete attributes. On the other hand, remote sensing image features are continuous attributes. Finding a suitable statistical model and estimating its parameters is a challenging task in multisource (e.g., discrete and continuous attributes) data classification. In this paper we present a semi-supervised learning method by assuming that the samples were generated by a mixture model, where each component could be either a continuous or discrete distribution. Overall classification accuracy of the proposed method is improved by 12% in our initial experiments.