U.S. Department of Energy
Office of Scientific and Technical Information

FORTÉ Machine Learning

Technical Report · DOI: https://doi.org/10.2172/1561828 · OSTI ID: 1561828
 [1];  [1]
  1. Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Sandia National Laboratories (SNL-CA), Livermore, CA (United States)

Of the non-corrupted data collected by the Orbiting Experiment (FORTÉ) satellite’s Photo-Diode Detector during the year 2001, I estimate that 7.9% of 914 894 signals are noise. My result differs dramatically from Guillen’s estimate of 96%. To arrive at this estimate, I used Gaussian mixture model (GMM) clustering, a form of unsupervised machine learning, to aggregate the waveforms into groups based on the absolute value of the lowest 25 positive-frequency discrete Fourier transform coefficients. Then, I marked several of the groups as noise by inspecting a random sampling of waveforms from each group. Marking groups as either noise or non-noise is a supervised binary classification operation. After removing the signals in noise groups from further consideration, I clustered the remaining signals into families. Again, I used a GMM, but for the familial clustering I used a Non-Negative Matrix Factorization feature vector transform. The result was 9 distinct families of lightning signals, as well as a second stage of noise filtering. To efficiently represent the entirety of the signal space, I broke each family into deciles based on each signal’s distance from the family mean. In this case, distance means the log-likelihood based on the GMM. Signals in lower deciles are more similar in shape and amplitude to their family average. I took the top 200 samples from each decile of each family, resulting in 18 000 signals (9 families × 10 deciles × 200 samples). These signals approximately represent the entirety of the FORTÉ observations. To represent outliers, I also kept a zoo of the 1000 signals furthest from any family’s average. All told, the resulting data set represents the FORTÉ data with a reduction of about 51:1. To allow synthesis of an arbitrarily large number of test signals, I also captured each family’s average signal and the time-sample covariance matrix over the signals in each family.
Using these two pieces of information, I can synthesize new waveforms by drawing Gaussian random realizations from the family average and covariance matrix. I wrote a program to test the synthesis quality. The program shows me two signals on the screen, one synthesized and one randomly drawn from the data, and I attempt to identify the synthesized signal. Although the synthesis is imperfect, in an A/B comparison I correctly chose the synthesized signal only 36% of the time.
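
As a rough illustration of two numeric steps the abstract describes, the following Python sketch computes a DFT-magnitude feature vector and synthesizes waveforms from a family's average and time-sample covariance. It is a minimal sketch under stated assumptions, not the report's code: the function names, the choice to skip the zero-frequency bin, the 64-sample waveform length, and the stand-in pulse data are all hypothetical. The GMM clustering step itself is omitted.

```python
import numpy as np


def dft_magnitude_features(waveforms, n_coeffs=25):
    """Absolute value of the lowest n_coeffs positive-frequency DFT coefficients.

    Assumption: the zero-frequency (DC) term is skipped, so bins 1..n_coeffs
    are used; the report may define the range differently.
    """
    spectrum = np.fft.rfft(waveforms, axis=1)  # one-sided DFT of each row
    return np.abs(spectrum[:, 1:n_coeffs + 1])


def family_statistics(waveforms):
    """Average signal and time-sample covariance over one family's waveforms."""
    mean = waveforms.mean(axis=0)
    cov = np.cov(waveforms, rowvar=False)  # shape (n_samples, n_samples)
    return mean, cov


def synthesize(mean, cov, n_signals, rng):
    """Draw Gaussian random realizations with the given mean and covariance."""
    return rng.multivariate_normal(mean, cov, size=n_signals)


# Stand-in data: 500 noisy pulses of 64 samples each (hypothetical, not FORTÉ data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 64)
pulse = np.exp(-((t - 0.3) / 0.05) ** 2)
family = pulse + 0.1 * rng.standard_normal((500, 64))

features = dft_magnitude_features(family)   # one 25-element vector per waveform
mean, cov = family_statistics(family)
synthetic = synthesize(mean, cov, n_signals=1000, rng=rng)
```

Because only the mean and covariance are retained, any number of synthetic waveforms can be drawn per family. The draws reproduce second-order statistics only, which is consistent with the abstract's remark that the synthesis is imperfect.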

Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1561828
Report Number(s):
SAND--2016-7742R; 646508
Country of Publication:
United States
Language:
English

Similar Records

Robust methods tailored for non-Gaussian narrowband array processing
Thesis/Dissertation · Sat Dec 31 23:00:00 EST 1988 · OSTI ID:6038475

Efficient Speaker Verification Using Gaussian Mixture Model Component Clustering
Technical Report · Sat Mar 31 20:00:00 EDT 2012 · OSTI ID:1039402

Likelihood Maximization and Moment Matching in Low SNR Gaussian Mixture Models
Journal Article · Sun Apr 17 00:00:00 EDT 2022 · Communications on Pure and Applied Mathematics · OSTI ID:2424512

Related Subjects