U.S. Department of Energy
Office of Scientific and Technical Information

Kira: Processing Astronomy Imagery Using Big Data Technology

Journal Article · IEEE Transactions on Big Data
Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, HPC tools are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark, a modern platform for data-intensive computing, to parallelize many-task applications. We implement Kira, a flexible and distributed astronomy image processing toolkit, and its Source Extractor (Kira SE) application. Using Kira SE as a case study, we examine the programming flexibility, dataflow richness, scheduling capacity, and performance of Apache Spark running on the Amazon EC2 cloud. By exploiting data locality, Kira SE achieves a 4.1× speedup over an equivalent C program when analyzing a 1 TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, Kira SE on the Amazon EC2 cloud achieves a 1.8× speedup over the C program on the NERSC Edison supercomputer. A 128-core Amazon EC2 cloud deployment of Kira SE using Spark Streaming can achieve second-scale latency with a sustained throughput of 800 MB/s. Our experience with Kira demonstrates that data-intensive computing platforms like Apache Spark are a performant alternative for many-task scientific applications.
Research Organization:
Univ. of California, Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC02-05CH11231; SC0012463
OSTI ID:
1802358
Journal Information:
Journal Name: IEEE Transactions on Big Data; Journal Volume: 6; Journal Issue: 2; ISSN 2332-7790
Publisher:
IEEE
Country of Publication:
United States
Language:
English
