Kira: Processing Astronomy Imagery Using Big Data Technology
Journal Article · IEEE Transactions on Big Data
- Univ. of California, Berkeley, CA (United States); UC Berkeley
- Univ. of California, Berkeley, CA (United States)
- Univ. of Chicago, IL (United States)
Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, HPC tools are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark, a modern platform for data-intensive computing, to parallelize many-task applications. We implement Kira, a flexible and distributed astronomy image processing toolkit, and its Source Extractor (Kira SE) application. Using Kira SE as a case study, we examine the programming flexibility, dataflow richness, scheduling capacity, and performance of Apache Spark running on the Amazon EC2 cloud. By exploiting data locality, Kira SE achieves a 4.1× speedup over an equivalent C program when analyzing a 1 TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, Kira SE on the Amazon EC2 cloud achieves a 1.8× speedup over the C program on the NERSC Edison supercomputer. A 128-core Amazon EC2 cloud deployment of Kira SE using Spark Streaming can achieve second-scale latency with a sustained throughput of 800 MB/s. Our experience with Kira demonstrates that data-intensive computing platforms like Apache Spark are a performant alternative for many-task scientific applications.
- Research Organization:
- Univ. of California, Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- Grant/Contract Number:
- AC02-05CH11231; SC0012463
- OSTI ID:
- 1802358
- Journal Information:
- IEEE Transactions on Big Data, Vol. 6, Issue 2; ISSN 2332-7790
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English