Timely Reporting of Heavy Hitters Using External Memory
Journal Article · ACM Transactions on Database Systems
- Williams College, Williamstown, MA (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)
- Stony Brook Univ., NY (United States)
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Rutgers Univ., Piscataway, NJ (United States)
- VMware Research, Palo Alto, CA (United States)
Given an input stream S of size N, a Φ-heavy hitter is an item that occurs at least ΦN times in S. The problem of finding heavy hitters is extensively studied in the database literature. In this work, we study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = ΦN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large, high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU-bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.
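The problem statement above can be illustrated with a minimal exact counter: report each item the moment its count reaches the threshold T = ΦN. This sketch uses the Ω(N)-word in-RAM approach that the abstract says timely, false-positive-free reporting requires; it is only an illustration of the TED reporting requirement, not the paper's external-memory data structure. The function name `ted_exact` and its interface are illustrative assumptions, not from the paper.

```python
from collections import defaultdict

def ted_exact(stream, phi, n):
    """Report each Φ-heavy hitter immediately at its T = Φ·N-th occurrence.

    Exact counting over the whole stream (Ω(N) words of space), which is
    what zero-false-positive, zero-delay reporting demands in RAM. The
    paper's contribution is achieving the same guarantees in external
    memory with only I/O-bandwidth-limited, logarithmic overhead.
    """
    t = phi * n  # reporting threshold T = Φ·N
    counts = defaultdict(int)
    reports = []
    for i, item in enumerate(stream):
        counts[item] += 1
        if counts[item] == t:
            # Exactly the T-th occurrence: report once, with no delay
            # and no false positives or false negatives.
            reports.append((i, item))
    return reports

# With Φ = 0.5 over a stream of N = 6 items, T = 3, so 'a' is
# reported at the position of its third occurrence (index 4).
print(ted_exact(['a', 'b', 'a', 'c', 'a', 'b'], 0.5, 6))  # → [(4, 'a')]
```

A naive external-memory version of this counter would incur one random I/O per observation, which is the ≈100K observations/second bottleneck the abstract contrasts against.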
- Research Organization:
- Sandia National Laboratories (SNL-CA), Livermore, CA (United States); Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization:
- National Science Foundation (NSF); USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- Grant/Contract Number:
- AC02-05CH11231; NA0003525
- OSTI ID:
- 1830533
- Report Number(s):
- SAND--2021-14465J; 701551
- Journal Information:
- ACM Transactions on Database Systems, Vol. 46, Issue 4; ISSN 0362-5915
- Publisher:
- Association for Computing Machinery (ACM)
- Country of Publication:
- United States
- Language:
- English
Similar Records
- MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection · Conference · 2017 · OSTI ID: 1526315
- Parallel implementation of the Dirac equation in three Cartesian dimensions · Conference · 1994 · OSTI ID: 10180431
- Streaming Compression of Hexahedral Meshes · Conference · 2010 · OSTI ID: 986614