OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google Cloud Dataflow

Conference

Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams.

First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches:

• Computing a single correct answer, as in notifications.
• Reasoning about a lack of data, as in dip detection.
• Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models.
• Safely and punctually garbage collecting obsolete inputs and intermediate state.
• Surfacing a reliable signal of overall pipeline health.

Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1823361
Resource Relation:
Journal Volume: 14; Journal Issue: 12; Conference: 47th International Conference on Very Large Data Bases (VLDB) - Copenhagen, Denmark - 8/16/2021-8/20/2021
Country of Publication:
United States
Language:
English

Similar Records

Small, high-speed dataflow processor
Conference · Sat Jan 01 00:00:00 EST 1983 · OSTI ID:1823361

A Bernoulli Gaussian Watermark for Detecting Integrity Attacks in Control Systems
Conference · Thu Nov 02 00:00:00 EDT 2017 · OSTI ID:1823361

Implementing error-valued semantics for dataflow languages on imperative machines
Conference · Thu Aug 27 00:00:00 EDT 1992 · OSTI ID:1823361
