Keyed Watermarks: A Partition-aware Watermark Generation in Apache Flink

Organisatsiooni nimi
University of Tartu, Tartu, Estonia
Big stream processing frameworks such as Apache Flink provide rich APIs to build real-time data analytics applications. Theoretically, unbounded streams of data flow into the system and results are calculated. To tackle the unbounded nature, streams are divided into chunks by means of window operators. A window can be seen as a way to take a snapshot of the stream and apply a user-defined logic on its contents.
Time-based windows can be further sub-divided by partitioning them based on a (group) of data items in the stream. The common notions of time supported are processing time and event time. In the former case, the data arriving at the source of a stream processing application are timestamped by the wall-clock time of the streaming engine. In this case, the timestamps are in order. In the latter case, event-time processing, the timestamp is assigned by the data generator and application developer has to provide a means to extract its value from the data. In this case, the control of time progress is outside the engine. Moreover, it is common that data might arrive out-of-order. Out of order means that some events can arrive later than expected with respect to its event time. A common approach to let stream processing engines reason about the progress of event time is low watermarks. A watermark is merely a timestamp. When a watermark is received by an operator. It means that all data with a timestamp less than received watermark’s value should have been received by the system. Thus, any pending windows whose end timestamp is before the watermark value can be closed and their user-defined logic can be run. The technique of watermarking is supported by Apache Flink.
Currently, watermarks for a specific data source is partition-agnostic. That is, all logical windows, partitioned by a group of elements, have to be processed on the arrival of the next watermark. Considering IoT applications, we might have several devices that emit data but at different rates. So, it does not make sense to data items from high-rate devices waiting for watermarks that might progress relatively slowly due to low-rate devices. We need to implement a partition-aware watermark processing approach throughout all window operators in a Flink job thus, windows using the same partition-key as the watermark can progress independently.
Lõputöö kaitsmise aasta
Ahmed Awad
inglise keel
Nõuded kandideerijale

Kandideerimise kontakt

Ahmed Awad
Vaata lähemalt