From 2190c6d5bda2465993bae1d1486bf259a8c5a0ee Mon Sep 17 00:00:00 2001
From: Kostas Kloudas
Date: Wed, 7 Nov 2018 16:27:02 +0100
Subject: [PATCH] [FLINK-10803][docs] Update the StreamingFileSink documentation for S3.

---
 docs/dev/connectors/streamfile_sink.md | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/docs/dev/connectors/streamfile_sink.md b/docs/dev/connectors/streamfile_sink.md
index aea66c3cc48..8f50675ccbc 100644
--- a/docs/dev/connectors/streamfile_sink.md
+++ b/docs/dev/connectors/streamfile_sink.md
@@ -24,16 +24,25 @@ under the License.
 -->
 
 This connector provides a Sink that writes partitioned files to filesystems
-supported by the Flink `FileSystem` abstraction. Since in streaming the input
-is potentially infinite, the streaming file sink writes data into buckets. The
-bucketing behaviour is configurable but a useful default is time-based
+supported by the [Flink `FileSystem` abstraction]({{ site.baseurl }}/ops/filesystems.html).
+
+Important Note: For S3, the `StreamingFileSink` supports only the
+[Hadoop-based](https://hadoop.apache.org/) FileSystem implementation, not the one
+based on [Presto](https://prestodb.io/). If your job uses the `StreamingFileSink`
+to write to S3 but relies on the Presto-based implementation for checkpointing, it
+is advised to explicitly use *"s3a://"* (for Hadoop) as the scheme for the sink's
+target path and *"s3p://"* (for Presto) for checkpointing. Using *"s3://"* for both
+may lead to unpredictable behavior, as both implementations "listen" to that scheme.
+
+Since in streaming the input is potentially infinite, the streaming file sink writes data
+into buckets. The bucketing behaviour is configurable but a useful default is time-based
 bucketing where we start writing a new bucket every hour and thus get
 individual files that each contain a part of the infinite output stream.
 
 Within a bucket, we further split the output into smaller part files based on a
 rolling policy. This is useful to prevent individual bucket files from getting
 too big. This is also configurable but the default policy rolls files based on
-file size and a timeout, i.e if no new data was written to a part file.
+file size and a timeout, *i.e.*, if no new data was written to a part file.
 
 The `StreamingFileSink` supports both row-wise encoding formats and
 bulk-encoding formats, such as [Apache Parquet](http://parquet.apache.org).
--
GitLab
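
For illustration, a minimal sketch of the scheme split the patched note recommends: the sink writes through the Hadoop-based *"s3a://"* FileSystem while checkpoints go through the Presto-based *"s3p://"* one. The bucket and path names are hypothetical, and the sketch assumes the Flink 1.7-era `StreamingFileSink` and `FsStateBackend` APIs that this patch targets; it is not part of the patch itself.

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class S3SchemesExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoints use the Presto-based S3 FileSystem via the "s3p://" scheme
        // (bucket name is hypothetical).
        env.enableCheckpointing(60_000L);
        env.setStateBackend(new FsStateBackend("s3p://my-bucket/checkpoints"));

        // The sink writes through the Hadoop-based S3 FileSystem via "s3a://",
        // the only S3 implementation the StreamingFileSink supports.
        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3a://my-bucket/output"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();

        env.fromElements("a", "b", "c").addSink(sink);
        env.execute("streaming-file-sink-s3");
    }
}
```

Because both the Hadoop and Presto implementations register themselves for the plain *"s3://"* scheme, spelling out *"s3a://"* and *"s3p://"* pins each path to exactly one implementation, which is the behavior the note above is warning about.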