提交 0c809285 编写于 作者: K kl0u 提交者: Robert Metzger

[FLINK-4133] Reflect streaming file source changes in documentation

This closes #2198
上级 8a2c03e1
......@@ -1310,21 +1310,32 @@ Data Sources
<br />
Sources can by created by using `StreamExecutionEnvironment.addSource(sourceFunction)`.
You can either use one of the source functions that come with Flink or write a custom source
by implementing the `SourceFunction` for non-parallel sources, or by implementing the
`ParallelSourceFunction` interface or extending `RichParallelSourceFunction` for parallel sources.
Sources are where your program reads its input from. You can attach a source to your program by
using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
source functions, but you can always write your own custom sources by implementing the `SourceFunction`
for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
`RichParallelSourceFunction` for parallel sources.
There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
File-based:
- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.
- `readFile(path)` / Any input format - Reads files as dictated by the input format.
- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the specified file input format.
- `readFileStream` - create a stream by appending elements when there are changes to a file
- `readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)` - This is the method called internally by the two previous ones. It reads files in the `path` based on the given `fileInputFormat`. Depending on the provided `watchType`, this source may periodically monitor (every `interval` ms) the path for new data (`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the `pathFilter`, the user can further exclude files from being processed.
*IMPLEMENTATION:*
Under the hood, Flink splits the file reading process into two sub-tasks, namely *directory monitoring* and *data reading*. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, **non-parallel** (parallelism = 1) task, while reading is performed by multiple tasks running in parallel. The parallelism of the latter is equal to the job parallelism. The role of the single monitoring task is to scan the directory (periodically or only once depending on the `watchType`), find the files to be processed, divide them in *splits*, and assign these splits to the downstream readers. The readers are the ones who will read the actual data. Each split is read by only one reader, while a reader can read muplitple splits, one-by-one.
*IMPORTANT NOTES:*
1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its contents are re-processed entirely. This can brake the "exactly-once" semantics, as appending data at the end of a file will lead to **all** its contents being re-processed.
2. If the `watchType` is set to `FileProcessingMode.PROCESS_ONCE`, the source scans the path **once** and exits, without waiting for the readers to finish reading the file contents. Of course the readers will continue reading until all file contents are read. Closing the source leads to no more checkpoints after that point. This may lead to slower recovery after a node failure, as the job will resume reading from the last checkpoint.
Socket-based:
- `socketTextStream` - Reads from a socket. Elements can be separated by a delimiter.
......@@ -1357,20 +1368,31 @@ Custom:
<br />
Sources can by created by using `StreamExecutionEnvironment.addSource(sourceFunction)`.
You can either use one of the source functions that come with Flink or write a custom source
by implementing the `SourceFunction` for non-parallel sources, or by implementing the
`ParallelSourceFunction` interface or extending `RichParallelSourceFunction` for parallel sources.
Sources are where your program reads its input from. You can attach a source to your program by
using `StreamExecutionEnvironment.addSource(sourceFunction)`. Flink comes with a number of pre-implemented
source functions, but you can always write your own custom sources by implementing the `SourceFunction`
for non-parallel sources, or by implementing the `ParallelSourceFunction` interface or extending the
`RichParallelSourceFunction` for parallel sources.
There are several predefined stream sources accessible from the `StreamExecutionEnvironment`:
File-based:
- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.
- `readFile(path)` / Any input format - Reads files as dictated by the input format.
- `readFile(fileInputFormat, path)` - Reads (once) files as dictated by the specified file input format.
- `readFileStream` - create a stream by appending elements when there are changes to a file
- `readFile(fileInputFormat, path, watchType, interval, pathFilter)` - This is the method called internally by the two previous ones. It reads files in the `path` based on the given `fileInputFormat`. Depending on the provided `watchType`, this source may periodically monitor (every `interval` ms) the path for new data (`FileProcessingMode.PROCESS_CONTINUOUSLY`), or process once the data currently in the path and exit (`FileProcessingMode.PROCESS_ONCE`). Using the `pathFilter`, the user can further exclude files from being processed.
*IMPLEMENTATION:*
Under the hood, Flink splits the file reading process into two sub-tasks, namely *directory monitoring* and *data reading*. Each of these sub-tasks is implemented by a separate entity. Monitoring is implemented by a single, **non-parallel** (parallelism = 1) task, while reading is performed by multiple tasks running in parallel. The parallelism of the latter is equal to the job parallelism. The role of the single monitoring task is to scan the directory (periodically or only once depending on the `watchType`), find the files to be processed, divide them in *splits*, and assign these splits to the downstream readers. The readers are the ones who will read the actual data. Each split is read by only one reader, while a reader can read muplitple splits, one-by-one.
*IMPORTANT NOTES:*
1. If the `watchType` is set to `FileProcessingMode.PROCESS_CONTINUOUSLY`, when a file is modified, its contents are re-processed entirely. This can brake the "exactly-once" semantics, as appending data at the end of a file will lead to **all** its contents being re-processed.
2. If the `watchType` is set to `FileProcessingMode.PROCESS_ONCE`, the source scans the path **once** and exits, without waiting for the readers to finish reading the file contents. Of course the readers will continue reading until all file contents are read. Closing the source leads to no more checkpoints after that point. This may lead to slower recovery after a node failure, as the job will resume reading from the last checkpoint.
Socket-based:
......
......@@ -207,7 +207,7 @@ public class ContinuousFileReaderOperator<OUT, S extends Serializable> extends A
Tuple3<List<FileInputSplit>, FileInputSplit, S> restoredState) {
this.format = checkNotNull(format, "Unspecified FileInputFormat.");
this.serializer = checkNotNull(serializer, "Unspecified Serialized.");
this.serializer = checkNotNull(serializer, "Unspecified Serializer.");
this.pendingSplits = new LinkedList<>();
this.collector = collector;
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册