* add `-Dlog4j.configuration=<location of configuration file>` to `spark.driver.extraJavaOptions` (for the driver) or `spark.executor.extraJavaOptions` (for executors). Note that if using a file, the `file:` protocol should be explicitly provided, and the file needs to exist locally on all the nodes.
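For example, a minimal `spark-submit` invocation wiring both options to a local file might look like this (the path, class, and jar name are illustrative placeholders, not values from this document):

```shell
# Illustrative only: paths, class name, and jar name are placeholders.
# Note the explicit file: prefix, and that log4j.properties must exist
# at this path on the driver machine and on every executor node.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties" \
  --class com.example.MyApp \
  my-app.jar
```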
| `spark.files.maxPartitionBytes` | 134217728 (128 MB) | The maximum number of bytes to pack into a single partition when reading files. |
| `spark.files.openCostInBytes` | 4194304 (4 MB) | The estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. This is used when packing multiple files into a partition. It is better to over-estimate; that way, the partitions with small files will be faster than partitions with bigger files. |
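To see how the two settings above interact, here is a rough Python sketch of the packing idea (not Spark's actual code): each file is charged an extra `openCostInBytes` on top of its real size, and partitions are capped at `maxPartitionBytes` of padded size.

```python
def pack_files(file_sizes, max_partition_bytes=128 * 1024 * 1024,
               open_cost_in_bytes=4 * 1024 * 1024):
    """Greedily bin files into partitions by padded size (illustrative only).

    Each file costs its size plus open_cost_in_bytes, so many tiny files
    do not all collapse into one partition; a partition is closed once
    adding the next file would exceed max_partition_bytes.
    """
    partitions, current, current_bytes = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        padded = size + open_cost_in_bytes  # opening a file "costs" bytes too
        if current and current_bytes + padded > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += padded
    if current:
        partitions.append(current)
    return partitions
```

Over-estimating the open cost biases the packing toward fewer small files per partition, which keeps small-file partitions quick relative to partitions holding large files.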
...
...
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. When there is no unprocessed data on any idle executor, Spark falls back to a lower locality level. There are two options: a) wait for a busy CPU to free up so a task can start on data on the same server, or b) immediately start a new task somewhere farther away, which requires moving the data there.
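How long Spark waits in option (a) before falling back is governed by the `spark.locality.wait` setting and its per-level variants. A minimal sketch of tuning it (the values and jar name below are illustrative, not recommendations):

```shell
# Illustrative values only (the default base wait is 3s).
# spark.locality.wait.process, .node, and .rack override the base value
# for those locality levels; lowering them makes Spark give up on
# locality sooner (option b), raising them favors waiting (option a).
spark-submit \
  --conf spark.locality.wait=3s \
  --conf spark.locality.wait.node=1s \
  my-app.jar   # placeholder application jar
```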
An alternative to receiving data with multiple input streams / receivers is to explicitly repartition the input data stream (using `inputStream.repartition(<number of partitions>)`). This distributes the received batches of data across the specified number of machines in the cluster before further processing.
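Conceptually, repartitioning spreads the received blocks evenly across the target number of partitions. A toy Python sketch of that round-robin effect (not Spark's actual implementation):

```python
def redistribute(blocks, num_partitions):
    """Round-robin a list of received blocks into num_partitions bins,
    mimicking the balancing effect inputStream.repartition(n) aims for."""
    partitions = [[] for _ in range(num_partitions)]
    for i, block in enumerate(blocks):
        partitions[i % num_partitions].append(block)
    return partitions
```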
### Level of Parallelism in Data Processing