Commit 84384703 authored by Rick Chao, committed by TensorFlower Gardener

PSv2: Docstring minor rephrase and typo/example corrections.

PiperOrigin-RevId: 339929158
Change-Id: I8592aa6e2cec32a2ba97743a6f022f263f0f65e2
Parent 60ac36f5
@@ -291,11 +291,11 @@ class PerWorkerValues(object):
"""A container that holds a list of values, one value per worker.
`tf.distribute.experimental.coordinator.PerWorkerValues` contains a collection
-of values, where each of the value is located one worker respectively, and
-upon being used as one of the `args` or `kwargs` of
+of values, where each of the values is located on its corresponding worker,
+and upon being used as one of the `args` or `kwargs` of
`tf.distribute.experimental.coordinator.ClusterCoordinator.schedule()`, the
value specific to a worker will be passed into the function being executed at
-that particular worker.
+that corresponding worker.
Currently, the only supported path to create an object of
`tf.distribute.experimental.coordinator.PerWorkerValues` is through calling
@@ -948,14 +948,13 @@ class ClusterCoordinator(object):
failed worker, it will be added for function execution after datasets created
by `create_per_worker_dataset` are re-built on it.
-When a parameter server the coordinator fails, a
-`tf.errors.UnavailableError` is raised by `schedule`, `join` or `done`. In
-this case, in addition to bringing back the failed parameter server, users
-should restart the coordinator to so that it reconnects to the parameter
-server, re-creates the variables and loads checkpoints. If the coordinator
-fails, users need to bring it back as well. The program will automatically
-connect to the parameter servers and workers, and continue the progress from a
-checkpoint.
+When a parameter server fails, a `tf.errors.UnavailableError` is raised by
+`schedule`, `join` or `done`. In this case, in addition to bringing back the
+failed parameter server, users should restart the coordinator so that it
+reconnects to workers and parameter servers, re-creates the variables, and
+loads checkpoints. If the coordinator fails, after the user brings it back,
+the program will automatically connect to workers and parameter servers, and
+continue the progress from a checkpoint.
It is thus essential that in user's program, a checkpoint file is periodically
saved, and restored at the start of the program. If an
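To make the checkpointing pattern above concrete, here is a rough sketch of what the coordinator program might do; `model`, `optimizer`, `step_fn`, `per_worker_iterator`, the directory, and the step counts are hypothetical placeholders, not part of this change:

```
import tensorflow as tf

# Hypothetical objects created earlier under strategy.scope(); placeholders.
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, directory="/tmp/ckpt",
                                     max_to_keep=3)

# Restore at startup so a restarted coordinator resumes from the latest
# checkpoint instead of starting over.
checkpoint.restore(manager.latest_checkpoint)

for step in range(10000):
  coordinator.schedule(step_fn, args=(per_worker_iterator,))
  if step % 100 == 0:
    coordinator.join()  # wait for all scheduled functions to finish
    manager.save()      # periodically persist the training state
```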
@@ -1137,7 +1136,7 @@ class ClusterCoordinator(object):
def per_worker_dataset_fn():
  return strategy.distribute_datasets_from_function(
-      lambda x: tf.data.from_tensor_slices([3] * 3)
+      lambda x: tf.data.Dataset.from_tensor_slices([3] * 3))
per_worker_dataset = coordinator.create_per_worker_dataset(
    per_worker_dataset_fn)
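A rough sketch of how the per-worker dataset created above is typically consumed (the trivial `worker_fn` here is a made-up placeholder, not part of this change):

```
@tf.function
def worker_fn(iterator):
  # Each worker receives its own iterator out of the PerWorkerValues container.
  return next(iterator)

per_worker_iter = iter(per_worker_dataset)  # a PerWorkerValues of iterators
result = coordinator.schedule(worker_fn, args=(per_worker_iter,))
coordinator.join()
print(result.fetch())  # the value computed on whichever worker ran the function
```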
@@ -52,22 +52,22 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
synchronizing with each other. Under this configuration, it is known as
asynchronous training.
-In TensorFlow 2, we recommend a central coordiantion-based architecture for
-parameter server training, where workers and parameter servers run a
-`tf.distribute.Server` and there is another task that creates resources on
-workers and parameter servers, dispatches functions, and coordinates the
-training. We refer to this task as “coordinator”. The coordinator uses a
+In TensorFlow 2, we recommend an architecture based on central coordination
+for parameter server training. Each worker and parameter server runs a
+`tf.distribute.Server`, and on top of that, a coordinator task is responsible
+for creating resources on workers and parameter servers, dispatching
+functions, and coordinating the training. The coordinator uses a
`tf.distribute.experimental.coordinator.ClusterCoordinator` to coordinate the
cluster, and a `tf.distribute.experimental.ParameterServerStrategy` to define
variables on parameter servers and computation on workers.
For the training to work, the coordinator dispatches `tf.function`s to be
-executed on remote workers. Upon receiving requests from
-the coordinator, a worker executes the `tf.function` by reading the variables
-from parameter servers, executing the ops, and updating the variables on the
-parameter servers. Each of the worker only processes the requests from the
-coordinator, and communicates with parameter servers, without direct
-interactions with other workers in the cluster.
+executed on remote workers. Upon receiving requests from the coordinator, a
+worker executes the `tf.function` by reading the variables from parameter
+servers, executing the ops, and updating the variables on the parameter
+servers. Each of the worker only processes the requests from the coordinator,
+and communicates with parameter servers, without direct interactions with
+other workers in the cluster.
As a result, failures of some workers do not prevent the cluster from
continuing the work, and this allows the cluster to train with instances that
@@ -77,7 +77,7 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
Note that the coordinator is not one of the training workers. Instead, it
creates resources such as variables and datasets, dispatchs `tf.function`s,
-saving checkpoints and so on. In addition to workers, parameter servers and
+saves checkpoints and so on. In addition to workers, parameter servers and
the coordinator, an optional evaluator can be run on the side that
periodically reads the checkpoints saved by the coordinator and runs
evaluations against each checkpoint.
@@ -226,8 +226,8 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
```
Alternatively, you can also start a bunch of TensorFlow servers in advance and
connect to them later. The coordinator can be in the same cluster or on any
-machine that has connectivity to workers and parameter server. This is covered
-in our guide and tutorial.
+machine that has connectivity to workers and parameter servers. This is
+covered in our guide and tutorial.
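For illustration, a rough sketch of starting such a server in advance on a single task; the addresses, job name, and task index are made-up placeholders (real deployments typically derive them from `TF_CONFIG` or a cluster resolver):

```
import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"],
})

# Run on each worker / parameter server machine with its own job_name and
# task_index; the process then blocks and serves requests from the coordinator.
server = tf.distribute.Server(
    cluster_spec, job_name="worker", task_index=0, protocol="grpc")
server.join()
```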
__Variable creation with `strategy.scope()`__
@@ -270,9 +270,9 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
"shard" the variables across the ps. Partitioning large variable among ps is a
commonly used technique to boost training throughput and mitigate memory
constraints. It enables parallel computations and updates on different shards
-of a variable, and often yields better load balancing across parameter servers
-. Without sharding, models with large variables (e.g, embeddings) that can't
-fit into one machine's memory would otherwise be unable to train.
+of a variable, and often yields better load balancing across parameter
+servers. Without sharding, models with large variables (e.g, embeddings) that
+can't fit into one machine's memory would otherwise be unable to train.
With `tf.distribute.experimental.ParameterServerStrategy`, if a
`variable_partitioner` is provided to `__init__` and certain conditions are
@@ -294,40 +294,41 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
return x * self.w
# Partition the dense layer into 2 shards.
-variable_partitioiner = (
+variable_partitioner = (
tf.distribute.experimental.partitioners.FixedShardsPartitioner(
num_shards = 2))
-strategy = ParameterServerStrategy(cluster_resolver=...,
+strategy = tf.distribute.experimental.ParameterServerStrategy(
+    cluster_resolver=...,
variable_partitioner = variable_partitioner)
with strategy.scope():
dense = Dense()
assert len(dense.variables) == 2
assert isinstance(dense.variables[0], tf.Variable)
assert isinstance(dense.variables[1], tf.Variable)
assert dense.variables[0].name == "w/part_0"
assert dense.variables[1].name == "w/part_1"
assert dense.variables[0].shape == (50, 10)
assert dense.variables[1].shape == (50, 10)
```
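As a side note, `FixedShardsPartitioner` above is only one option; a size-based partitioner can be used instead. The thresholds below are made-up values for illustration:

```
# Shard a variable only once it exceeds a minimum size, capped at two shards.
variable_partitioner = (
    tf.distribute.experimental.partitioners.MinSizePartitioner(
        min_shard_bytes=256 << 10, max_shards=2))
```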
The sharded variable container can be converted to a `Tensor` via
`tf.convert_to_tensor`. This means the container can be directly used in most
-Python Ops where such `Tensor` convertion automatically happens. For example
+Python Ops where such `Tensor` conversion automatically happens. For example,
in the above code snippet, `x * self.w` would implicitly apply the said tensor
-convertion. Note that such convertion can be expensive, as the variable
+conversion. Note that such conversion can be expensive, as the variable
components need to be transferred from multiple parameter servers to where
the value is used.
-`tf.nn.embedding_lookup` on the other hand doesn't apply the tensor convertion
-, and performs parallel lookups on the variable components instead. This is
-crutial to scale up embedding lookups when the embedding table variable is
-large.
+`tf.nn.embedding_lookup` on the other hand doesn't apply the tensor
+conversion, and performs parallel lookups on the variable components instead.
+This is crucial to scale up embedding lookups when the embedding table
+variable is large.
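A rough illustration of this difference (the table shape and `ids` below are made up):

```
with strategy.scope():
  # With a variable_partitioner configured, this may be created as a sharded
  # variable whose components live on different parameter servers.
  emb = tf.Variable(tf.random.uniform([1000, 16]), name="emb")

ids = tf.constant([0, 3, 7])
# Looks up rows shard by shard, without materializing the whole table.
vectors = tf.nn.embedding_lookup(emb, ids)

# In contrast, tf.convert_to_tensor(emb), or implicit conversion in an op
# such as `emb * 2.0`, first gathers all components to the caller.
dense_copy = tf.convert_to_tensor(emb)
```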
-When a partitioned variable is saved to `SavedModel`, it will be saved as if
+When a partitioned variable is saved to a `SavedModel`, it will be saved as if
it is one single variable. This improves serving efficiency by eliminating
a number of Ops that handle the partiton aspects.
Known limitations of variable partitioning:
-* Number of parttions must not change across Checkpoint save/load.
+* Number of partitions must not change across Checkpoint saving/loading.
* After saving partitioned variables to a SavedModel, the SavedModel can't be
loaded via `tf.saved_model.load`.
@@ -358,7 +359,6 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
coordinator =
tf.distribute.experimental.coordinator.ClusterCoordinator(strategy=...)
distributed_dataset = coordinator.create_per_worker_dataset(dataset_fn)
```
__Limitations__
@@ -404,7 +404,7 @@ class ParameterServerStrategyV2(distribute_lib.Strategy):
* `variable_partitioner` will be called for each variable created under
strategy `scope` to instruct how the variable should be partitioned.
Variables that have only one partition along the partitioning axis
-(i.e., no need for partition) will be created as normal `tf.Variable`.
+(i.e., no need for partition) will be created as a normal `tf.Variable`.
* Only the first / outermost axis partitioning is supported.