Commit 4f1b1282 authored by: wizardforcel

2019-01-27 17:13:16

Parent 59d88814
......@@ -11,7 +11,7 @@
+ [Transfer Learning Tutorial](transfer_learning_tutorial.md)
+ [Deploying a Seq2Seq Model with the Hybrid Frontend](deploy_seq2seq_hybrid_frontend_tutorial.md)
+ [Saving and Loading Models](saving_loading_models.md)
+ [What is <cite>torch.nn</cite> _really_?](nn_tutorial.md)
+ [What is `torch.nn` _really_?](nn_tutorial.md)
+ [Image](tut_image.md)
+ [Finetuning Torchvision Models](finetuning_torchvision_models_tutorial.md)
+ [Spatial Transformer Networks Tutorial](spatial_transformer_tutorial.md)
......@@ -64,6 +64,8 @@
+ [torch.cuda](cuda.md)
+ [torch.Storage](storage.md)
+ [torch.nn](nn.md)
+ [torch.nn.functional](nn_functional.md)
+ [torch.nn.init](nn_init.md)
+ [torch.optim](optim.md)
+ [Automatic differentiation package - torch.autograd](autograd.md)
+ [Distributed communication package - torch.distributed](distributed.md)
......
......@@ -54,7 +54,7 @@ class torch.autograd.no_grad
Context-manager that disables gradient calculation.
Disabling gradient calculation is useful for inference, when you are sure that you will not call `Tensor.backward()`. It will reduce memory consumption for computations that would otherwise have <cite>requires_grad=True</cite>. In this mode, the result of every computation will have <cite>requires_grad=False</cite>, even when the inputs have <cite>requires_grad=True</cite>.
Disabling gradient calculation is useful for inference, when you are sure that you will not call `Tensor.backward()`. It will reduce memory consumption for computations that would otherwise have `requires_grad=True`. In this mode, the result of every computation will have `requires_grad=False`, even when the inputs have `requires_grad=True`.
Also functions as a decorator.
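A minimal sketch of both usages, mirroring the behavior described above (assuming `torch` has been imported; `no_grad` is also exposed as `torch.no_grad`):

```py
>>> x = torch.zeros(1, requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
>>> y.requires_grad
False
>>> @torch.no_grad()
... def doubler(x):
...     return x * 2
>>> doubler(x).requires_grad
False
```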
......@@ -532,13 +532,13 @@ Example
When viewing a profile created using [`emit_nvtx`](#torch.autograd.profiler.emit_nvtx "torch.autograd.profiler.emit_nvtx") in the Nvidia Visual Profiler, correlating each backward-pass op with the corresponding forward-pass op can be difficult. To ease this task, [`emit_nvtx`](#torch.autograd.profiler.emit_nvtx "torch.autograd.profiler.emit_nvtx") appends sequence number information to the ranges it generates.
During the forward pass, each function range is decorated with `seq=<N>`. `seq` is a running counter, incremented each time a new backward Function object is created and stashed for backward. Thus, the <cite>seq=<N></cite> annotation associated with each forward function range tells you that if a backward Function object is created by this forward function, the backward object will receive sequence number N. During the backward pass, the top-level range wrapping each C++ backward Function’s `apply()` call is decorated with `stashed seq=<M>`. `M` is the sequence number that the backward object was created with. By comparing `stashed seq` numbers in backward with `seq` numbers in forward, you can track down which forward op created each backward Function.
During the forward pass, each function range is decorated with `seq=<N>`. `seq` is a running counter, incremented each time a new backward Function object is created and stashed for backward. Thus, the `seq=<N>` annotation associated with each forward function range tells you that if a backward Function object is created by this forward function, the backward object will receive sequence number N. During the backward pass, the top-level range wrapping each C++ backward Function’s `apply()` call is decorated with `stashed seq=<M>`. `M` is the sequence number that the backward object was created with. By comparing `stashed seq` numbers in backward with `seq` numbers in forward, you can track down which forward op created each backward Function.
Any functions executed during the backward pass are also decorated with `seq=<N>`. During default backward (with `create_graph=False`) this information is irrelevant, and in fact, `N` may simply be 0 for all such functions. Only the top-level ranges associated with backward Function objects’ `apply()` methods are useful, as a way to correlate these Function objects with the earlier forward pass.
**Double-backward**
If, on the other hand, a backward pass with `create_graph=True` is underway (in other words, if you are setting up for a double-backward), each function’s execution during backward is given a nonzero, useful `seq=<N>`. Those functions may themselves create Function objects to be executed later during double-backward, just as the original functions in the forward pass did. The relationship between backward and double-backward is conceptually the same as the relationship between forward and backward: The functions still emit current-sequence-number-tagged ranges, the Function objects they create still stash those sequence numbers, and during the eventual double-backward, the Function objects’ `apply()` ranges are still tagged with `stashed seq` numbers, which can be compared to <cite>seq</cite> numbers from the backward pass.
If, on the other hand, a backward pass with `create_graph=True` is underway (in other words, if you are setting up for a double-backward), each function’s execution during backward is given a nonzero, useful `seq=<N>`. Those functions may themselves create Function objects to be executed later during double-backward, just as the original functions in the forward pass did. The relationship between backward and double-backward is conceptually the same as the relationship between forward and backward: The functions still emit current-sequence-number-tagged ranges, the Function objects they create still stash those sequence numbers, and during the eventual double-backward, the Function objects’ `apply()` ranges are still tagged with `stashed seq` numbers, which can be compared to `seq` numbers from the backward pass.
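A minimal sketch of wrapping a forward/backward pass so it emits these ranges; the toy model and input are placeholders, and the script is assumed to be run under `nvprof` (the ranges only appear in a captured profile):

```py
import torch

model = torch.nn.Linear(4, 4).cuda()                  # placeholder model
x = torch.randn(2, 4, device='cuda', requires_grad=True)

with torch.cuda.profiler.profile():
    model(x)                                          # warm up the allocator and profiler
    with torch.autograd.profiler.emit_nvtx():
        out = model(x).sum()                          # forward ranges carry `seq=<N>`
        out.backward()                                # backward apply() ranges carry `stashed seq=<M>`
```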
```py
torch.autograd.profiler.load_nvprof(path)
......
......@@ -2,7 +2,7 @@
# torch.utils.bottleneck
<cite>torch.utils.bottleneck</cite> is a tool that can be used as an initial step for debugging bottlenecks in your program. It summarizes runs of your script with the Python profiler and PyTorch’s autograd profiler.
`torch.utils.bottleneck` is a tool that can be used as an initial step for debugging bottlenecks in your program. It summarizes runs of your script with the Python profiler and PyTorch’s autograd profiler.
Run it on the command line with
......@@ -11,7 +11,7 @@ python -m torch.utils.bottleneck /path/to/source/script.py [args]
```
where [args] are any number of arguments to <cite>script.py</cite>, or run `python -m torch.utils.bottleneck -h` for more usage instructions.
where [args] are any number of arguments to `script.py`, or run `python -m torch.utils.bottleneck -h` for more usage instructions.
Warning
......
......@@ -62,7 +62,7 @@ You can use both tensors and storages as arguments. If a given object is not all
torch.cuda.empty_cache()
```
Releases all unoccupied cached memory currently held by the caching allocator so that those can be used in other GPU application and visible in <cite>nvidia-smi</cite>.
Releases all unoccupied cached memory currently held by the caching allocator so that it can be used by other GPU applications and is visible in `nvidia-smi`.
Note
......@@ -141,7 +141,7 @@ Returns the current GPU memory usage by tensors in bytes for a given device.
Note
This is likely less than the amount shown in <cite>nvidia-smi</cite> since some unused memory can be held by the caching allocator and some context needs to be created on GPU. See [Memory management](notes/cuda.html#cuda-memory-management) for more details about GPU memory management.
This is likely less than the amount shown in `nvidia-smi` since some unused memory can be held by the caching allocator and some context needs to be created on GPU. See [Memory management](notes/cuda.html#cuda-memory-management) for more details about GPU memory management.
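For example, a small sketch of inspecting and releasing cached memory (the default CUDA device is assumed):

```py
>>> x = torch.randn(1024, 1024, device='cuda')
>>> torch.cuda.memory_allocated()   # bytes currently occupied by tensors
>>> torch.cuda.memory_cached()      # bytes held by the caching allocator
>>> del x
>>> torch.cuda.empty_cache()        # return unoccupied cached blocks to the driver
```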
```py
torch.cuda.memory_cached(device=None)
......@@ -488,7 +488,7 @@ Makes a given stream wait for the event.
torch.cuda.empty_cache()
```
Releases all unoccupied cached memory currently held by the caching allocator so that those can be used in other GPU application and visible in <cite>nvidia-smi</cite>.
Releases all unoccupied cached memory currently held by the caching allocator so that it can be used by other GPU applications and is visible in `nvidia-smi`.
Note
......@@ -505,7 +505,7 @@ Returns the current GPU memory usage by tensors in bytes for a given device.
Note
This is likely less than the amount shown in <cite>nvidia-smi</cite> since some unused memory can be held by the caching allocator and some context needs to be created on GPU. See [Memory management](notes/cuda.html#cuda-memory-management) for more details about GPU memory management.
This is likely less than the amount shown in `nvidia-smi` since some unused memory can be held by the caching allocator and some context needs to be created on GPU. See [Memory management](notes/cuda.html#cuda-memory-management) for more details about GPU memory management.
```py
torch.cuda.max_memory_allocated(device=None)
......
......@@ -164,7 +164,7 @@ In addition to `dist.all_reduce(tensor, op, group)`, there are a total of 6 coll
* `dist.scatter(tensor, src, scatter_list, group)`: Copies the `\(i^{\text{th}}\)` tensor `scatter_list[i]` to the `\(i^{\text{th}}\)` process.
* `dist.gather(tensor, dst, gather_list, group)`: Copies `tensor` from all processes in `dst`.
* `dist.all_gather(tensor_list, tensor, group)`: Copies `tensor` from all processes to `tensor_list`, on all processes.
* `dist.barrier(group)`: block all processes in <cite>group</cite> until each one has entered this function.
* `dist.barrier(group)`: block all processes in `group` until each one has entered this function.
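As a rough sketch, these collectives can be combined inside a `run(rank, size)`-style function; the multiprocessing and `init_process_group()` scaffolding from earlier in this tutorial is assumed to have run already:

```py
import torch
import torch.distributed as dist

def run(rank, size):
    """All-reduce a tensor across every process, then synchronize."""
    group = dist.new_group(list(range(size)))
    tensor = torch.ones(1)
    dist.all_reduce(tensor, op=dist.reduce_op.SUM, group=group)
    dist.barrier(group)
    print('Rank', rank, 'has data', tensor[0])
```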
## Distributed Training
......
......@@ -62,9 +62,9 @@ For the full list of NCCL environment variables, please refer to [NVIDIA NCCL’
## Basics
The <cite>torch.distributed</cite> package provides PyTorch support and communication primitives for multiprocess parallelism across several computation nodes running on one or more machines. The class [`torch.nn.parallel.DistributedDataParallel()`](nn.html#torch.nn.parallel.DistributedDataParallel "torch.nn.parallel.DistributedDataParallel") builds on this functionality to provide synchronous distributed training as a wrapper around any PyTorch model. This differs from the kinds of parallelism provided by [Multiprocessing package - torch.multiprocessing](multiprocessing.html) and [`torch.nn.DataParallel()`](nn.html#torch.nn.DataParallel "torch.nn.DataParallel") in that it supports multiple network-connected machines and in that the user must explicitly launch a separate copy of the main training script for each process.
The `torch.distributed` package provides PyTorch support and communication primitives for multiprocess parallelism across several computation nodes running on one or more machines. The class [`torch.nn.parallel.DistributedDataParallel()`](nn.html#torch.nn.parallel.DistributedDataParallel "torch.nn.parallel.DistributedDataParallel") builds on this functionality to provide synchronous distributed training as a wrapper around any PyTorch model. This differs from the kinds of parallelism provided by [Multiprocessing package - torch.multiprocessing](multiprocessing.html) and [`torch.nn.DataParallel()`](nn.html#torch.nn.DataParallel "torch.nn.DataParallel") in that it supports multiple network-connected machines and in that the user must explicitly launch a separate copy of the main training script for each process.
In the single-machine synchronous case, <cite>torch.distributed</cite> or the [`torch.nn.parallel.DistributedDataParallel()`](nn.html#torch.nn.parallel.DistributedDataParallel "torch.nn.parallel.DistributedDataParallel") wrapper may still have advantages over other approaches to data-parallelism, including [`torch.nn.DataParallel()`](nn.html#torch.nn.DataParallel "torch.nn.DataParallel"):
In the single-machine synchronous case, `torch.distributed` or the [`torch.nn.parallel.DistributedDataParallel()`](nn.html#torch.nn.parallel.DistributedDataParallel "torch.nn.parallel.DistributedDataParallel") wrapper may still have advantages over other approaches to data-parallelism, including [`torch.nn.DataParallel()`](nn.html#torch.nn.DataParallel "torch.nn.DataParallel"):
* Each process maintains its own optimizer and performs a complete optimization step with each iteration. While this may appear redundant, since the gradients have already been gathered together and averaged across processes and are thus the same for every process, this means that no parameter broadcast step is needed, reducing time spent transferring tensors between nodes.
* Each process contains an independent Python interpreter, eliminating the extra interpreter overhead and “GIL-thrashing” that comes from driving several execution threads, model replicas, or GPUs from a single Python process. This is especially important for models that make heavy use of the Python runtime, including models with recurrent layers or many small components.
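A minimal per-process sketch of the wrapper described above; `MyModel` is a placeholder, and environment-variable initialization (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) is assumed:

```py
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend='nccl', init_method='env://')

model = MyModel().cuda()                              # MyModel is a placeholder
model = nn.parallel.DistributedDataParallel(model)    # gradients are averaged across processes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```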
......@@ -219,7 +219,7 @@ This is the default method, meaning that `init_method` does not have to be speci
By default collectives operate on the default group (also called the world) and require all processes to enter the distributed function call. However, some workloads can benefit from more fine-grained communication. This is where distributed groups come into play. [`new_group()`](#torch.distributed.new_group "torch.distributed.new_group") function can be used to create new groups, with arbitrary subsets of all processes. It returns an opaque group handle that can be given as a `group` argument to all collectives (collectives are distributed functions to exchange information in certain well-known programming patterns).
Currently <cite>torch.distributed</cite> does not support creating groups with different backends. In other words, each group being created will use the same backend as you specified in [`init_process_group()`](#torch.distributed.init_process_group "torch.distributed.init_process_group").
Currently `torch.distributed` does not support creating groups with different backends. In other words, each group being created will use the same backend as you specified in [`init_process_group()`](#torch.distributed.init_process_group "torch.distributed.init_process_group").
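For example, a sketch of creating a sub-group of ranks 0 and 1 and reducing only within it (the full signature follows below; `init_process_group()` is assumed to have been called on every rank):

```py
import torch
import torch.distributed as dist

group = dist.new_group(ranks=[0, 1])
tensor = torch.ones(1)
if dist.get_rank() in (0, 1):
    dist.all_reduce(tensor, group=group)   # only ranks 0 and 1 participate
```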
```py
torch.distributed.new_group(ranks=None, timeout=datetime.timedelta(seconds=1800))
......@@ -627,9 +627,9 @@ Only nccl backend is currently supported tensors should only be GPU tensors
## Launch utility
The <cite>torch.distributed</cite> package also provides a launch utility in <cite>torch.distributed.launch</cite>. This helper utility can be used to launch multiple processes per node for distributed training. This utility also supports both python2 and python3.
The `torch.distributed` package also provides a launch utility in `torch.distributed.launch`. This helper utility can be used to launch multiple processes per node for distributed training. This utility also supports both python2 and python3.
<cite>torch.distributed.launch</cite> is a module that spawns up multiple distributed training processes on each of the training nodes.
`torch.distributed.launch` is a module that spawns up multiple distributed training processes on each of the training nodes.
The utility can be used for single-node distributed training, in which one or more processes per node will be spawned. It can be used for either CPU or GPU training. If used for GPU training, each distributed process operates on a single GPU, which can significantly improve single-node training performance. It can also be used for multi-node distributed training by spawning multiple processes on each node, likewise improving multi-node training performance. This is especially beneficial for systems with multiple InfiniBand interfaces that have direct-GPU support, since all of them can be utilized for aggregated communication bandwidth.
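A minimal sketch of a training script meant to be started with the launcher (e.g. `python -m torch.distributed.launch --nproc_per_node=NUM_GPUS script.py`); the script name is an assumption, and the `--local_rank` argument is the one the launcher fills in for each spawned process:

```py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)   # supplied by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)                      # one GPU per process
dist.init_process_group(backend='nccl', init_method='env://')
```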
......
......@@ -26,9 +26,9 @@ Currently torch.distributed.deprecated supports four backends, each with differe
## Basics
The <cite>torch.distributed.deprecated</cite> package provides PyTorch support and communication primitives for multiprocess parallelism across several computation nodes running on one or more machines. The class `torch.nn.parallel.deprecated.DistributedDataParallel()` builds on this functionality to provide synchronous distributed training as a wrapper around any PyTorch model. This differs from the kinds of parallelism provided by [Multiprocessing package - torch.multiprocessing](multiprocessing.html) and [`torch.nn.DataParallel()`](nn.html#torch.nn.DataParallel "torch.nn.DataParallel") in that it supports multiple network-connected machines and in that the user must explicitly launch a separate copy of the main training script for each process.
The `torch.distributed.deprecated` package provides PyTorch support and communication primitives for multiprocess parallelism across several computation nodes running on one or more machines. The class `torch.nn.parallel.deprecated.DistributedDataParallel()` builds on this functionality to provide synchronous distributed training as a wrapper around any PyTorch model. This differs from the kinds of parallelism provided by [Multiprocessing package - torch.multiprocessing](multiprocessing.html) and [`torch.nn.DataParallel()`](nn.html#torch.nn.DataParallel "torch.nn.DataParallel") in that it supports multiple network-connected machines and in that the user must explicitly launch a separate copy of the main training script for each process.
In the single-machine synchronous case, <cite>torch.distributed.deprecated</cite> or the `torch.nn.parallel.deprecated.DistributedDataParallel()` wrapper may still have advantages over other approaches to data-parallelism, including [`torch.nn.DataParallel()`](nn.html#torch.nn.DataParallel "torch.nn.DataParallel"):
In the single-machine synchronous case, `torch.distributed.deprecated` or the `torch.nn.parallel.deprecated.DistributedDataParallel()` wrapper may still have advantages over other approaches to data-parallelism, including [`torch.nn.DataParallel()`](nn.html#torch.nn.DataParallel "torch.nn.DataParallel"):
* Each process maintains its own optimizer and performs a complete optimization step with each iteration. While this may appear redundant, since the gradients have already been gathered together and averaged across processes and are thus the same for every process, this means that no parameter broadcast step is needed, reducing time spent transferring tensors between nodes.
* Each process contains an independent Python interpreter, eliminating the extra interpreter overhead and “GIL-thrashing” that comes from driving several execution threads, model replicas, or GPUs from a single Python process. This is especially important for models that make heavy use of the Python runtime, including models with recurrent layers or many small components.
......@@ -457,9 +457,9 @@ Only NCCL backend is currently supported. `output_tensor_lists` and `input_tenso
## Launch utility
The <cite>torch.distributed.deprecated</cite> package also provides a launch utility in <cite>torch.distributed.deprecated.launch</cite>.
The `torch.distributed.deprecated` package also provides a launch utility in `torch.distributed.deprecated.launch`.
<cite>torch.distributed.launch</cite> is a module that spawns up multiple distributed training processes on each of the training nodes.
`torch.distributed.launch` is a module that spawns up multiple distributed training processes on each of the training nodes.
The utility can be used for single-node distributed training, in which one or more processes per node will be spawned. It can be used for either CPU or GPU training. If used for GPU training, each distributed process operates on a single GPU, which can significantly improve single-node training performance. It can also be used for multi-node distributed training by spawning multiple processes on each node, likewise improving multi-node training performance. This is especially beneficial for systems with multiple InfiniBand interfaces that have direct-GPU support, since all of them can be utilized for aggregated communication bandwidth.
......
This diff has been collapsed.
......@@ -95,7 +95,7 @@ CUDA support with mixed compilation is provided. Simply pass CUDA source files (
* **extra_include_paths** – optional list of include directories to forward to the build.
* **build_directory** – optional path to use as build workspace.
* **verbose** – If `True`, turns on verbose logging of load steps.
* **with_cuda** – Determines whether CUDA headers and libraries are added to the build. If set to `None` (default), this value is automatically determined based on the existence of `.cu` or `.cuh` in `sources`. Set it to <cite>True`</cite> to force CUDA headers and libraries to be included.
* **with_cuda** – Determines whether CUDA headers and libraries are added to the build. If set to `None` (default), this value is automatically determined based on the existence of `.cu` or `.cuh` in `sources`. Set it to `True` to force CUDA headers and libraries to be included.
* **is_python_module** – If `True` (default), imports the produced shared library as a Python module. If `False`, loads it into the process as a plain dynamic library.
|
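A hedged sketch of a JIT build with this function; `my_extension.cpp` is an assumed source file exposing a pybind11 module:

```py
from torch.utils.cpp_extension import load

my_ext = load(
    name='my_extension',
    sources=['my_extension.cpp'],   # assumed C++ source
    verbose=True,
)
```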
......@@ -138,7 +138,7 @@ See [`load()`](#torch.utils.cpp_extension.load "torch.utils.cpp_extension.load")
* **cpp_sources** – A string, or list of strings, containing C++ source code.
* **cuda_sources** – A string, or list of strings, containing CUDA source code.
* **functions** – A list of function names for which to generate function bindings. If a dictionary is given, it should map function names to docstrings (which are otherwise just the function names).
* **with_cuda** – Determines whether CUDA headers and libraries are added to the build. If set to `None` (default), this value is automatically determined based on whether `cuda_sources` is provided. Set it to <cite>True`</cite> to force CUDA headers and libraries to be included.
* **with_cuda** – Determines whether CUDA headers and libraries are added to the build. If set to `None` (default), this value is automatically determined based on whether `cuda_sources` is provided. Set it to `True` to force CUDA headers and libraries to be included.
|
| --- | --- |
......@@ -164,7 +164,7 @@ torch.utils.cpp_extension.include_paths(cuda=False)
Get the include paths required to build a C++ or CUDA extension.
| Parameters: | **cuda** – If <cite>True</cite>, includes CUDA-specific include paths. |
| Parameters: | **cuda** – If `True`, includes CUDA-specific include paths. |
| --- | --- |
| Returns: | A list of include path strings. |
| --- | --- |
......
......@@ -10,11 +10,11 @@ Load a model from a github repo, with pretrained weights.
| Parameters: |
* **github** – Required, a string with format “repo_owner/repo_name[:tag_name]” with an optional tag/branch. The default branch is <cite>master</cite> if not specified. Example: ‘pytorch/vision[:hub]’
* **github** – Required, a string with format “repo_owner/repo_name[:tag_name]” with an optional tag/branch. The default branch is `master` if not specified. Example: ‘pytorch/vision[:hub]’
* **model** – Required, a string of callable name defined in repo’s hubconf.py
* **force_reload** – Optional, whether to discard the existing cache and force a fresh download. Default is <cite>False</cite>.
* ***args** – Optional, the corresponding args for callable <cite>model</cite>.
* ****kwargs** – Optional, the corresponding kwargs for callable <cite>model</cite>.
* **force_reload** – Optional, whether to discard the existing cache and force a fresh download. Default is `False`.
* ***args** – Optional, the corresponding args for callable `model`.
* ****kwargs** – Optional, the corresponding kwargs for callable `model`.
|
| --- | --- |
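A small sketch following the format above; the `resnet18` entry point is assumed to be defined in that repo's `hubconf.py`:

```py
import torch.hub

model = torch.hub.load('pytorch/vision:master', 'resnet18', pretrained=True)
```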
......
......@@ -140,7 +140,7 @@ torch.jit.load(f, map_location=None)
Load a `ScriptModule` previously saved with `save`
All previously saved modules, no matter their device, are first loaded onto CPU, and then are moved to the devices they were saved from. If this fails (e.g. because the run time system doesn’t have certain devices), an exception is raised. However, storages can be dynamically remapped to an alternative set of devices using the <cite>map_location</cite> argument. Comparing to [`torch.load()`](torch.html#torch.load "torch.load"), <cite>map_location</cite> in this function is simplified, which only accepts a string (e.g., ‘cpu’, ‘cuda:0’), or torch.device (e.g., torch.device(‘cpu’))
All previously saved modules, no matter their device, are first loaded onto CPU, and then are moved to the devices they were saved from. If this fails (e.g. because the runtime system doesn’t have certain devices), an exception is raised. However, storages can be dynamically remapped to an alternative set of devices using the `map_location` argument. Compared to [`torch.load()`](torch.html#torch.load "torch.load"), `map_location` in this function is simplified: it only accepts a string (e.g., ‘cpu’, ‘cuda:0’) or a torch.device (e.g., torch.device(‘cpu’)).
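For instance, a sketch of forcing everything onto the CPU; `scripted_model.pt` is an assumed file name:

```py
import torch

loaded = torch.jit.load('scripted_model.pt', map_location='cpu')
# or, equivalently, with a torch.device
loaded = torch.jit.load('scripted_model.pt', map_location=torch.device('cpu'))
```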
| Parameters: |
......@@ -915,7 +915,7 @@ graph(%0 : Float(3, 4)) {
```
We can fix this by modifying the code to not use the in-place update, but rather build up the result tensor out-of-place with <cite>torch.cat</cite>:
We can fix this by modifying the code to not use the in-place update, but rather build up the result tensor out-of-place with `torch.cat`:
```py
def fill_row_zero(x):
......
......@@ -8,9 +8,9 @@ torch.utils.model_zoo.load_url(url, model_dir=None, map_location=None, progress=
Loads the Torch serialized object at the given URL.
If the object is already present in <cite>model_dir</cite>, it’s deserialized and returned. The filename part of the URL should follow the naming convention `filename-<sha256>.ext` where `<sha256>` is the first eight or more digits of the SHA256 hash of the contents of the file. The hash is used to ensure unique names and to verify the contents of the file.
If the object is already present in `model_dir`, it’s deserialized and returned. The filename part of the URL should follow the naming convention `filename-<sha256>.ext` where `<sha256>` is the first eight or more digits of the SHA256 hash of the contents of the file. The hash is used to ensure unique names and to verify the contents of the file.
The default value of <cite>model_dir</cite> is `$TORCH_HOME/models` where `$TORCH_HOME` defaults to `~/.torch`. The default directory can be overridden with the `$TORCH_MODEL_ZOO` environment variable.
The default value of `model_dir` is `$TORCH_HOME/models` where `$TORCH_HOME` defaults to `~/.torch`. The default directory can be overridden with the `$TORCH_MODEL_ZOO` environment variable.
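A sketch mirroring the documented usage (network access and the default cache directory are assumed):

```py
>>> state_dict = torch.utils.model_zoo.load_url(
...     'https://s3.amazonaws.com/pytorch/models/resnet18-5c106cde.pth')
```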
| Parameters: |
......
This diff has been collapsed.
This diff has been collapsed.

\ No newline at end of file
# torch.nn.init
```py
torch.nn.init.calculate_gain(nonlinearity, param=None)
```
Return the recommended gain value for the given nonlinearity function. The values are as follows:
| nonlinearity | gain |
| --- | --- |
| Linear / Identity | ![](http://latex.codecogs.com/gif.latex?1) |
| Conv{1,2,3}D | ![](http://latex.codecogs.com/gif.latex?1) |
| Sigmoid | ![](http://latex.codecogs.com/gif.latex?1) |
| Tanh | ![](http://latex.codecogs.com/gif.latex?%5Cfrac%7B5%7D%7B3%7D) |
| ReLU | ![](http://latex.codecogs.com/gif.latex?%5Csqrt%7B2%7D) |
| Leaky ReLU | ![](http://latex.codecogs.com/gif.latex?%5Csqrt%7B%5Cfrac%7B2%7D%7B1%20%2B%20%5Ctext%7Bnegative%5C_slope%7D%5E2%7D%7D) |
| Parameters: |
* **nonlinearity** – the non-linear function (`nn.functional` name)
* **param** – optional parameter for the non-linear function
|
| --- | --- |
Examples
```py
>>> gain = nn.init.calculate_gain('leaky_relu')
```
```py
torch.nn.init.uniform_(tensor, a=0, b=1)
```
Fills the input Tensor with values drawn from the uniform distribution ![](http://latex.codecogs.com/gif.latex?%5Cmathcal%7BU%7D(a%2C%20b)).
| Parameters: |
* **tensor** – an n-dimensional `torch.Tensor`
* **a** – the lower bound of the uniform distribution
* **b** – the upper bound of the uniform distribution
|
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 5)
>>> nn.init.uniform_(w)
```
```py
torch.nn.init.normal_(tensor, mean=0, std=1)
```
Fills the input Tensor with values drawn from the normal distribution ![](http://latex.codecogs.com/gif.latex?%5Cmathcal%7BN%7D(%5Ctext%7Bmean%7D%2C%20%5Ctext%7Bstd%7D)).
| Parameters: |
* **tensor** – an n-dimensional `torch.Tensor`
* **mean** – the mean of the normal distribution
* **std** – the standard deviation of the normal distribution
|
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 5)
>>> nn.init.normal_(w)
```
```py
torch.nn.init.constant_(tensor, val)
```
Fills the input Tensor with the value ![](http://latex.codecogs.com/gif.latex?%5Ctext%7Bval%7D).
| Parameters: |
* **tensor** – an n-dimensional `torch.Tensor`
* **val** – the value to fill the tensor with
|
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 5)
>>> nn.init.constant_(w, 0.3)
```
```py
torch.nn.init.eye_(tensor)
```
Fills the 2-dimensional input `Tensor` with the identity matrix. Preserves the identity of the inputs in `Linear` layers, where as many inputs are preserved as possible.
| Parameters: | **tensor** – a 2-dimensional `torch.Tensor` |
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 5)
>>> nn.init.eye_(w)
```
```py
torch.nn.init.dirac_(tensor)
```
Fills the {3, 4, 5}-dimensional input `Tensor` with the Dirac delta function. Preserves the identity of the inputs in `Convolutional` layers, where as many input channels are preserved as possible.
| Parameters: | **tensor** – a {3, 4, 5}-dimensional `torch.Tensor` |
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 16, 5, 5)
>>> nn.init.dirac_(w)
```
```py
torch.nn.init.xavier_uniform_(tensor, gain=1)
```
Fills the input `Tensor` with values according to the method described in “Understanding the difficulty of training deep feedforward neural networks” - Glorot, X. & Bengio, Y. (2010), using a uniform distribution. The resulting tensor will have values sampled from ![](http://latex.codecogs.com/gif.latex?%5Cmathcal%7BU%7D(-a%2C%20a)) where
![](http://latex.codecogs.com/gif.latex?%0D%0Aa%20%3D%20%5Ctext%7Bgain%7D%20%5Ctimes%20%5Csqrt%7B%5Cfrac%7B6%7D%7B%5Ctext%7Bfan%5C_in%7D%20%2B%20%5Ctext%7Bfan%5C_out%7D%7D%7D%0D%0A%0D%0A)
Also known as Glorot initialization.
| Parameters: |
* **tensor** – an n-dimensional `torch.Tensor`
* **gain** – an optional scaling factor
|
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 5)
>>> nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))
```
```py
torch.nn.init.xavier_normal_(tensor, gain=1)
```
Fills the input `Tensor` with values according to the method described in “Understanding the difficulty of training deep feedforward neural networks” - Glorot, X. & Bengio, Y. (2010), using a normal distribution. The resulting tensor will have values sampled from ![](http://latex.codecogs.com/gif.latex?%5Cmathcal%7BN%7D(0%2C%20%5Ctext%7Bstd%7D)) where
![](http://latex.codecogs.com/gif.latex?%0D%0A%5Ctext%7Bstd%7D%20%3D%20%5Ctext%7Bgain%7D%20%5Ctimes%20%5Csqrt%7B%5Cfrac%7B2%7D%7B%5Ctext%7Bfan%5C_in%7D%20%2B%20%5Ctext%7Bfan%5C_out%7D%7D%7D%0D%0A%0D%0A)
Also known as Glorot initialization.
| Parameters: |
* **tensor** – an n-dimensional `torch.Tensor`
* **gain** – an optional scaling factor
|
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 5)
>>> nn.init.xavier_normal_(w)
```
```py
torch.nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
```
Fills the input `Tensor` with values according to the method described in “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification” - He, K. et al. (2015), using a uniform distribution. The resulting tensor will have values sampled from ![](http://latex.codecogs.com/gif.latex?%5Cmathcal%7BU%7D(-%5Ctext%7Bbound%7D%2C%20%5Ctext%7Bbound%7D)) where
![](http://latex.codecogs.com/gif.latex?%0D%0A%5Ctext%7Bbound%7D%20%3D%20%5Csqrt%7B%5Cfrac%7B6%7D%7B(1%20%2B%20a%5E2)%20%5Ctimes%20%5Ctext%7Bfan%5C_in%7D%7D%7D%0D%0A%0D%0A)
Also known as He initialization.
| Parameters: |
* **tensor** – an n-dimensional `torch.Tensor`
* **a** – the negative slope of the rectifier used after this layer (0 for ReLU by default)
* **mode** – either ‘fan_in’ (default) or ‘fan_out’. Choosing `fan_in` preserves the magnitude of the variance of the weights in the forward pass. Choosing `fan_out` preserves the magnitudes in the backwards pass.
* **nonlinearity** – the non-linear function (`nn.functional` name), recommended to use only with ‘relu’ or ‘leaky_relu’ (default).
|
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 5)
>>> nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu')
```
```py
torch.nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
```
Fills the input `Tensor` with values according to the method described in “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification” - He, K. et al. (2015), using a normal distribution. The resulting tensor will have values sampled from ![](http://latex.codecogs.com/gif.latex?%5Cmathcal%7BN%7D(0%2C%20%5Ctext%7Bstd%7D)) where
![](http://latex.codecogs.com/gif.latex?%0D%0A%5Ctext%7Bstd%7D%20%3D%20%5Csqrt%7B%5Cfrac%7B2%7D%7B(1%20%2B%20a%5E2)%20%5Ctimes%20%5Ctext%7Bfan%5C_in%7D%7D%7D%0D%0A%0D%0A)
Also known as He initialization.
| Parameters: |
* **tensor** – an n-dimensional `torch.Tensor`
* **a** – the negative slope of the rectifier used after this layer (0 for ReLU by default)
* **mode** – either ‘fan_in’ (default) or ‘fan_out’. Choosing `fan_in` preserves the magnitude of the variance of the weights in the forward pass. Choosing `fan_out` preserves the magnitudes in the backwards pass.
* **nonlinearity** – the non-linear function (`nn.functional` name), recommended to use only with ‘relu’ or ‘leaky_relu’ (default).
|
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 5)
>>> nn.init.kaiming_normal_(w, mode='fan_out', nonlinearity='relu')
```
```py
torch.nn.init.orthogonal_(tensor, gain=1)
```
Fills the input `Tensor` with a (semi) orthogonal matrix, as described in “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks” - Saxe, A. et al. (2013). The input tensor must have at least 2 dimensions, and for tensors with more than 2 dimensions the trailing dimensions are flattened.
| Parameters: |
* **tensor** – an n-dimensional `torch.Tensor`, where ![](http://latex.codecogs.com/gif.latex?n%20%5Cgeq%202)
* **gain** – optional scaling factor
|
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 5)
>>> nn.init.orthogonal_(w)
```
```py
torch.nn.init.sparse_(tensor, sparsity, std=0.01)
```
Fills the 2D input `Tensor` as a sparse matrix, where the non-zero elements will be drawn from the normal distribution ![](http://latex.codecogs.com/gif.latex?%5Cmathcal%7BN%7D(0%2C%200.01)), as described in “Deep learning via Hessian-free optimization” - Martens, J. (2010).
| Parameters: |
* **tensor** – an n-dimensional `torch.Tensor`
* **sparsity** – The fraction of elements in each column to be set to zero
* **std** – the standard deviation of the normal distribution used to generate the non-zero values
|
| --- | --- |
Examples
```py
>>> w = torch.empty(3, 5)
>>> nn.init.sparse_(w, sparsity=0.1)
```
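These initializers are typically applied module-by-module; a hedged sketch using `nn.Module.apply()` (the layer sizes are arbitrary):

```py
import torch.nn as nn

def init_weights(m):
    # initialize only the linear layers; other module types are left untouched
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight, gain=nn.init.calculate_gain('relu'))
        nn.init.constant_(m.bias, 0.0)

net = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 10))
net.apply(init_weights)
```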
\ No newline at end of file
# What is <cite>torch.nn</cite> _really_?
# What is `torch.nn` _really_?
by Jeremy Howard, [fast.ai](https://www.fast.ai). Thanks to Rachel Thomas and Francisco Ingham.
......@@ -117,7 +117,7 @@ bias = torch.zeros(10, requires_grad=True)
```
Thanks to PyTorch’s ability to calculate gradients automatically, we can use any standard Python function (or callable object) as a model! So let’s just write a plain matrix multiplication and broadcasted addition to create a simple linear model. We also need an activation function, so we’ll write <cite>log_softmax</cite> and use it. Remember: although PyTorch provides lots of pre-written loss functions, activation functions, and so forth, you can easily write your own using plain python. PyTorch will even create fast GPU or vectorized CPU code for your function automatically.
Thanks to PyTorch’s ability to calculate gradients automatically, we can use any standard Python function (or callable object) as a model! So let’s just write a plain matrix multiplication and broadcasted addition to create a simple linear model. We also need an activation function, so we’ll write `log_softmax` and use it. Remember: although PyTorch provides lots of pre-written loss functions, activation functions, and so forth, you can easily write your own using plain python. PyTorch will even create fast GPU or vectorized CPU code for your function automatically.
```py
def log_softmax(x):
......@@ -775,7 +775,7 @@ Out:
`torch.nn` has another handy class we can use to simplify our code: [Sequential](https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential). A `Sequential` object runs each of the modules contained within it, in a sequential manner. This is a simpler way of writing our neural network.
To take advantage of this, we need to be able to easily define a **custom layer** from a given function. For instance, PyTorch doesn’t have a <cite>view</cite> layer, and we need to create one for our network. `Lambda` will create a layer that we can then use when defining a network with `Sequential`.
To take advantage of this, we need to be able to easily define a **custom layer** from a given function. For instance, PyTorch doesn’t have a `view` layer, and we need to create one for our network. `Lambda` will create a layer that we can then use when defining a network with `Sequential`.
```py
class Lambda(nn.Module):
......@@ -948,7 +948,7 @@ Out:
## Closing thoughts
We now have a general data pipeline and training loop which you can use for training many types of models using Pytorch. To see how simple training a model can now be, take a look at the <cite>mnist_sample</cite> sample notebook.
We now have a general data pipeline and training loop which you can use for training many types of models using Pytorch. To see how simple training a model can now be, take a look at the `mnist_sample` sample notebook.
Of course, there are many things you’ll want to add, such as data augmentation, hyperparameter tuning, monitoring training, transfer learning, and so forth. These features are available in the fastai library, which has been developed using the same design approach shown in this tutorial, providing a natural next step for practitioners looking to take their models further.
......@@ -956,7 +956,7 @@ We promised at the start of this tutorial we’d explain through example each of
> * **torch.nn**
> * `Module`: creates a callable which behaves like a function, but can also contain state (such as neural net layer weights). It knows what `Parameter` (s) it contains and can zero all their gradients, loop through them for weight updates, etc.
> * `Parameter`: a wrapper for a tensor that tells a `Module` that it has weights that need updating during backprop. Only tensors with the <cite>requires_grad</cite> attribute set are updated
> * `Parameter`: a wrapper for a tensor that tells a `Module` that it has weights that need updating during backprop. Only tensors with the `requires_grad` attribute set are updated
> * `functional`: a module (usually imported into the `F` namespace by convention) which contains activation functions, loss functions, etc, as well as non-stateful versions of layers such as convolutional and linear layers.
> * `torch.optim`: Contains optimizers such as `SGD`, which update the weights of `Parameter` during the backward step
> * `Dataset`: An abstract interface of objects with a `__len__` and a `__getitem__`, including classes provided with Pytorch such as `TensorDataset`
......
......@@ -98,7 +98,7 @@ Note that the introduction of broadcasting can cause backwards incompatible chan
```
would previously produce a Tensor with size: torch.Size([4,1]), but now produces a Tensor with size: torch.Size([4,4]). In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist, you may set <cite>torch.utils.backcompat.broadcast_warning.enabled</cite> to <cite>True</cite>, which will generate a python warning in such cases.
would previously produce a Tensor with size: torch.Size([4,1]), but now produces a Tensor with size: torch.Size([4,4]). In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist, you may set `torch.utils.backcompat.broadcast_warning.enabled` to `True`, which will generate a python warning in such cases.
For Example:
......
......@@ -53,7 +53,7 @@ By default, GPU operations are asynchronous. When you call a function that uses
In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation will proceed as if every operation was executed synchronously.
You can force synchronous computation by setting environment variable <cite>CUDA_LAUNCH_BLOCKING=1</cite>. This can be handy when an error occurs on the GPU. (With asynchronous execution, such an error isn’t reported until after the operation is actually executed, so the stack trace does not show where it was requested.)
You can force synchronous computation by setting environment variable `CUDA_LAUNCH_BLOCKING=1`. This can be handy when an error occurs on the GPU. (With asynchronous execution, such an error isn’t reported until after the operation is actually executed, so the stack trace does not show where it was requested.)
As an exception, several functions such as [`to()`](../tensors.html#torch.Tensor.to "torch.Tensor.to") and [`copy_()`](../tensors.html#torch.Tensor.copy_ "torch.Tensor.copy_") admit an explicit `non_blocking` argument, which lets the caller bypass synchronization when it is unnecessary. Another exception is CUDA streams, explained below.
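A small sketch of an asynchronous host-to-device copy; pinned (page-locked) host memory is what makes `non_blocking=True` effective:

```py
import torch

x = torch.randn(1024, 1024).pin_memory()   # page-locked host tensor
y = x.to('cuda', non_blocking=True)         # returns immediately; the copy overlaps with other work
torch.cuda.synchronize()                     # explicit sync point once the result is actually needed
```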
......
......@@ -22,7 +22,7 @@ for i in range(10000):
```
Here, `total_loss` is accumulating history across your training loop, since `loss` is a differentiable variable with autograd history. You can fix this by writing <cite>total_loss += float(loss)</cite> instead.
Here, `total_loss` is accumulating history across your training loop, since `loss` is a differentiable variable with autograd history. You can fix this by writing `total_loss += float(loss)` instead.
Other instances of this problem: [1](https://discuss.pytorch.org/t/resolved-gpu-out-of-memory-error-with-batch-size-1/3719).
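A sketch of the repaired loop, assuming the `model`, `criterion`, and `optimizer` objects from the snippet above:

```py
total_loss = 0
for i in range(10000):
    optimizer.zero_grad()
    output = model(input)
    loss = criterion(output)
    loss.backward()
    optimizer.step()
    total_loss += float(loss)   # a plain Python number carries no autograd history
```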
......
......@@ -4,7 +4,7 @@
**Author**: [Adam Paszke](https://github.com/apaszke)
**Updated by**: <cite>Adam Dziedzic</cite> [[https://github.com/adam-dziedzic](https://github.com/adam-dziedzic](https://github.com/adam-dziedzic](https://github.com/adam-dziedzic))
**Updated by**: `Adam Dziedzic` [https://github.com/adam-dziedzic](https://github.com/adam-dziedzic)
In this tutorial, we shall go through two tasks:
......
......@@ -87,7 +87,7 @@ onnx.helper.printable_graph(model.graph)
```
To run the exported script with [caffe2](https://caffe2.ai/), you will need to install <cite>caffe2</cite>: If you don’t have one already, Please [follow the install instructions](https://caffe2.ai/docs/getting-started.html).
To run the exported script with [caffe2](https://caffe2.ai/), you will need to install `caffe2`. If you don’t have it installed already, please [follow the install instructions](https://caffe2.ai/docs/getting-started.html).
Once these are installed, you can use the backend for Caffe2:
......
......@@ -14,7 +14,7 @@ To construct an [`Optimizer`](#torch.optim.Optimizer "torch.optim.Optimizer") yo
Note
If you need to move a model to GPU via <cite>.cuda()</cite>, please do so before constructing optimizers for it. Parameters of a model after <cite>.cuda()</cite> will be different objects with those before the call.
If you need to move a model to GPU via `.cuda()`, please do so before constructing optimizers for it. Parameters of a model after `.cuda()` will be different objects from those before the call.
In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.
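A minimal sketch of the recommended ordering; `MyModel` is a placeholder:

```py
import torch.optim as optim

model = MyModel().cuda()      # move the model first (MyModel is a placeholder)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)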
......@@ -108,7 +108,7 @@ Parameters need to be specified as collections that have a deterministic orderin
add_param_group(param_group)
```
Add a param group to the [`Optimizer`](#torch.optim.Optimizer "torch.optim.Optimizer") s <cite>param_groups</cite>.
Add a param group to the [`Optimizer`](#torch.optim.Optimizer "torch.optim.Optimizer")’s `param_groups`.
This can be useful when fine tuning a pre-trained network as frozen layers can be made trainable and added to the [`Optimizer`](#torch.optim.Optimizer "torch.optim.Optimizer") as training progresses.
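For example, a hedged sketch of unfreezing a layer mid-training; `model.fc` stands in for the newly unfrozen sub-module:

```py
for p in model.fc.parameters():
    p.requires_grad = True
optimizer.add_param_group({'params': model.fc.parameters(), 'lr': 1e-4})
```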
......@@ -637,12 +637,12 @@ Reduce learning rate when a metric has stopped improving. Models often benefit f
| Parameters: |
* **optimizer** ([_Optimizer_](#torch.optim.Optimizer "torch.optim.Optimizer")) – Wrapped optimizer.
* **mode** ([_str_](https://docs.python.org/3/library/stdtypes.html#str "(in Python v3.7)")) – One of <cite>min</cite>, <cite>max</cite>. In <cite>min</cite> mode, lr will be reduced when the quantity monitored has stopped decreasing; in <cite>max</cite> mode it will be reduced when the quantity monitored has stopped increasing. Default: ‘min’.
* **mode** ([_str_](https://docs.python.org/3/library/stdtypes.html#str "(in Python v3.7)")) – One of `min`, `max`. In `min` mode, lr will be reduced when the quantity monitored has stopped decreasing; in `max` mode it will be reduced when the quantity monitored has stopped increasing. Default: ‘min’.
* **factor** ([_float_](https://docs.python.org/3/library/functions.html#float "(in Python v3.7)")) – Factor by which the learning rate will be reduced. new_lr = lr * factor. Default: 0.1.
* **patience** ([_int_](https://docs.python.org/3/library/functions.html#int "(in Python v3.7)")) – Number of epochs with no improvement after which learning rate will be reduced. For example, if <cite>patience = 2</cite>, then we will ignore the first 2 epochs with no improvement, and will only decrease the LR after the 3rd epoch if the loss still hasn’t improved then. Default: 10.
* **patience** ([_int_](https://docs.python.org/3/library/functions.html#int "(in Python v3.7)")) – Number of epochs with no improvement after which learning rate will be reduced. For example, if `patience = 2`, then we will ignore the first 2 epochs with no improvement, and will only decrease the LR after the 3rd epoch if the loss still hasn’t improved then. Default: 10.
* **verbose** ([_bool_](https://docs.python.org/3/library/functions.html#bool "(in Python v3.7)")) – If `True`, prints a message to stdout for each update. Default: `False`.
* **threshold** ([_float_](https://docs.python.org/3/library/functions.html#float "(in Python v3.7)")) – Threshold for measuring the new optimum, to only focus on significant changes. Default: 1e-4.
* **threshold_mode** ([_str_](https://docs.python.org/3/library/stdtypes.html#str "(in Python v3.7)")) – One of <cite>rel</cite>, <cite>abs</cite>. In <cite>rel</cite> mode, dynamic_threshold = best * ( 1 + threshold ) in ‘max’ mode or best * ( 1 - threshold ) in <cite>min</cite> mode. In <cite>abs</cite> mode, dynamic_threshold = best + threshold in <cite>max</cite> mode or best - threshold in <cite>min</cite> mode. Default: ‘rel’.
* **threshold_mode** ([_str_](https://docs.python.org/3/library/stdtypes.html#str "(in Python v3.7)")) – One of `rel`, `abs`. In `rel` mode, dynamic_threshold = best * ( 1 + threshold ) in ‘max’ mode or best * ( 1 - threshold ) in `min` mode. In `abs` mode, dynamic_threshold = best + threshold in `max` mode or best - threshold in `min` mode. Default: ‘rel’.
* **cooldown** ([_int_](https://docs.python.org/3/library/functions.html#int "(in Python v3.7)")) – Number of epochs to wait before resuming normal operation after lr has been reduced. Default: 0.
* **min_lr** ([_float_](https://docs.python.org/3/library/functions.html#float "(in Python v3.7)") _or_ [_list_](https://docs.python.org/3/library/stdtypes.html#list "(in Python v3.7)")) – A scalar or a list of scalars. A lower bound on the learning rate of all param groups or each group respectively. Default: 0.
* **eps** ([_float_](https://docs.python.org/3/library/functions.html#float "(in Python v3.7)")) – Minimal decay applied to lr. If the difference between new and old lr is smaller than eps, the update is ignored. Default: 1e-8.
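A sketch mirroring the documented usage; `model`, `train(...)`, and `validate(...)` are placeholders:

```py
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
>>> for epoch in range(10):
>>>     train(...)
>>>     val_loss = validate(...)
>>>     # step() is called after validation with the monitored metric
>>>     scheduler.step(val_loss)
```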
......
......@@ -22,7 +22,7 @@ Strictly speaking, we will present the state as the difference between the curre
**Packages**
First, let’s import needed packages. Firstly, we need [gym](https://gym.openai.com/docs) for the environment (Install using <cite>pip install gym</cite>). We’ll also use the following from PyTorch:
First, let’s import the needed packages. We need [gym](https://gym.openai.com/docs) for the environment (install using `pip install gym`). We’ll also use the following from PyTorch:
* neural networks (`torch.nn`)
* optimization (`torch.optim`)
......@@ -342,7 +342,7 @@ def optimize_model():
Below, you can find the main training loop. At the beginning we reset the environment and initialize the `state` Tensor. Then, we sample an action, execute it, observe the next screen and the reward (always 1), and optimize our model once. When the episode ends (our model fails), we restart the loop.
Below, <cite>num_episodes</cite> is set small. You should download the notebook and run lot more epsiodes, such as 300+ for meaningful duration improvements.
Below, `num_episodes` is set small. You should download the notebook and run a lot more episodes, such as 300+, for meaningful duration improvements.
```py
num_episodes = 50
......
......@@ -211,7 +211,7 @@ _nnz()
torch.sparse.addmm(mat, mat1, mat2, beta=1, alpha=1)
```
This function does exact same thing as [`torch.addmm()`](torch.html#torch.addmm "torch.addmm") in the forward, except that it supports backward for sparse matrix `mat1`. `mat1` need to have <cite>sparse_dim = 2</cite>. Note that the gradients of `mat1` is a coalesced sparse tensor.
This function does the exact same thing as [`torch.addmm()`](torch.html#torch.addmm "torch.addmm") in the forward, except that it supports backward for the sparse matrix `mat1`. `mat1` needs to have `sparse_dim = 2`. Note that the gradient of `mat1` is a coalesced sparse tensor.
| Parameters: |
......@@ -228,7 +228,7 @@ This function does exact same thing as [`torch.addmm()`](torch.html#torch.addmm
torch.sparse.mm(mat1, mat2)
```
Performs a matrix multiplication of the sparse matrix `mat1` and dense matrix `mat2`. Similar to [`torch.mm()`](torch.html#torch.mm "torch.mm"), If `mat1` is a ![](img/b2d82f601df5521e215e30962b942ad1.jpg) tensor, `mat2` is a ![](img/ec84c2d649caa2a7d4dc59b6b23b0278.jpg) tensor, out will be a ![](img/42cdcd96fd628658ac0e3e7070ba08d5.jpg) dense tensor. `mat1` need to have <cite>sparse_dim = 2</cite>. This function also supports backward for both matrices. Note that the gradients of `mat1` is a coalesced sparse tensor.
Performs a matrix multiplication of the sparse matrix `mat1` and the dense matrix `mat2`. Similar to [`torch.mm()`](torch.html#torch.mm "torch.mm"), if `mat1` is a ![](img/b2d82f601df5521e215e30962b942ad1.jpg) tensor and `mat2` is a ![](img/ec84c2d649caa2a7d4dc59b6b23b0278.jpg) tensor, out will be a ![](img/42cdcd96fd628658ac0e3e7070ba08d5.jpg) dense tensor. `mat1` needs to have `sparse_dim = 2`. This function also supports backward for both matrices. Note that the gradient of `mat1` is a coalesced sparse tensor.
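A small sketch; the indices and values are arbitrary:

```py
>>> i = torch.tensor([[0, 0, 1], [0, 1, 1]])
>>> v = torch.tensor([3., 4., 5.])
>>> mat1 = torch.sparse_coo_tensor(i, v, (2, 2))   # sparse_dim = 2
>>> mat2 = torch.randn(2, 3)
>>> torch.sparse.mm(mat1, mat2)
```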
| Parameters: |
......@@ -271,9 +271,9 @@ tensor(indices=tensor([[0, 0, 0, 1, 1, 1],
torch.sparse.sum(input, dim=None, dtype=None)
```
Returns the sum of each row of SparseTensor `input` in the given dimensions `dim`. If :attr::<cite>dim</cite> is a list of dimensions, reduce over all of them. When sum over all `sparse_dim`, this method returns a Tensor instead of SparseTensor.
Returns the sum of each row of the SparseTensor `input` in the given dimensions `dim`. If `dim` is a list of dimensions, reduce over all of them. When summing over all `sparse_dim` dimensions, this method returns a Tensor instead of a SparseTensor.
All summed `dim` are squeezed (see [`torch.squeeze()`](torch.html#torch.squeeze "torch.squeeze")), resulting an output tensor having :attr::<cite>dim</cite> fewer dimensions than `input`.
All summed `dim` are squeezed (see [`torch.squeeze()`](torch.html#torch.squeeze "torch.squeeze")), resulting in an output tensor having `dim` fewer dimensions than `input`.
During backward, only gradients at `nnz` locations of `input` will propagate back. Note that the gradient of `input` is coalesced.
......
......@@ -87,9 +87,9 @@ static from_buffer()
static from_file(filename, shared=False, size=0) → Storage
```
If <cite>shared</cite> is <cite>True</cite>, then memory is shared between all processes. All changes are written to the file. If <cite>shared</cite> is <cite>False</cite>, then the changes on the storage do not affect the file.
If `shared` is `True`, then memory is shared between all processes. All changes are written to the file. If `shared` is `False`, then the changes on the storage do not affect the file.
<cite>size</cite> is the number of elements in the storage. If <cite>shared</cite> is <cite>False</cite>, then the file must contain at least <cite>size * sizeof(Type)</cite> bytes (<cite>Type</cite> is the type of storage). If <cite>shared</cite> is <cite>True</cite> the file will be created if needed.
`size` is the number of elements in the storage. If `shared` is `False`, then the file must contain at least `size * sizeof(Type)` bytes (`Type` is the type of storage). If `shared` is `True` the file will be created if needed.
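A hedged sketch; `shared.bin` is an assumed file name:

```py
>>> s = torch.FloatStorage.from_file('shared.bin', shared=True, size=10)
>>> t = torch.FloatTensor(s)   # a tensor viewing the shared storage
```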
| Parameters: |
......@@ -178,7 +178,7 @@ Returns a list containing the elements of this storage
type(dtype=None, non_blocking=False, **kwargs)
```
Returns the type if <cite>dtype</cite> is not provided, else casts this object to the specified type.
Returns the type if `dtype` is not provided, else casts this object to the specified type.
If this is already of the correct type, no copy is performed and the original object is returned.
......
......@@ -126,7 +126,7 @@ Warning
Warning
When data is a tensor <cite>x</cite>, [`new_tensor()`](#torch.Tensor.new_tensor "torch.Tensor.new_tensor") reads out ‘the data’ from whatever it is passed, and constructs a leaf variable. Therefore `tensor.new_tensor(x)` is equivalent to `x.clone().detach()` and `tensor.new_tensor(x, requires_grad=True)` is equivalent to `x.clone().detach().requires_grad_(True)`. The equivalents using `clone()` and `detach()` are recommended.
When data is a tensor `x`, [`new_tensor()`](#torch.Tensor.new_tensor "torch.Tensor.new_tensor") reads out ‘the data’ from whatever it is passed, and constructs a leaf variable. Therefore `tensor.new_tensor(x)` is equivalent to `x.clone().detach()` and `tensor.new_tensor(x, requires_grad=True)` is equivalent to `x.clone().detach().requires_grad_(True)`. The equivalents using `clone()` and `detach()` are recommended.
| Parameters: |
......@@ -571,7 +571,7 @@ Returns a copy of the `self` tensor. The copy has the same size and data type as
Note
Unlike <cite>copy_()</cite>, this function is recorded in the computation graph. Gradients propagating to the cloned tensor will propagate to the original tensor.
Unlike `copy_()`, this function is recorded in the computation graph. Gradients propagating to the cloned tensor will propagate to the original tensor.
```py
contiguous() → Tensor
......@@ -1129,8 +1129,8 @@ If `accumulate` is `True`, the elements in [`tensor`](torch.html#torch.tensor "t
| Parameters: |
* **indices** (_tuple of LongTensor_) – tensors used to index into <cite>self</cite>.
* **value** ([_Tensor_](#torch.Tensor "torch.Tensor")) – tensor of same dtype as <cite>self</cite>.
* **indices** (_tuple of LongTensor_) – tensors used to index into `self`.
* **value** ([_Tensor_](#torch.Tensor "torch.Tensor")) – tensor of same dtype as `self`.
* **accumulate** ([_bool_](https://docs.python.org/3/library/functions.html#bool "(in Python v3.7)")) – whether to accumulate into self
|
......@@ -1510,7 +1510,7 @@ See [`torch.nonzero()`](torch.html#torch.nonzero "torch.nonzero")
norm(p='fro', dim=None, keepdim=False)
```
See :func: <cite>torch.norm</cite>
See [`torch.norm()`](torch.html#torch.norm "torch.norm")
```py
normal_(mean=0, std=1, *, generator=None) → Tensor
......@@ -1652,7 +1652,7 @@ See [`torch.qr()`](torch.html#torch.qr "torch.qr")
random_(from=0, to=None, *, generator=None) → Tensor
```
Fills `self` tensor with numbers sampled from the discrete uniform distribution over `[from, to - 1]`. If not specified, the values are usually only bounded by `self` tensor’s data type. However, for floating point types, if unspecified, range will be `[0, 2^mantissa]` to ensure that every value is representable. For example, <cite>torch.tensor(1, dtype=torch.double).random_()</cite> will be uniform in `[0, 2^53]`.
Fills `self` tensor with numbers sampled from the discrete uniform distribution over `[from, to - 1]`. If not specified, the values are usually only bounded by `self` tensor’s data type. However, for floating point types, if unspecified, range will be `[0, 2^mantissa]` to ensure that every value is representable. For example, `torch.tensor(1, dtype=torch.double).random_()` will be uniform in `[0, 2^53]`.
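For example, a small sketch of the two cases described above:

```py
import torch

t = torch.empty(4, dtype=torch.int64)
t.random_(0, 10)   # uniform over the integers 0..9, i.e. [from, to - 1]

d = torch.tensor(1, dtype=torch.double)
d.random_()        # bounds unspecified: uniform over [0, 2^53] for double
```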
```py
reciprocal() → Tensor
......@@ -2420,7 +2420,7 @@ In-place version of [`trunc()`](#torch.Tensor.trunc "torch.Tensor.trunc")
type(dtype=None, non_blocking=False, **kwargs) → str or Tensor
```
Returns the type if <cite>dtype</cite> is not provided, else casts this object to the specified type.
Returns the type if `dtype` is not provided, else casts this object to the specified type.
If this is already of the correct type, no copy is performed and the original object is returned.
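A quick sketch of both behaviours (returning the type string vs. casting):

```py
import torch

t = torch.ones(2)
print(t.type())                   # 'torch.FloatTensor' – dtype not provided
d = t.type(torch.DoubleTensor)    # cast: returns a new tensor of the requested type
same = d.type(torch.DoubleTensor)
print(same is d)                  # True – already the correct type, no copy is made
```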
......@@ -2460,7 +2460,7 @@ Returns a tensor which contains all slices of size [`size`](#torch.Tensor.size "
Step between two slices is given by `step`.
If <cite>sizedim</cite> is the size of dimension dim for `self`, the size of dimension [`dim`](#torch.Tensor.dim "torch.Tensor.dim") in the returned tensor will be <cite>(sizedim - size) / step + 1</cite>.
If `sizedim` is the size of dimension `dim` for `self`, the size of dimension [`dim`](#torch.Tensor.dim "torch.Tensor.dim") in the returned tensor will be `(sizedim - size) / step + 1`.

An additional dimension of size `size` is appended in the returned tensor.
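A minimal example of the size formula above:

```py
import torch

x = torch.arange(1., 8.)    # 7 elements, so sizedim = 7
y = x.unfold(0, 2, 1)       # size=2, step=1 → (7 - 2) / 1 + 1 = 6 slices
print(y.shape)              # torch.Size([6, 2]); the trailing dimension holds each slice
```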
......
This diff has been collapsed.
......@@ -8,7 +8,7 @@ TorchScript supports a large subset of operations provided by the `torch` packag
The following paragraphs give an example of writing a TorchScript custom op to call into [OpenCV](https://www.opencv.org), a computer vision library written in C++. We will discuss how to work with tensors in C++, how to efficiently convert them to third party tensor formats (in this case, OpenCV `Mat`s), how to register your operator with the TorchScript runtime and finally how to compile the operator and use it in Python and C++.
This tutorial assumes you have the _preview release_ of PyTorch 1.0 installed via `pip` or <cite>conda</cite>. See [https://pytorch.org/get-started/locally](https://pytorch.org/get-started/locally) for instructions on grabbing the latest release of PyTorch 1.0\. Alternatively, you can compile PyTorch from source. The documentation in [this file](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md) will assist you with this.
This tutorial assumes you have the _preview release_ of PyTorch 1.0 installed via `pip` or `conda`. See [https://pytorch.org/get-started/locally](https://pytorch.org/get-started/locally) for instructions on grabbing the latest release of PyTorch 1.0\. Alternatively, you can compile PyTorch from source. The documentation in [this file](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md) will assist you with this.
## Implementing the Custom Operator in C++
......@@ -678,7 +678,7 @@ The JIT compilation feature provided by the PyTorch C++ extension toolkit allows
Note
“JIT compilation” here has nothing to do with the JIT compilation taking place in the TorchScript compiler to optimize your program. It simply means that your custom operator C++ code will be compiled in a folder under your system’s <cite>/tmp</cite> directory the first time you import it, as if you had compiled it yourself beforehand.
“JIT compilation” here has nothing to do with the JIT compilation taking place in the TorchScript compiler to optimize your program. It simply means that your custom operator C++ code will be compiled in a folder under your system’s `/tmp` directory the first time you import it, as if you had compiled it yourself beforehand.
This JIT compilation feature comes in two flavors. In the first, you still keep your operator implementation in a separate file (`op.cpp`), and then use `torch.utils.cpp_extension.load()` to compile your extension. Usually, this function will return the Python module exposing your C++ extension. However, since we are not compiling our custom operator into its own Python module, we only want to compile a plain shared library. Fortunately, `torch.utils.cpp_extension.load()` has an argument `is_python_module` which we can set to `False` to indicate that we are only interested in building a shared library and not a Python module. `torch.utils.cpp_extension.load()` will then compile and also load the shared library into the current process, just like `torch.ops.load_library` did before:
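A hedged sketch of such a call; the operator name, source file, and linker flags below are illustrative rather than the tutorial's exact values:

```py
import torch.utils.cpp_extension

# is_python_module=False builds and loads a plain shared library instead of a
# Python extension module; the custom op then becomes reachable via torch.ops.
torch.utils.cpp_extension.load(
    name="my_custom_op",
    sources=["op.cpp"],
    extra_ldflags=["-lopencv_core", "-lopencv_imgproc"],
    is_python_module=False,
    verbose=True,
)
```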
......
......@@ -332,7 +332,7 @@ class torchvision.datasets.CIFAR100(root, train=True, transform=None, target_tra
[CIFAR100](https://www.cs.toronto.edu/~kriz/cifar.html) Dataset.
This is a subclass of the <cite>CIFAR10</cite> Dataset.
This is a subclass of the `CIFAR10` Dataset.
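Since it subclasses `CIFAR10`, it is constructed the same way (the `root` path here is illustrative):

```py
import torchvision

cifar100 = torchvision.datasets.CIFAR100(root='./data', train=True, download=True)
print(len(cifar100))   # 50000 training images, each labelled with one of 100 classes
```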
```py
class torchvision.datasets.STL10(root, split='train', transform=None, target_transform=None, download=False)
......@@ -368,7 +368,7 @@ __getitem__(index)¶
class torchvision.datasets.SVHN(root, split='train', transform=None, target_transform=None, download=False)
```
[SVHN](http://ufldl.stanford.edu/housenumbers/) Dataset. Note: The SVHN dataset assigns the label <cite>10</cite> to the digit <cite>0</cite>. However, in this Dataset, we assign the label <cite>0</cite> to the digit <cite>0</cite> to be compatible with PyTorch loss functions which expect the class labels to be in the range <cite>[0, C-1]</cite>
[SVHN](http://ufldl.stanford.edu/housenumbers/) Dataset. Note: The SVHN dataset assigns the label `10` to the digit `0`. However, in this Dataset, we assign the label `0` to the digit `0` to be compatible with PyTorch loss functions, which expect the class labels to be in the range `[0, C-1]`.
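A small sketch showing the remapped labels (the `root` path is illustrative):

```py
import torchvision

svhn = torchvision.datasets.SVHN(root='./data', split='train', download=True)
img, label = svhn[0]
assert 0 <= label <= 9   # the digit 0 is labelled 0, not 10
```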
| Parameters: |
......
......@@ -37,7 +37,7 @@ inception = models.inception_v3(pretrained=True)
```
Instancing a pre-trained model will download its weights to a cache directory. This directory can be set using the <cite>TORCH_MODEL_ZOO</cite> environment variable. See [`torch.utils.model_zoo.load_url()`](../model_zoo.html#torch.utils.model_zoo.load_url "torch.utils.model_zoo.load_url") for details.
Instantiating a pre-trained model will download its weights to a cache directory. This directory can be set using the `TORCH_MODEL_ZOO` environment variable. See [`torch.utils.model_zoo.load_url()`](../model_zoo.html#torch.utils.model_zoo.load_url "torch.utils.model_zoo.load_url") for details.
Some models use modules which have different training and evaluation behavior, such as batch normalization. To switch between these modes, use `model.train()` or `model.eval()` as appropriate. See [`train()`](../nn.html#torch.nn.Module.train "torch.nn.Module.train") or [`eval()`](../nn.html#torch.nn.Module.eval "torch.nn.Module.eval") for details.
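A hedged sketch of both points (the cache directory is an illustrative choice):

```py
import os
import torchvision.models as models

os.environ['TORCH_MODEL_ZOO'] = '/tmp/torch_models'  # where downloaded weights are cached

resnet18 = models.resnet18(pretrained=True)  # weights are fetched on first use
resnet18.eval()  # switch batch norm / dropout to evaluation behaviour for inference
```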
......
......@@ -504,7 +504,7 @@ Adjust hue of an image.
The image hue is adjusted by converting the image to HSV and cyclically shifting the intensities in the hue channel (H). The image is then converted back to original image mode.
<cite>hue_factor</cite> is the amount of shift in H channel and must be in the interval <cite>[-0.5, 0.5]</cite>.
`hue_factor` is the amount of shift in H channel and must be in the interval `[-0.5, 0.5]`.
See [https://en.wikipedia.org/wiki/Hue](https://en.wikipedia.org/wiki/Hue) for more details on Hue.
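For instance (the input file name is illustrative):

```py
from PIL import Image
import torchvision.transforms.functional as TF

img = Image.open('photo.jpg')
shifted = TF.adjust_hue(img, 0.2)   # hue_factor must lie in [-0.5, 0.5]
```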
......
......@@ -10,7 +10,7 @@ The numerical properties of a [`torch.dtype`](tensor_attributes.html#torch.torch
class torch.finfo
```
A [`torch.finfo`](#torch.torch.finfo "torch.torch.finfo") is an object that represents the numerical properties of a floating point [`torch.dtype`](tensor_attributes.html#torch.torch.dtype "torch.torch.dtype"), (i.e. <cite>torch.float32</cite>, <cite>torch.float64</cite>, and <cite>torch.float16</cite>). This is similar to [numpy.finfo](https://docs.scipy.org/doc/numpy/reference/generated/numpy.finfo.html).
A [`torch.finfo`](#torch.torch.finfo "torch.torch.finfo") is an object that represents the numerical properties of a floating point [`torch.dtype`](tensor_attributes.html#torch.torch.dtype "torch.torch.dtype") (i.e. `torch.float32`, `torch.float64`, and `torch.float16`). This is similar to [numpy.finfo](https://docs.scipy.org/doc/numpy/reference/generated/numpy.finfo.html).
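For example:

```py
import torch

fi = torch.finfo(torch.float32)
print(fi.bits, fi.eps, fi.max, fi.tiny)   # 32, machine epsilon, largest and smallest positive normal values
```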
A [`torch.finfo`](#torch.torch.finfo "torch.torch.finfo") provides the following attributes:
......@@ -31,7 +31,7 @@ The constructor of [`torch.finfo`](#torch.torch.finfo "torch.torch.finfo") can b
class torch.iinfo
```
A [`torch.iinfo`](#torch.torch.iinfo "torch.torch.iinfo") is an object that represents the numerical properties of a integer [`torch.dtype`](tensor_attributes.html#torch.torch.dtype "torch.torch.dtype") (i.e. <cite>torch.uint8</cite>, <cite>torch.int8</cite>, <cite>torch.int16</cite>, <cite>torch.int32</cite>, and <cite>torch.int64</cite>). This is similar to [numpy.iinfo](https://docs.scipy.org/doc/numpy/reference/generated/numpy.iinfo.html).
A [`torch.iinfo`](#torch.torch.iinfo "torch.torch.iinfo") is an object that represents the numerical properties of an integer [`torch.dtype`](tensor_attributes.html#torch.torch.dtype "torch.torch.dtype") (i.e. `torch.uint8`, `torch.int8`, `torch.int16`, `torch.int32`, and `torch.int64`). This is similar to [numpy.iinfo](https://docs.scipy.org/doc/numpy/reference/generated/numpy.iinfo.html).
A [`torch.iinfo`](#torch.torch.iinfo "torch.torch.iinfo") provides the following attributes:
......