Unverified commit 6b32bee0, authored by 飞龙, committed by GitHub

Merge pull request #322 from PEGASUS1993/master

Update 
# Frequently Asked Questions
## My model reports “cuda runtime error(2): out of memory”
As the error message suggests, you have run out of memory on your GPU. Since we often deal with large amounts of data in PyTorch, small mistakes can rapidly cause your program to use up all of your GPU memory; fortunately, the fixes in these cases are often simple. Here are a few common things to check:
**Don’t accumulate history across your training loop.** By default, computations involving variables that require gradients will keep history. This means that you should avoid using such variables in computations which will live beyond your training loops, e.g., when tracking statistics. Instead, you should detach the variable or access its underlying data.
Sometimes, it can be non-obvious when differentiable variables can occur. Consider the following training loop (abridged from [source](https://discuss.pytorch.org/t/high-memory-usage-while-training/162)):
```py
total_loss = 0
for i in range(10000):
    # ... (lines elided in the diff view)
    optimizer.step()
    total_loss += loss
```
Here, `total_loss` is accumulating history across your training loop, since `loss` is a differentiable variable with autograd history. You can fix this by writing `total_loss += float(loss)` instead.
Other instances of this problem: [1](https://discuss.pytorch.org/t/resolved-gpu-out-of-memory-error-with-batch-size-1/3719).
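The fix can be sketched end to end as follows. This is a minimal illustration, assuming a toy linear model, random data, and arbitrary hyperparameters; none of these come from the linked forum post. The only point is that the running total is kept as a Python float, so no autograd graph stays reachable between iterations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                        # hypothetical toy model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(8, 4)                     # hypothetical random data
targets = torch.randn(8, 1)

total_loss = 0.0
for i in range(100):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # float(loss) copies out the scalar value. Adding `loss` itself would turn
    # total_loss into a tensor that keeps every iteration's graph reachable.
    total_loss += float(loss)

print(type(total_loss).__name__, total_loss)
```

`loss.item()` is an equivalent way to extract the Python number.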
**Don’t hold onto tensors and variables you don’t need.** If you assign a Tensor or Variable to a local variable, Python will not deallocate it until the local goes out of scope. You can free this reference by using `del x`. Similarly, if you assign a Tensor or Variable to a member variable of an object, it will not be deallocated until the object goes out of scope. You will get the best memory usage if you don’t hold onto temporaries you don’t need.
The scopes of locals can be larger than you expect. For example:
```py
for i in range(5):
    # ... (loop body elided in the diff view)
output = h(result)
return output
```
Here, `intermediate` remains live even while `h` is executing, because its scope extends past the end of the loop. To free it earlier, you should `del intermediate` when you are done with it.
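The scoping behaviour of a loop variable such as `intermediate` can be observed without a GPU. The sketch below is an illustration only: `Buffer` and `f` are hypothetical stand-ins for a large tensor and the function that produces it, and a weak reference is used to check whether the object is still alive. The immediate release after `del` relies on CPython's reference counting.

```python
import weakref


class Buffer:
    """Hypothetical stand-in for a large tensor."""


def f(i):
    return Buffer()


def run_without_del():
    for i in range(5):
        intermediate = f(i)
    ref = weakref.ref(intermediate)
    # The loop has ended, but the name `intermediate` is still bound,
    # so the last Buffer is still alive at this point.
    return ref() is not None


def run_with_del():
    for i in range(5):
        intermediate = f(i)
    ref = weakref.ref(intermediate)
    del intermediate  # drop the last strong reference explicitly
    # In CPython the object is freed as soon as its reference count hits zero.
    return ref() is not None


alive_without_del = run_without_del()
alive_with_del = run_with_del()
print(alive_without_del, alive_with_del)
```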
**Don’t run RNNs on sequences that are too large.** The amount of memory required to backpropagate through an RNN scales linearly with the length of the RNN; thus, you will run out of memory if you try to feed an RNN a sequence that is too long.
The technical term for this phenomenon is [backpropagation through time](https://en.wikipedia.org/wiki/Backpropagation_through_time), and there are plenty of references for how to implement truncated BPTT, including in the [word language model](https://github.com/pytorch/examples/tree/master/word_language_model) example; truncation is handled by the `repackage` function as described in [this forum post](https://discuss.pytorch.org/t/help-clarifying-repackage-hidden-in-word-language-model/226).
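The idea behind truncated BPTT can be sketched as follows. This is not the `repackage` function from the word language model example; it is a minimal illustration of the same idea using `Tensor.detach()`, with an assumed toy `nn.RNN`, random data, a placeholder loss, and an arbitrary chunk length of 10.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=3, hidden_size=5, batch_first=True)  # toy model
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.01)
sequence = torch.randn(2, 40, 3)    # [batch, time, features], random data
chunk_len = 10                      # truncation length (assumed)

hidden = torch.zeros(1, 2, 5)       # [num_layers, batch, hidden_size]
for start in range(0, sequence.size(1), chunk_len):
    chunk = sequence[:, start:start + chunk_len]
    # Detach so that gradients stop at the chunk boundary; the graph built
    # for earlier chunks is no longer referenced and can be freed.
    hidden = hidden.detach()
    optimizer.zero_grad()
    output, hidden = rnn(chunk, hidden)
    loss = output.pow(2).mean()     # placeholder loss
    loss.backward()
    optimizer.step()

print(output.shape, hidden.shape)
```

With this structure, memory use is bounded by the chunk length rather than by the full sequence length.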
**Don’t use linear layers that are too large.** A linear layer `nn.Linear(m, n)` uses O(nm) memory: that is to say, the memory requirements of the weights scale quadratically with the number of features. It is very easy to [blow through your memory](https://github.com/pytorch/pytorch/issues/958) this way (and remember that you will need at least twice the size of the weights, since you also need to store the gradients).
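A back-of-the-envelope estimate makes the O(nm) cost concrete. The helper below is hypothetical (not part of PyTorch) and assumes 4-byte float32 parameters; it counts only the weights, the bias, and one gradient copy, and ignores activations and optimizer state.

```python
def linear_layer_bytes(m, n, bytes_per_element=4, with_gradients=True):
    """Rough memory estimate for nn.Linear(m, n): weights plus bias."""
    parameters = m * n + n                 # weight matrix [n x m] plus bias [n]
    copies = 2 if with_gradients else 1    # gradients take as much as the weights
    return parameters * bytes_per_element * copies


estimate = linear_layer_bytes(10_000, 10_000)
print(f"{estimate / 2**20:.1f} MiB")
```

Doubling both `m` and `n` roughly quadruples the estimate, which is the quadratic growth described above.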
## My GPU memory isn’t freed properly
PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in `nvidia-smi` usually don’t reflect the true memory usage. See [Memory management](cuda.html#cuda-memory-management) for more details about GPU memory management.
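The gap between memory used by tensors and memory held by the cache can be inspected from Python. The sketch below assumes a PyTorch version that provides `torch.cuda.memory_reserved()` (older releases called it `memory_cached()`); `cuda_memory_report` is a hypothetical helper name, and it returns `None` when no CUDA device is available.

```python
import torch


def cuda_memory_report():
    """Return allocator statistics in bytes, or None without a CUDA device."""
    if not torch.cuda.is_available():
        return None
    x = torch.empty(1024, 1024, device="cuda")   # about 4 MiB of float32
    report = {
        "allocated": torch.cuda.memory_allocated(),  # bytes held by live tensors
        "reserved": torch.cuda.memory_reserved(),    # bytes held by the cache
    }
    del x
    # The freed block goes back to PyTorch's cache, not to the driver,
    # so nvidia-smi still counts it as used until the cache is emptied.
    report["reserved_after_del"] = torch.cuda.memory_reserved()
    torch.cuda.empty_cache()                     # release unused cached blocks
    report["reserved_after_empty_cache"] = torch.cuda.memory_reserved()
    return report


report = cuda_memory_report()
print(report)
```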
If your GPU memory isn’t freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via `ps -elf | grep python` and manually kill them with `kill -9 [pid]`.
## My data loader workers return identical random numbers
You are likely using other libraries to generate random numbers in the dataset. For example, NumPy’s RNG is duplicated when worker subprocesses are started via `fork`. See [`torch.utils.data.DataLoader`](../data.html#torch.utils.data.DataLoader "torch.utils.data.DataLoader")’s documentation for how to properly set up random seeds in workers with its `worker_init_fn` option.
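One way to use `worker_init_fn` is sketched below. `RandomDataset` and `seed_numpy` are hypothetical names introduced for illustration; the sketch assumes that each worker's PyTorch seed differs (it is derived from a base seed and the worker id), so reseeding NumPy from `torch.initial_seed()` gives every worker its own NumPy stream.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    """Hypothetical dataset that draws from NumPy's global RNG."""

    def __len__(self):
        return 8

    def __getitem__(self, index):
        return int(np.random.randint(0, 1_000_000))


def seed_numpy(worker_id):
    # Inside a worker, torch.initial_seed() already differs per worker.
    # NumPy seeds must fit in 32 bits, hence the modulo.
    np.random.seed(torch.initial_seed() % 2**32)


if __name__ == "__main__":
    loader = DataLoader(RandomDataset(), batch_size=2, num_workers=2,
                        worker_init_fn=seed_numpy)
    for batch in loader:
        print(batch)
```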
## My recurrent network doesn’t work with data parallelism
There is a subtlety in using the `pack sequence -> recurrent network -> unpack sequence` pattern in a [`Module`](../nn.html#torch.nn.Module "torch.nn.Module") with [`DataParallel`](../nn.html#torch.nn.DataParallel "torch.nn.DataParallel") or [`data_parallel()`](../nn.html#torch.nn.parallel.data_parallel "torch.nn.parallel.data_parallel"). The input to `forward()` on each device will only be part of the entire input. Because the unpack operation [`torch.nn.utils.rnn.pad_packed_sequence()`](../nn.html#torch.nn.utils.rnn.pad_packed_sequence "torch.nn.utils.rnn.pad_packed_sequence") by default only pads up to the longest input it sees, i.e., the longest on that particular device, size mismatches will happen when results are gathered together. Therefore, you can instead take advantage of the `total_length` argument of [`pad_packed_sequence()`](../nn.html#torch.nn.utils.rnn.pad_packed_sequence "torch.nn.utils.rnn.pad_packed_sequence") to make sure that the `forward()` calls return sequences of the same length. For example, you can write:
```py
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
class MyModule(nn.Module):
    # ... __init__, other methods, etc.
    # padded_input is of shape [B x T x *] (batch_first mode) and contains
    # the sequences sorted by lengths
    #   B is the batch size
    #   T is max sequence length
    def forward(self, padded_input, input_lengths):
        total_length = padded_input.size(1)  # get the max sequence length
        packed_input = pack_padded_sequence(padded_input, input_lengths,
        # ... (lines elided in the diff view)
m = MyModule().cuda()
dp_m = nn.DataParallel(m)
```
Additionally, extra care needs to be taken when the batch dimension is dim `1` (i.e., `batch_first=False`) with data parallelism. In this case, the first argument of `pack_padded_sequence`, `padded_input`, will be of shape `[T x B x *]` and should be scattered along dim `1`, but the second argument, `input_lengths`, will be of shape `[B]` and should be scattered along dim `0`. Extra code to manipulate the tensor shapes will be needed.
# Windows FAQ
## Building from source
### Include optional components
There are two supported components for Windows PyTorch: MKL and MAGMA. Here are the steps to build with them.
```
REM Make sure you have 7z and curl installed.
REM ... (lines elided in the diff view)
set "MAGMA_HOME=%cd%\\magma"
```
### Speeding CUDA build for Windows
Visual Studio doesn’t currently support parallel custom tasks. As an alternative, we can use `Ninja` to parallelize CUDA build tasks. It can be used by typing only a few lines of code.
```
REM Let's install ninja first.
pip install ninja
REM Set it as the cmake generator
set CMAKE_GENERATOR=Ninja
```
### One key install script
You can take a look at [this set of scripts](https://github.com/peterjc123/pytorch-scripts). It will lead the way for you.
## Extension
### CFFI Extension
The support for CFFI Extension is very experimental. There are generally two steps to enable it under Windows.
First, specify additional `libraries` in the `Extension` object to make it build on Windows.
```py
ffi = create_extension(
    # ... (lines elided in the diff view)
    libraries=['ATen', '_C'] # Append cuda libraries when necessary, like cudart
)
```
Second, here is a workaround for “unresolved external symbol state caused by `extern THCState *state;`”:
Change the source code from C to C++. An example is listed below.
```cpp
#include <THC/THC.h>
// ... (lines elided in the diff view)
extern "C" int my_lib_add_backward_cuda(THCudaTensor *grad_output, THCudaTensor
    // ... (rest of the signature and body elided in the diff view)
    return 1;
}
```
### Cpp Extension
This type of extension has better support compared with the previous one. However, it still needs some manual configuration. First, you should open the **x86_x64 Cross Tools Command Prompt for VS 2017**. And then, you can open the Git-Bash in it. It is usually located in `C:\Program Files\Git\git-bash.exe`. Finally, you can start your compiling process.
## Installation
### Package not found in win-32 channel.
```
Solving environment: failed
...
Current channels:
...
  - https://repo.continuum.io/pkgs/msys2/noarch
```
PyTorch doesn’t work on 32-bit systems. Please use 64-bit versions of Windows and Python.
### Why are there no Python 2 packages for Windows?
Because it’s not stable enough. There are some issues that need to be solved before we officially release it. You can build it yourself.
### Import error
```
from torch._C import *
ImportError: DLL load failed: The specified module could not be found.
```
This problem is caused by missing essential files. The conda package includes almost all the essential files that PyTorch needs, except the VC2017 redistributable and some MKL libraries. You can resolve this by typing the following commands.
```
conda install -c peterjc123 vc vs2017_runtime
conda install mkl_fft intel_openmp numpy mkl
```
As for the wheels package, since we didn’t pack some libraries and the VS2017 redistributable files in, please make sure you install them manually. You can download the [VS 2017 redistributable installer](https://aka.ms/vs/15/release/VC_redist.x64.exe). You should also pay attention to your installation of NumPy: make sure it uses MKL instead of OpenBLAS. You may type in the following command.
```
pip install numpy mkl intel-openmp mkl_fft
```
Another possible cause is that you are using the GPU version without an NVIDIA graphics card. Please replace your GPU package with the CPU one.
```
from torch._C import *
ImportError: DLL load failed: The operating system cannot run %1.
```
This is actually an upstream issue of Anaconda. When you initialize your environment with the conda-forge channel, this issue will emerge. You may fix the intel-openmp libraries through this command.
```
conda install -c defaults intel-openmp -f
```
## Usage (multiprocessing)
### Multiprocessing error without if-clause protection
```
RuntimeError:
    ... (lines elided in the diff view)
    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
```
The implementation of `multiprocessing` is different on Windows, which uses `spawn` instead of `fork`. So we have to wrap the code with an if-clause to protect it from executing multiple times. Refactor your code into the following structure.
```py
import torch
def main():
    # ... (function body elided in the diff view)
    pass
if __name__ == '__main__':
    main()
```
### Multiprocessing error “Broken pipe”
```
ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe
```
This issue happens when the child process ends before the parent process finishes sending data. There may be something wrong with your code. You can debug your code by reducing the `num_workers` of [`DataLoader`](../data.html#torch.utils.data.DataLoader "torch.utils.data.DataLoader") to zero and seeing if the issue persists.
### Multiprocessing error “driver shut down”
```
Couldn't open shared file mapping: <torch_14808_1591070686>, error code: <1455> at torch\lib\TH\THAllocator.c:154
[windows] driver shut down
```
Please update your graphics driver. If the problem persists, your graphics card may be too old or the calculation may be too heavy for it. Please update the TDR settings according to this [post](https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/).
### CUDA IPC operations
```
THCudaCheck FAIL file=torch\csrc\generic\StorageSharing.cpp line=252 error=63 : OS call failed or operation not supported on this OS
```
They are not supported on Windows, so operations such as multiprocessing on CUDA tensors cannot succeed. There are two alternatives:
1\. Don’t use `multiprocessing`. Set the `num_workers` of [`DataLoader`](../data.html#torch.utils.data.DataLoader "torch.utils.data.DataLoader") to zero.
2\. Share CPU tensors instead. Make sure your custom `Dataset` returns CPU tensors.
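Both alternatives can be sketched together. `CpuDataset` is a hypothetical dataset introduced for illustration: it keeps its samples on the CPU, the loader runs in the main process with `num_workers=0`, and each batch is moved to the GPU (when one is available) only after it has been loaded.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class CpuDataset(Dataset):
    """Hypothetical dataset that keeps its samples on the CPU."""

    def __init__(self):
        self.samples = torch.arange(12, dtype=torch.float32).reshape(6, 2)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]   # a CPU tensor


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# num_workers=0 loads data in the main process, so nothing is shared
# between processes at all.
loader = DataLoader(CpuDataset(), batch_size=3, num_workers=0)
batches = [batch.to(device) for batch in loader]   # move to the GPU only here
print(len(batches), batches[0].shape)
```

If worker processes are needed, the same dataset can be used with `num_workers` greater than zero, since only CPU tensors cross the process boundary.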