## My model reports “cuda runtime error(2): out of memory”
As the error message suggests, you have run out of memory on your GPU. Since we often deal with large amounts of data in PyTorch, small mistakes can rapidly cause your program to use up all of your GPU memory; fortunately, the fixes in these cases are often simple. Here are a few common things to check:
**Don’t accumulate history across your training loop.** By default, computations involving variables that require gradients will keep history. This means that you should avoid using such variables in computations which will live beyond your training loops, e.g., when tracking statistics. Instead, you should detach the variable or access its underlying data.
Sometimes, it can be non-obvious when differentiable variables can occur. Consider the following training loop (abridged from [source](https://discuss.pytorch.org/t/high-memory-usage-while-training/162)):
```py
total_loss = 0
for i in range(10000):
    optimizer.zero_grad()
    output = model(input)
    loss = criterion(output)
    loss.backward()
    optimizer.step()
    total_loss += loss
```
Here, `total_loss` is accumulating history across your training loop, since `loss` is a differentiable variable with autograd history. You can fix this by writing `total_loss += float(loss)` instead.
**Don’t hold onto tensors and variables you don’t need.** If you assign a Tensor or Variable to a local, Python will not deallocate until the local goes out of scope. You can free this reference by using `del x`. Similarly, if you assign a Tensor or Variable to a member variable of an object, it will not deallocate until the object goes out of scope. You will get the best memory usage if you don’t hold onto temporaries you don’t need.
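This can be especially problematic if you have a loop like the following, where `f`, `g`, and `h` stand for arbitrary operations:

```py
for i in range(5):
    intermediate = f(input[i])
    result += g(intermediate)
output = h(result)
return output
```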
Here, `intermediate` remains live even while `h` is executing, because its scope extrudes past the end of the loop. To free it earlier, you should `del intermediate` when you are done with it.
**Don’t run RNNs on sequences that are too large.** The amount of memory required to backpropagate through an RNN scales linearly with the length of the RNN input; thus, you will run out of memory if you try to feed an RNN a sequence that is too long.
The technical term for this phenomenon is [backpropagation through time](https://en.wikipedia.org/wiki/Backpropagation_through_time), and there are plenty of references for how to implement truncated BPTT, including in the [word language model](https://github.com/pytorch/examples/tree/master/word_language_model) example; truncation is handled by the `repackage` function as described in [this forum post](https://discuss.pytorch.org/t/help-clarifying-repackage-hidden-in-word-language-model/226).
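For reference, that function detaches the hidden state from the graph so that backpropagation stops at the truncation boundary. A sketch along the lines of the example’s `repackage_hidden`:

```py
import torch

def repackage_hidden(h):
    """Wrap hidden states in new Tensors, detaching them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        # LSTMs return a tuple (h, c); detach each element recursively.
        return tuple(repackage_hidden(v) for v in h)
```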
**Don’t use linear layers that are too large.** A linear layer `nn.Linear(m, n)` uses $O(nm)$ memory: that is to say, the memory requirements of the weights scale quadratically with the number of features. It is very easy to [blow through your memory](https://github.com/pytorch/pytorch/issues/958) this way (and remember that you will need at least twice the size of the weights, since you also need to store the gradients).
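For a concrete sense of scale, here is a back-of-the-envelope sketch:

```py
import torch.nn as nn

layer = nn.Linear(10000, 10000)
params = sum(p.numel() for p in layer.parameters())
print(params)                  # 100010000: 10000*10000 weights + 10000 biases
print(params * 4 / 1e6, "MB")  # ~400 MB in float32; ~800 MB once gradients
                               # are stored too, before counting activations
```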
## My GPU memory isn’t freed properly
PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in `nvidia-smi` usually don’t reflect the true memory usage. See [Memory management](cuda.html#cuda-memory-management) for more details about GPU memory management.
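To see what the allocator is actually doing, you can query it directly (a small sketch; check your version’s docs for the exact helpers available):

```py
import torch

x = torch.randn(1024, 1024, device='cuda')
print(torch.cuda.memory_allocated())  # bytes held by live tensors (~4 MB here)
del x
print(torch.cuda.memory_allocated())  # 0: the tensor is gone...
# ...but the freed block stays in PyTorch's cache, so nvidia-smi still counts it.
torch.cuda.empty_cache()              # return unused cached blocks to the driver
```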
If your GPU memory isn’t freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via `ps -elf | grep python` and manually kill them with `kill -9 [pid]`.
## My data loader workers return identical random numbers
You are likely using other libraries to generate random numbers in the dataset. For example, NumPy’s RNG is duplicated when worker subprocesses are started via `fork`. See [`torch.utils.data.DataLoader`](../data.html#torch.utils.data.DataLoader "torch.utils.data.DataLoader")’s documentation for how to properly set up random seeds in workers with its `worker_init_fn` option.
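One common pattern is to reseed NumPy in each worker (a sketch; `dataset` stands for your own `Dataset`):

```py
import numpy as np
import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # PyTorch seeds each worker differently; derive the NumPy seed from that.
    np.random.seed(torch.initial_seed() % 2**32)

loader = DataLoader(dataset, num_workers=4, worker_init_fn=worker_init_fn)
```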
## My recurrent network doesn’t work with data parallelism

There is a subtlety in using the `pack sequence -> recurrent network -> unpack sequence` pattern in a [`Module`](../nn.html#torch.nn.Module "torch.nn.Module") with [`DataParallel`](../nn.html#torch.nn.DataParallel "torch.nn.DataParallel") or [`data_parallel()`](../nn.html#torch.nn.parallel.data_parallel "torch.nn.parallel.data_parallel"). The input to `forward()` on each device will only be part of the entire input. Because the unpack operation [`torch.nn.utils.rnn.pad_packed_sequence()`](../nn.html#torch.nn.utils.rnn.pad_packed_sequence "torch.nn.utils.rnn.pad_packed_sequence") by default only pads up to the longest input it sees, i.e., the longest on that particular device, size mismatches will happen when results are gathered together. Therefore, you can instead take advantage of the `total_length` argument of [`pad_packed_sequence()`](../nn.html#torch.nn.utils.rnn.pad_packed_sequence "torch.nn.utils.rnn.pad_packed_sequence") to make sure that the `forward()` calls return sequences of the same length. For example, you can write:
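```py
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class MyModule(nn.Module):
    # ... __init__ (defining self.my_lstm), other methods, etc.

    # padded_input is of shape [B x T x *] (batch_first mode) and contains
    # the sequences sorted by length; B is the batch size, T is the max
    # sequence length
    def forward(self, padded_input, input_lengths):
        total_length = padded_input.size(1)  # get the max sequence length
        packed_input = pack_padded_sequence(padded_input, input_lengths,
                                            batch_first=True)
        packed_output, _ = self.my_lstm(packed_input)
        output, _ = pad_packed_sequence(packed_output, batch_first=True,
                                        total_length=total_length)
        return output

m = MyModule().cuda()
dp_m = nn.DataParallel(m)
```

Here `my_lstm` stands for whatever recurrent module is defined in `__init__`; with `total_length` fixed, every device pads to the same length and the gather succeeds.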
Additionally, extra care needs to be taken when the batch dimension is dim `1` (i.e., `batch_first=False`) with data parallelism. In this case, the first argument of `pack_padded_sequence`, `padding_input`, will be of shape `[T x B x *]` and should be scattered along dim `1`, but the second argument `input_lengths` will be of shape `[B]` and should be scattered along dim `0`. Extra code to manipulate the tensor shapes will be needed.
### Include optional components

There are two supported components for Windows PyTorch: MKL and MAGMA. Here are the steps to build with them.
```
REM Make sure you have 7z and curl installed.
...
set "MAGMA_HOME=%cd%\magma"
```
### Speeding CUDA build for Windows
Visual Studio doesn’t currently support parallel custom tasks. As an alternative, we can use `Ninja` to parallelize CUDA build tasks. It can be enabled by typing only a few lines of code.
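A sketch of the setup (run in the same cmd shell you build from; `CMAKE_GENERATOR` is the variable the build scripts read):

```
REM Let's install ninja first.
pip install ninja

REM Set it as the cmake generator
set CMAKE_GENERATOR=Ninja
```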
### CFFI Extension

For CFFI extensions, here is a workaround for the “unresolved external symbol state caused by `extern THCState *state;`” error: change the source code from C to C++. An example is listed below.
```
#include <THC/THC.h>
#include <ATen/ATen.h>

// In C++, the state can be fetched from ATen instead of being declared
// as `extern THCState *state;`.
THCState *state = at::globalContext().thc_state;

...

extern "C" int my_lib_add_backward_cuda(THCudaTensor *grad_output, THCudaTensor *grad_input)
{
    ...
    return 1;
}
```
### Cpp Extension
This type of extension has better support compared with the previous one. However, it still needs some manual configuration. First, you should open the **x86_x64 Cross Tools Command Prompt for VS 2017**. Then, you can open Git-Bash in it; it is usually located at `C:\Program Files\Git\git-bash.exe`. Finally, you can start your compiling process.
### Package not found in win-32 channel

PyTorch doesn’t work on 32-bit systems. Please use 64-bit versions of Windows and Python.
### Why are there no Python 2 packages for Windows?
Because it’s not stable enough. There are some issues that need to be solved before we officially release it. You can build it by yourself.
### Import error
```
from torch._C import *

ImportError: DLL load failed: The specified module could not be found.
```
The problem is caused by missing essential files. Actually, we include almost all the essential files that PyTorch needs for the conda package, except the VC2017 redistributable and some MKL libraries. You can resolve this by installing them manually.
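For instance, something along these lines may restore the missing pieces (the channel and package names here are assumptions; check the install instructions for your build):

```
conda install -c peterjc123 vc vs2017_runtime
```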
As for the wheels package, since we didn’t pack some libraries and the VS2017 redistributable files in, please make sure you install them manually. The [VS 2017 redistributable installer](https://aka.ms/vs/15/release/VC_redist.x64.exe) can be downloaded. And you should also pay attention to your installation of NumPy: make sure it uses MKL instead of OpenBLAS. You may type in the following command.
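A sketch of such a command (exact package names may vary with your environment):

```
pip install numpy mkl intel-openmp mkl_fft
```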
Another possible cause is that you are using the GPU version on a machine without NVIDIA graphics cards. Please replace your GPU package with the CPU one.
```
from torch._C import *

ImportError: DLL load failed: The operating system cannot run %1.
```
This is actually an upstream issue of Anaconda. When you initialize your environment with the conda-forge channel, this issue will emerge. You may fix the intel-openmp libraries through this command.
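The command usually suggested for this is to reinstall `intel-openmp` from the defaults channel (verify against the current install notes):

```
conda install -c defaults intel-openmp -f
```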
### Multiprocessing error without if-clause protection
```
RuntimeError:
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
```
The implementation of `multiprocessing` is different on Windows, which uses `spawn` instead of `fork`. So we have to wrap the code with an if-clause to protect it from executing multiple times. Refactor your code into the following structure.
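A minimal sketch (assuming `dataset` is your own `Dataset`):

```py
import torch
from torch.utils.data import DataLoader

def main():
    # Everything that (directly or indirectly) spawns worker processes
    # must run under the if-clause below, never at module top level.
    dataloader = DataLoader(dataset, num_workers=2)
    for i, data in enumerate(dataloader):
        # do something here
        pass

if __name__ == '__main__':
    main()
```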
### Multiprocessing error "Broken pipe"

This issue happens when the child process ends before the parent process finishes sending data. There may be something wrong with your code. You can debug your code by reducing the `num_workers` of [`DataLoader`](../data.html#torch.utils.data.DataLoader "torch.utils.data.DataLoader") to zero and see if the issue persists.
### Multiprocessing error "driver shut down"

```
Couldn't open shared file mapping: <torch_14808_1591070686>, error code: <1455> at torch\lib\TH\THAllocator.c:154
[windows] driver shut down
```
Please update your graphics driver. If the issue persists, your graphics card may be too old or the computation may be too heavy for your card. Please update the TDR settings according to this [post](https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/).
### CUDA IPC operations

```
THCudaCheck FAIL file=torch\csrc\generic\StorageSharing.cpp line=252 error=63 : OS call failed or operation not supported on this OS
```
CUDA IPC operations are not supported on Windows, so something like doing multiprocessing on CUDA tensors cannot succeed. There are two alternatives:
1. Don’t use `multiprocessing`. Set the `num_workers` of [`DataLoader`](../data.html#torch.utils.data.DataLoader "torch.utils.data.DataLoader") to zero.
2. Share CPU tensors instead. Make sure your custom `DataSet` returns CPU tensors, as in the sketch below.
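A minimal sketch of the second alternative (a hypothetical dataset that keeps its storage on the CPU):

```py
import torch
from torch.utils.data import Dataset

class CpuTensorDataset(Dataset):
    def __init__(self, data):
        # Keep the backing storage on the CPU so workers can share it;
        # move batches to the GPU inside the training loop instead.
        self.data = data.cpu()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]  # a CPU tensor, safe with multiprocessing
```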