2019-02-22 13:48:41

5df01738 · wizardforcel · 48de6ef7 · 5df01738 · 5df01738 · 5df01738
33 changed files
--- a/docs/0.4/1.md
+++ b/docs/0.4/1.md
@@ -21,7 +21,7 @@

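 If any single input to an operation requires gradient, its output will also require gradient. Conversely, the output will not require gradient only if none of the inputs require it. Backward computation is never performed in subgraphs where no variable requires gradients.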

-```
+```py
 >>> x = Variable(torch.randn(5, 5))
 >>> y = Variable(torch.randn(5, 5))
 >>> z = Variable(torch.randn(5, 5), requires_grad=True)
@@ -37,7 +37,7 @@ True

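 For example, to fine-tune a pretrained `CNN`, it is enough to switch the `requires_grad` flags in the frozen base. No intermediate buffers are saved until the computation reaches the last layer, where the affine transform uses weights that require gradient, so the network output requires gradient as well.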

-```
+```py
 model = torchvision.models.resnet18(pretrained=True)
 for param in model.parameters():
    param.requires_grad = False

--- a/docs/0.4/10.md
+++ b/docs/0.4/10.md
--- a/docs/0.4/11.md
+++ b/docs/0.4/11.md
@@ -21,7 +21,7 @@ Torch定义了七种CPU张量类型和八种GPU张量类型：

 A tensor can be constructed from a Python `list` or sequence:

-```
+```py
 >>> torch.FloatTensor([[1, 2, 3], [4, 5, 6]])
 1 2 3
 4 5 6
@@ -30,7 +30,7 @@ Torch定义了七种CPU张量类型和八种GPU张量类型：

 An empty tensor can be constructed by specifying its size:

-```
+```py
 >>> torch.IntTensor(2, 4).zero_()
 0 0 0 0
 0 0 0 0
@@ -39,7 +39,7 @@ Torch定义了七种CPU张量类型和八种GPU张量类型：

 The contents of a tensor can be accessed and modified using Python's indexing and slicing notation:

-```
+```py
 >>> x = torch.FloatTensor([[1, 2, 3], [4, 5, 6]])
 >>> print(x[1][2])
 6.0
@@ -54,7 +54,7 @@ Torch定义了七种CPU张量类型和八种GPU张量类型：

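 > Note: Methods that mutate a tensor are marked with an underscore suffix. For example, `torch.FloatTensor.abs_()` computes the absolute value in place and returns the modified tensor, while `torch.FloatTensor.abs()` computes the result in a new tensor.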

-```
+```py
 class torch.Tensor
 class torch.Tensor(*sizes)
 class torch.Tensor(size)
@@ -319,7 +319,7 @@ class torch.Tensor(storage)

 Returns the size in bytes of an individual element. Example:

-```
+```py
 >>> torch.FloatTensor().element_size()
 4
 >>> torch.ByteTensor().element_size()
@@ -356,7 +356,7 @@ class torch.Tensor(storage)

 Example:

-```
+```py
 >>> x = torch.Tensor([[1], [2], [3]])
 >>> x.size()
 torch.Size([3, 1])
@@ -372,7 +372,7 @@ torch.Size([3, 1])

 Expands this tensor to the size of the given tensor. This is equivalent to:

-```
+```py
 self.expand(tensor.size())
 ```

@@ -480,7 +480,7 @@ self.expand(tensor.size())

 Example:

-```
+```py
 >>> x = torch.Tensor([[1, 1, 1], [1, 1, 1], [1, 1, 1]])
 >>> t = torch.Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
 >>> index = torch.LongTensor([0, 2, 1])
@@ -504,7 +504,7 @@ self.expand(tensor.size())

 Example:

-```
+```py
 >>> x = torch.Tensor(3, 3)
 >>> t = torch.Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
 >>> index = torch.LongTensor([0, 2, 1])
@@ -528,7 +528,7 @@ self.expand(tensor.size())

 Example:

-```
+```py
 >>> x = torch.Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
 >>> index = torch.LongTensor([0, 2])
 >>> x.index_fill_(0, index, -1)
@@ -623,7 +623,7 @@ self.expand(tensor.size())

 Applies `callable` to each element of this tensor and of the given tensor, and stores the results in this tensor. `callable` should have the following signature:

-```
+```py
 def callable(a, b) -> number
 ```

@@ -703,7 +703,7 @@ def callable(a, b) -> number

 Example:

-```
+```py
 >>> x = torch.Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
 >>> x.narrow(0, 0, 2)
 1  2  3
@@ -782,7 +782,7 @@ def callable(a, b) -> number

 Example:

-```
+```py
 >>> x = torch.randn(2, 3, 5)
 >>> x.size()
 torch.Size([2, 3, 5])
@@ -864,7 +864,7 @@ torch.Size([5, 2, 3])

 Example:

-```
+```py
 >>> x = torch.Tensor([1, 2, 3])
 >>> x.repeat(4, 2)
 1  2  3  1  2  3
@@ -886,7 +886,7 @@ torch.Size([4, 2, 3])

 Example:

-```
+```py
 >>> x = torch.Tensor([[1, 2], [3, 4], [5, 6]])
 >>> x.resize_(2, 2)
 >>> x
@@ -899,7 +899,7 @@ torch.Size([4, 2, 3])

 Resizes the current tensor to the same size as the specified tensor. This is equivalent to:

-```
+```py
 self.resize_(tensor.size())
 ```

@@ -934,7 +934,7 @@ self.resize_(tensor.size())

 Example:

-```
+```py
 >>> x = torch.rand(2, 5)
 >>> x

@@ -1025,7 +1025,7 @@ self.resize_(tensor.size())

 Example:

-```
+```py
 >>> torch.Tensor(3, 4, 5).size()
 torch.Size([3, 4, 5])
 ```
@@ -1066,7 +1066,7 @@ torch.Size([3, 4, 5])

 Returns this tensor's offset in the underlying storage, expressed as a number of storage elements. Example:

-```
+```py
 >>> x = torch.Tensor([1, 2, 3, 4, 5])
 >>> x.storage_offset()
 0
@@ -1185,7 +1185,7 @@ torch.Size([3, 4, 5])

 Converts this tensor to the type of the given tensor. This is a no-op if the tensor is already of the correct type. It is equivalent to:

-```
+```py
 self.type(tensor.type())
 ```

@@ -1207,7 +1207,7 @@ self.type(tensor.type())

 Example:

-```
+```py
 >>> x = torch.arange(1, 8)
 >>> x

@@ -1260,7 +1260,7 @@ self.type(tensor.type())

 Example:

-```
+```py
 >>> x = torch.randn(4, 4)
 >>> x.size()
 torch.Size([4, 4])
@@ -1276,7 +1276,7 @@ torch.Size([2, 8])

 Returns this tensor viewed as the same size as the given tensor. This is equivalent to:

-```
+```py
 self.view(tensor.size())
 ```


--- a/docs/0.4/12.md
+++ b/docs/0.4/12.md
@@ -25,7 +25,7 @@

 Usage:

-```
+```py
 >>> x = torch.Tensor([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
 >>> print(x.type())
 torch.FloatTensor
@@ -43,7 +43,7 @@ torch.FloatTensor

 From a string:

-```
+```py
 >>> torch.device('cuda:0')
 device(type='cuda', index=0)

@@ -56,7 +56,7 @@ device(type='cuda')

 From a string and a device ordinal:

-```
+```py
 >>> torch.device('cuda', 0)
 device(type='cuda', index=0)

@@ -67,13 +67,13 @@ device(type='cpu', index=0)
 > **Note**
 > A `torch.device` argument in a function can generally be replaced by a string, which allows fast prototyping of code.
 > 
-> ```
+> ```py
 > >>> # Example of a function that takes in a torch.device
 > >>> cuda1 = torch.device('cuda:1')
 > >>> torch.randn((2,3), device=cuda1)
 > ```
 > 
-> ```
+> ```py
 > >>> # You can substitute the torch.device with a string
 > >>> torch.randn((2,3), device='cuda:1')
 > ```
@@ -83,7 +83,7 @@ device(type='cpu', index=0)
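 > **Note**
 > For legacy reasons, a device can be constructed from a single device ordinal, which is treated as a `cuda` device. This matches `Tensor.get_device()`, which returns an ordinal for `cuda` tensors and is not supported for `cpu` tensors.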
 > 
-> ```
+> ```py
 > >>> torch.device(1)
 > device(type='cuda', index=1)
 > ```
@@ -93,7 +93,7 @@ device(type='cpu', index=0)
 > **Note**
 > Methods that take a device generally accept a (properly formatted) string or a (legacy) integer device ordinal, so all of the following are equivalent:
 > 
-> ```
+> ```py
 > >>> torch.randn((2,3), device=torch.device('cuda:1'))
 > >>> torch.randn((2,3), device='cuda:1')
 > >>> torch.randn((2,3), device=1)  # legacy
@@ -107,7 +107,7 @@ device(type='cpu', index=0)

 Example:

-```
+```py
 >>> x = torch.Tensor([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
 >>> x.stride()
 (5, 1)

--- a/docs/0.4/13.md
+++ b/docs/0.4/13.md
@@ -8,7 +8,7 @@ torch支持COO（rdinate）格式的稀疏张量，可以有效地存储和处

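 A sparse tensor is represented as a pair of dense tensors: a tensor of values and a 2D tensor of indices. A sparse tensor can be constructed by providing these two tensors together with the size of the sparse tensor (which cannot be inferred from them!). Suppose we want to define a sparse tensor with the entry 3 at location (0, 2), the entry 4 at location (1, 0), and the entry 5 at location (1, 2). We would write: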

-```
+```py
 >>> i = torch.LongTensor([[0, 1, 1],
                          [2, 0, 2]])
 >>> v = torch.FloatTensor([3, 4, 5])
@@ -20,7 +20,7 @@ torch支持COO（rdinate）格式的稀疏张量，可以有效地存储和处

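 Note that the input to LongTensor is not a list of index tuples. If you want to write your indices this way, transpose them before passing them to the sparse constructor: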

-```
+```py
 >>> i = torch.LongTensor([[0, 2], [1, 0], [1, 2]])
 >>> v = torch.FloatTensor([3,      4,      5    ])
 >>> torch.sparse.FloatTensor(i.t(), v, torch.Size([2,3])).to_dense()
@@ -31,7 +31,7 @@ torch支持COO（rdinate）格式的稀疏张量，可以有效地存储和处

 You can also construct hybrid sparse tensors, where only the first n dimensions are sparse and the remaining dimensions are dense.

-```
+```py
 >>> i = torch.LongTensor([[2, 4]])
 >>> v = torch.FloatTensor([[1, 3], [5, 7]])
 >>> torch.sparse.FloatTensor(i, v).to_dense()
@@ -45,7 +45,7 @@ torch支持COO（rdinate）格式的稀疏张量，可以有效地存储和处

 An empty sparse tensor can be constructed by specifying its size:

-```
+```py
 print(torch.sparse.FloatTensor(2, 3))
 # FloatTensor of size 2x3 with indices:
 # [torch.LongTensor with no dimension]

--- a/docs/0.4/14.md
+++ b/docs/0.4/14.md
@@ -14,25 +14,25 @@

 [CUDA semantics](http://pytorch.org/docs/master/notes/cuda.html#cuda-semantics) has more details about working with CUDA.

-```
+```py
 torch.cuda.current_blas_handle()
 ```

 Returns a cublasHandle_t pointer to the current cuBLAS handle.

-```
+```py
 torch.cuda.current_device()
 ```

 Returns the index of the currently selected device.

-```
+```py
 torch.cuda.current_stream()
 ```

 Returns the currently selected `Stream`.

-```
+```py
 class torch.cuda.device(idx)
 ```

@@ -42,13 +42,13 @@ class torch.cuda.device(idx)

 *   idx (int) – device index to select. It is a no-op if this argument is negative.

-```
+```py
 torch.cuda.device_count()
 ```

 Returns the number of GPUs available.

-```
+```py
 class torch.cuda.device_of(obj)
 ```

@@ -60,13 +60,13 @@ class torch.cuda.device_of(obj)

 *   obj (Tensor or Storage) – object allocated on the selected device.

-```
+```py
 torch.cuda.is_available()
 ```

 Returns a bool indicating whether CUDA is currently available.

-```
+```py
 torch.cuda.set_device(device)
 ```

@@ -78,7 +78,7 @@ torch.cuda.set_device(device)

 *   device (int) – selected device. This function is a no-op if this argument is negative.

-```
+```py
 torch.cuda.stream(stream)
 ```

@@ -90,7 +90,7 @@ torch.cuda.stream(stream)

 *   stream (Stream) – selected stream. This manager is a no-op if it is `None`.

-```
+```py
 torch.cuda.synchronize()
 ```

@@ -98,7 +98,7 @@ torch.cuda.synchronize()

 ### Communication collectives

-```
+```py
 torch.cuda.comm.broadcast(tensor, devices)
 ```

@@ -111,7 +111,7 @@ torch.cuda.comm.broadcast(tensor, devices)

 Returns: a tuple containing copies of the tensor, placed on the devices corresponding to the given indices.

-```
+```py
 torch.cuda.comm.reduce_add(inputs, destination=None)
 ```

@@ -126,7 +126,7 @@ torch.cuda.comm.reduce_add(inputs, destination=None)

 Returns: a tensor containing the element-wise sum of all inputs, placed on the `destination` device.

-```
+```py
 torch.cuda.comm.scatter(tensor, devices, chunk_sizes=None, dim=0, streams=None)
 ```

@@ -141,7 +141,7 @@ torch.cuda.comm.scatter(tensor, devices, chunk_sizes=None, dim=0, streams=None)

 Returns: a tuple containing chunks of `tensor`, spread across the given `devices`.

-```
+```py
 torch.cuda.comm.gather(tensors, dim=0, destination=None)
 ```

@@ -159,7 +159,7 @@ torch.cuda.comm.gather(tensors, dim=0, destination=None)

 ## Streams and events

-```
+```py
 class torch.cuda.Stream
 ```

@@ -200,7 +200,7 @@ CUDA流的包装。

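    All future work submitted to this stream will wait until all kernels submitted to the given stream at the time of the call have completed.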

-```
+```py
 class torch.cuda.Event(enable_timing=False, blocking=False, interprocess=False, _handle=None)
 ```

@@ -240,7 +240,7 @@ CUDA事件的包装。

    ## NVIDIA Tools Extension (NVTX)

-    ```
+    ```py
    torch.cuda.nvtx.mark(msg)
    ```

@@ -248,7 +248,7 @@ CUDA事件的包装。

    *   msg (string) – ASCII message to associate with the event.

-    ```
+    ```py
    torch.cuda.nvtx.range_push(msg)
    ```

@@ -256,7 +256,7 @@ CUDA事件的包装。

    *   msg (string) – ASCII message to associate with the range.

-        ```
+        ```py
        torch.cuda.nvtx.range_pop()
        ```


--- a/docs/0.4/15.md
+++ b/docs/0.4/15.md
@@ -6,7 +6,7 @@

 `torch.Storage` is a contiguous, one-dimensional array of a single data type. Every `torch.Tensor` has a corresponding storage of the same data type.

-```
+```py
 class torch.FloatStorage
 ```


--- a/docs/0.4/16.md
+++ b/docs/0.4/16.md
@@ -31,7 +31,7 @@

 `Modules` can also contain other modules, allowing them to be nested in a tree structure. You can assign submodules as regular attributes:

-```
+```py
 import torch.nn as nn
 import torch.nn.functional as F

@@ -52,7 +52,7 @@ class Model(nn.Module):

 Adds a child module to the current module. The module can be accessed as an attribute using the given name. Example:

-```
+```py
 import torch.nn as nn
 class Model(nn.Module):
    def __init__(self):
@@ -65,7 +65,7 @@ print(model.conv)

 Output:

-```
+```py
 Conv2d(10, 20, kernel_size=(4, 4), stride=(1, 1))
 ```

@@ -73,7 +73,7 @@ Conv2d(10, 20, kernel_size=(4, 4), stride=(1, 1))

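 Applies `fn` recursively to every submodule (as returned by `.children()`) as well as to the module itself. Typical use includes initializing the parameters of a model (see also `torch-nn-init`). For example: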

-```
+```py
 >>> def init_weights(m):
 >>>     print(m)
 >>>     if type(m) == nn.Linear:
@@ -150,7 +150,7 @@ Sequential (

 > NOTE: Duplicate modules are returned only once. In the following example, `l` will be returned only once.

-```
+```py
 >>> l = nn.Linear(2, 2)
 >>> net = nn.Sequential(l, l)
 >>> for idx, m in enumerate(net.modules()):
@@ -168,7 +168,7 @@ Sequential (

 Example:

-```
+```py
 >>> for name, module in model.named_children():
 >>>     if name in ['conv4', 'conv5']:
 >>>         print(module)
@@ -180,7 +180,7 @@ Sequential (

 > Note: Duplicate modules are returned only once. In the following example, `l` will be returned only once.
 > 
-> ```
+> ```py
 > >>> l = nn.Linear(2, 2)
 > >>> net = nn.Sequential(l, l)
 > >>> for idx, m in enumerate(net.named_modules()):
@@ -196,7 +196,7 @@ Sequential (
 > 
 > Returns an iterator over module parameters, yielding both the name of the parameter and the parameter itself. For example:
 > 
-> ```
+> ```py
 > >>> for name, param in self.named_parameters():
 > >>>    if name in ['bias']:
 > >>>        print(param.size())
@@ -208,7 +208,7 @@ Sequential (

 Example:

-```
+```py
 for param in model.parameters():
    print(type(param.data), param.size())

@@ -222,7 +222,7 @@ for param in model.parameters():

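 The hook is called every time gradients with respect to the module inputs are computed. The hook should have the following signature: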

-```
+```py
 hook(module, grad_input, grad_output) -> Variable or None
 ```

@@ -240,7 +240,7 @@ hook(module, grad_input, grad_output) -> Variable or None

 Example:

-```
+```py
 self.register_buffer('running_mean', torch.zeros(num_features))
 ```

@@ -248,7 +248,7 @@ self.register_buffer('running_mean', torch.zeros(num_features))

 Registers a `forward hook` on the module. The `hook` is called every time `forward()` computes an output. It should have the following signature:

-```
+```py
 hook(module, input, output) -> None
 ```

@@ -268,7 +268,7 @@ hook(module, input, output) -> None

 Example:

-```
+```py
 module.state_dict().keys()
 # ['bias', 'weight'] 
 ```
@@ -289,7 +289,7 @@ module.state_dict().keys()

 To make this easier to understand, here is a small example:

-```
+```py
 # Example of using Sequential

 model = nn.Sequential(
@@ -319,7 +319,7 @@ model = nn.Sequential(OrderedDict([

 Example:

-```
+```py
 class MyModule(nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
@@ -360,7 +360,7 @@ ParameterList可以像普通Python列表一样进行索引，但是它包含的

 Example:

-```
+```py
 class MyModule(nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
@@ -428,7 +428,7 @@ $$L_{out}=floor((L_{in}+2*padding-dilation*(kernel_size-1)-1)/stride+1)$$

 **example:**

-```
+```py
 >>> m = nn.Conv1d(16, 33, 3, stride=2)
 >>> input = autograd.Variable(torch.randn(20, 16, 50))
 >>> output = m(input)
@@ -471,7 +471,7 @@ bias(`tensor`) - 卷积的偏置系数，大小是（`out_channel`）

 Examples:

-```
+```py
 >>> # With square kernels and equal stride
 >>> m = nn.Conv2d(16, 33, 3, stride=2)
 >>> # non-square kernels and unequal stride and with padding
@@ -515,7 +515,7 @@ $$out(N_i, C_{out_j})=bias(C_{out_j})+\sum^{C_{in}-1}_{k=0}weight(C_{out_j},k)\b

 Examples:

-```
+```py
 >>> # With square kernels and equal stride
 >>> m = nn.Conv3d(16, 33, 3, stride=2)
 >>> # non-square kernels and unequal stride and with padding
@@ -588,7 +588,7 @@ Examples:

 **Example**

-```
+```py
 >>> # With square kernels and equal stride
 >>> m = nn.ConvTranspose2d(16, 33, 3, stride=2)
 >>> # non-square kernels and unequal stride and with padding
@@ -646,7 +646,7 @@ torch.Size([1, 16, 12, 12])

 **Example**

-```
+```py
 >>> # With square kernels and equal stride
 >>> m = nn.ConvTranspose3d(16, 33, 3, stride=2)
 >>> # non-square kernels and unequal stride and with padding
@@ -683,7 +683,7 @@ $$L_{out}=floor((L_{in} + 2*padding - dilation*(kernel_size - 1) - 1)/stride + 1

 **example:**

-```
+```py
 >>> # pool of size=3, stride=2
 >>> m = nn.MaxPool1d(3, stride=2)
 >>> input = autograd.Variable(torch.randn(20, 16, 50))
@@ -720,7 +720,7 @@ $$W_{out}=floor((W_{in} + 2*padding[1] - dilation[1]*(kernel_size[1] - 1) - 1)/s

 **example:**

-```
+```py
 >>> # pool of square window of size=3, stride=2
 >>> m = nn.MaxPool2d(3, stride=2)
 >>> # pool of non-square window
@@ -763,7 +763,7 @@ $$W_{out}=floor((W_{in} + 2*padding[2] - dilation[2]*(kernel_size[2] - 1) - 1)/s

 **example:**

-```
+```py
 >>> # pool of square window of size=3, stride=2
 >>> m = nn.MaxPool3d(3, stride=2)
 >>> # pool of non-square window
@@ -796,7 +796,7 @@ $$H_{out}=(H_{in}-1)*stride[0]-2*padding[0]+kernel_size[0]$$

 **Example：**

-```
+```py
 >>> pool = nn.MaxPool1d(2, stride=2, return_indices=True)
 >>> unpool = nn.MaxUnpool1d(2, stride=2)
 >>> input = Variable(torch.Tensor([[[1, 2, 3, 4, 5, 6, 7, 8]]]))
@@ -852,7 +852,7 @@ $$W_{out}=(W_{in}-1)*stride[1]-2*padding[1]+kernel_size[1]$$

 **Example：**

-```
+```py
 >>> pool = nn.MaxPool2d(2, stride=2, return_indices=True)
 >>> unpool = nn.MaxUnpool2d(2, stride=2)
 >>> input = Variable(torch.Tensor([[[[ 1,  2,  3,  4],
@@ -910,7 +910,7 @@ H_{out}=(H_{in}-1)_stride[1]-2_padding[0]+kernel_size[1]\ W_{out}=(W_{in}-1)_str

 **Example：**

-```
+```py
 >>> # pool of square window of size=3, stride=2
 >>> pool = nn.MaxPool3d(3, stride=2, return_indices=True)
 >>> unpool = nn.MaxUnpool3d(3, stride=2)
@@ -942,7 +942,7 @@ $$L_{out}=floor((L_{in}+2*padding-kernel_size)/stride+1)$$

 **Example:**

-```
+```py
 >>> # pool with window of size=3, stride=2
 >>> m = nn.AvgPool1d(3, stride=2)
 >>> m(Variable(torch.Tensor([[[1,2,3,4,5,6,7]]])))
@@ -977,7 +977,7 @@ W_{out}=floor((W_{in}+2*padding[1]-kernel_size[1])/stride[1]+1) \end{aligned} $$

 **Example:**

-```
+```py
 >>> # pool of square window of size=3, stride=2
 >>> m = nn.AvgPool2d(3, stride=2)
 >>> # pool of non-square window
@@ -1006,7 +1006,7 @@ W_{out}=floor((W_{in}+2*padding[2]-kernel_size[2])/stride[2]+1)

 **Example:**

-```
+```py
 >>> # pool of square window of size=3, stride=2
 >>> m = nn.AvgPool3d(3, stride=2)
 >>> # pool of non-square window
@@ -1028,7 +1028,7 @@ W_{out}=floor((W_{in}+2*padding[2]-kernel_size[2])/stride[2]+1)

 **Example：**

-```
+```py
 >>> # pool of square window of size=3, and target output size 13x12
 >>> m = nn.FractionalMaxPool2d(3, output_size=(13, 12))
 >>> # pool of square window and target output size being half of input image size
@@ -1064,7 +1064,7 @@ $$f(x)=pow(sum(X,p),1/p)$$

 **Example:**

-```
+```py
 >>> # power-2 pool of square window of size=3, stride=2
 >>> m = nn.LPPool2d(2, 3, stride=2)
 >>> # pool of non-square window of power 1.2
@@ -1084,7 +1084,7 @@ $$f(x)=pow(sum(X,p),1/p)$$

 **Example：**

-```
+```py
 >>> # target output size of 5
 >>> m = nn.AdaptiveMaxPool1d(5)
 >>> input = autograd.Variable(torch.randn(1, 64, 8))
@@ -1102,7 +1102,7 @@ $$f(x)=pow(sum(X,p),1/p)$$

 **Example：**

-```
+```py
 >>> # target output size of 5x7
 >>> m = nn.AdaptiveMaxPool2d((5,7))
 >>> input = autograd.Variable(torch.randn(1, 64, 8, 9))
@@ -1122,7 +1122,7 @@ $$f(x)=pow(sum(X,p),1/p)$$

 **Example：**

-```
+```py
 >>> # target output size of 5
 >>> m = nn.AdaptiveAvgPool1d(5)
 >>> input = autograd.Variable(torch.randn(1, 64, 8))
@@ -1139,7 +1139,7 @@ $$f(x)=pow(sum(X,p),1/p)$$

 **Example：**

-```
+```py
 >>> # target output size of 5x7
 >>> m = nn.AdaptiveAvgPool2d((5,7))
 >>> input = autograd.Variable(torch.randn(1, 64, 8, 9))
@@ -1164,7 +1164,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.ReLU()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1184,7 +1184,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.ReLU6()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1202,7 +1202,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.ELU()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1227,7 +1227,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.PReLU()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1250,7 +1250,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.LeakyReLU(0.1)
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1276,7 +1276,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.Threshold(0.1, 20)
 >>> input = Variable(torch.randn(2))
 >>> print(input)
@@ -1304,7 +1304,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.Hardtanh()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1324,7 +1324,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.Sigmoid()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1344,7 +1344,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.Tanh()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1362,7 +1362,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.LogSigmoid()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1391,7 +1391,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.Softplus()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1415,7 +1415,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.Softshrink()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1433,7 +1433,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.Softsign()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1453,7 +1453,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.Tanhshrink()
 >>> input = autograd.Variable(torch.randn(2))
 >>> print(input)
@@ -1473,7 +1473,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.Softmin()
 >>> input = autograd.Variable(torch.randn(2, 3))
 >>> print(input)
@@ -1497,7 +1497,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.Softmax()
 >>> input = autograd.Variable(torch.randn(2, 3))
 >>> print(input)
@@ -1517,7 +1517,7 @@ shape：

 Example:

-```
+```py
 >>> m = nn.LogSoftmax()
 >>> input = autograd.Variable(torch.randn(2, 3))
 >>> print(input)
@@ -1552,7 +1552,7 @@ $$ y = \frac{x - mean[x]}{ \sqrt{Var[x]} + \epsilon} * gamma + beta $$

 **Example**

-```
+```py
 >>> # With Learnable Parameters
 >>> m = nn.BatchNorm1d(100)
 >>> # Without Learnable Parameters
@@ -1589,7 +1589,7 @@ $$ y = \frac{x - mean[x]}{ \sqrt{Var[x]} + \epsilon} * gamma + beta $$

 **Example**

-```
+```py
 >>> # With Learnable Parameters
 >>> m = nn.BatchNorm2d(100)
 >>> # Without Learnable Parameters
@@ -1626,7 +1626,7 @@ $$ y = \frac{x - mean[x]}{ \sqrt{Var[x]} + \epsilon} * gamma + beta $$

 **Example**

-```
+```py
 >>> # With Learnable Parameters
 >>> m = nn.BatchNorm3d(100)
 >>> # Without Learnable Parameters
@@ -1685,7 +1685,7 @@ $$ y = \frac{x - mean[x]}{ \sqrt{Var[x]} + \epsilon} * gamma + beta $$

 Example:

-```
+```py
 rnn = nn.RNN(10, 20, 2)
 input = Variable(torch.randn(5, 3, 10))
 h0 = Variable(torch.randn(2, 3, 20))

--- a/docs/0.4/17.md
+++ b/docs/0.4/17.md
--- a/docs/0.4/18.md
+++ b/docs/0.4/18.md
@@ -117,7 +117,7 @@ grad_outputs应该是output 包含每个输出的预先计算的梯度的长度

 This `hook` is called every time `gradients` are computed. The `hook` should have the following signature:

-```
+```py
 hook(grad) -> Variable or None
 ```

@@ -127,7 +127,7 @@ hook(grad) -> Variable or None

 Example:

-```
+```py
 >>> v = Variable(torch.Tensor([0, 0, 0]), requires_grad=True)
 >>> h = v.register_hook(lambda grad: grad * 2)  # double the gradient
 >>> v.backward(torch.Tensor([1, 1, 1]))

--- a/docs/0.4/19.md
+++ b/docs/0.4/19.md
@@ -25,7 +25,7 @@

 Example:

-```
+```py
 optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9)
 optimizer = optim.Adam([var1, var2], lr = 0.0001)
 ```
@@ -40,7 +40,7 @@ optimizer = optim.Adam([var1, var2], lr = 0.0001)

 For example, this is very useful when you want to specify per-layer learning rates:

-```
+```py
 optim.SGD([
            {'params': model.base.parameters()},
            {'params': model.classifier.parameters(), 'lr': 1e-3}
@@ -59,7 +59,7 @@ optim.SGD([

 Example:

-```
+```py
 for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
@@ -74,7 +74,7 @@ for input, target in dataset:

 Example:

-```
+```py
 for input, target in dataset:
    def closure():
        optimizer.zero_grad()
@@ -87,7 +87,7 @@ for input, target in dataset:

 #### Algorithms

-```
+```py
 class torch.optim.Optimizer(params, defaults)
 ```

@@ -98,7 +98,7 @@ class torch.optim.Optimizer(params, defaults)
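 1.  params (iterable) – an iterable of `Variable`s or `dict`s. Specifies which variables should be optimized.
 2.  defaults (dict) – a dict containing default values of optimization options (used when a parameter group does not specify them).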

-```
+```py
 load_state_dict(state_dict)
 ```

@@ -108,7 +108,7 @@ load_state_dict(state_dict)

 1.  state_dict (dict) – `optimizer` state. Should be an object returned from a call to `state_dict()`.

-```
+```py
 state_dict()
 ```

@@ -119,7 +119,7 @@ state_dict()
 1.  state – a `dict` holding the current optimization state. Its content differs between optimizer classes.
 2.  param_groups – a `dict` containing all parameter groups.

-```
+```py
 step(closure)
 ```

@@ -131,7 +131,7 @@ step(closure)

 Clears the gradients of all optimized `Variable`s.

-```
+```py
 class torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
 ```

@@ -147,7 +147,7 @@ class torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
 4.  lr (float, optional) – coefficient that scales delta before it is applied to the parameters (default: 1.0)
 5.  weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

-```
+```py
 step(closure)
 ```

@@ -157,7 +157,7 @@ step(closure)

 1.  closure (callable, optional) – a closure that reevaluates the model and returns the loss.

-```
+```py
 class torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0)
 ```

@@ -172,7 +172,7 @@ class torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0)
 3.  lr_decay (float, optional) – learning rate decay (default: 0)
 4.  weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

-```
+```py
 step(closure)
 ```

@@ -182,7 +182,7 @@ step(closure)

 1.  closure (callable, optional) – a closure that reevaluates the model and returns the loss.

-```
+```py
 class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
 ```

@@ -198,7 +198,7 @@ class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_d
 4.  eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
 5.  weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

-```
+```py
 step(closure) 
 ```

@@ -208,7 +208,7 @@ step(closure)

 1.  closure (callable, optional) – a closure that reevaluates the model and returns the loss.

-```
+```py
 class torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
 ```

@@ -224,7 +224,7 @@ class torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight
 4.  eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
 5.  weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

-```
+```py
 step(closure=None)
 ```

@@ -234,7 +234,7 @@ step(closure=None)

 1.  closure (callable, optional) – a closure that reevaluates the model and returns the loss.

-```
+```py
 class torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
 ```

@@ -251,7 +251,7 @@ class torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0,
 5.  t0 (float, optional) – point at which to start averaging (default: 1e6)
 6.  weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

-```
+```py
 step(closure)
 ```

@@ -261,7 +261,7 @@ step(closure)

 1.  closure (callable, optional) – a closure that reevaluates the model and returns the loss.

-```
+```py
 class torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-05, tolerance_change=1e-09, history_size=100, line_search_fn=None)
 ```

@@ -280,7 +280,7 @@ class torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad
 5.  tolerance_change (float) – termination tolerance on function value/parameter changes (default: 1e-9)
 6.  history_size (int) – update history size (default: 100)

-```
+```py
 step(closure)
 ```

@@ -290,7 +290,7 @@ step(closure)

 1.  closure (callable, optional) – a closure that reevaluates the model and returns the loss.

-```
+```py
 class torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
 ```

@@ -310,7 +310,7 @@ class torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0
 6.  centered (bool, optional) – if True, compute the centered RMSProp, in which the gradient is normalized by an estimate of its variance
 7.  weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

-```
+```py
 step(closure)
 ```

@@ -320,7 +320,7 @@ step(closure)

 1.  closure (callable, optional) – a closure that reevaluates the model and returns the loss.

-```
+```py
 class torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
 ```

@@ -333,7 +333,7 @@ class torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50)
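 3.  etas (Tuple[float, float], optional) – pair of (etaminus, etaplus), the multiplicative decrease and increase factors (default: (0.5, 1.2))
 4.  step_sizes (Tuple[float, float], optional) – a pair of minimal and maximal allowed step sizes (default: (1e-6, 50))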

-```
+```py
 step(closure) 
 ```

@@ -343,7 +343,7 @@ step(closure)

 1.  closure (callable, optional) – a closure that reevaluates the model and returns the loss.

-```
+```py
 class torch.optim.SGD(params, lr=, momentum=0, dampening=0, weight_decay=0, nesterov=False)
 ```

@@ -362,7 +362,7 @@ Nesterov动量基于[On the importance of initialization and momentum in deep le

 Example:

-```
+```py
 >>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
 >>> optimizer.zero_grad()
 >>> loss_fn(model(input), target).backward()
@@ -373,7 +373,7 @@ Nesterov动量基于[On the importance of initialization and momentum in deep le

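 > The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and from implementations in some other frameworks. Considering the specific case of Momentum, the update can be written as v = ρ∗v + g, p = p − lr∗v, where p, g, v and ρ denote the parameters, gradient, velocity, and momentum respectively. This is in contrast to Sutskever et al. and other frameworks, which employ an update of the form v = ρ∗v + lr∗g, p = p − v. The Nesterov version is analogously modified.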
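
The two momentum formulations in the note above can be compared with a small pure-Python sketch. The function names and the quadratic objective are hypothetical, chosen only for illustration; this is not the `torch.optim.SGD` implementation. With a constant learning rate the two forms follow the same parameter trajectory (the second velocity is just `lr` times the first), so they can only differ when `lr` changes between steps.

```python
def sgd_momentum_pytorch_form(p, grad_fn, lr, rho, steps):
    """v = rho * v + g;  p = p - lr * v  (the form described in the note)."""
    v = 0.0
    for _ in range(steps):
        g = grad_fn(p)
        v = rho * v + g
        p = p - lr * v
    return p


def sgd_momentum_sutskever_form(p, grad_fn, lr, rho, steps):
    """v = rho * v + lr * g;  p = p - v  (the form attributed to Sutskever et al.)."""
    v = 0.0
    for _ in range(steps):
        g = grad_fn(p)
        v = rho * v + lr * g
        p = p - v
    return p


def grad(p):
    # Gradient of the illustrative objective f(p) = p ** 2.
    return 2.0 * p


a = sgd_momentum_pytorch_form(1.0, grad, lr=0.1, rho=0.9, steps=20)
b = sgd_momentum_sutskever_form(1.0, grad, lr=0.1, rho=0.9, steps=20)
print(a, b)  # the two results agree up to floating-point rounding
```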

-```
+```py
 step(closure) 
 ```

@@ -387,7 +387,7 @@ step(closure)

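 `torch.optim.lr_scheduler` provides several methods to adjust the learning rate based on the number of epochs. `torch.optim.lr_scheduler.ReduceLROnPlateau` allows dynamic learning rate reduction based on some validation measurements.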

-```
+```py
 class torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)
 ```

@@ -401,7 +401,7 @@ class torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)

 Example:

-```
+```py
 >>> # Assuming optimizer has two groups.
 >>> lambda1 = lambda epoch: epoch // 30
 >>> lambda2 = lambda epoch: 0.95 ** epoch
@@ -412,7 +412,7 @@ class torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)
 >>>     validate(...)
 ```
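
The rule behind `LambdaLR` can be sketched without PyTorch: each group's learning rate is its initial rate multiplied by the value of that group's function at the current epoch. `lambda_lr` below is a hypothetical helper, not part of `torch.optim.lr_scheduler`, and the base rate of 0.1 is an assumed value.

```python
def lambda_lr(base_lrs, lr_lambdas, epoch):
    """Return one learning rate per group: base_lr * lr_lambda(epoch)."""
    return [base * fn(epoch) for base, fn in zip(base_lrs, lr_lambdas)]


# The same two functions as in the example above, one per parameter group.
lambda1 = lambda epoch: epoch // 30
lambda2 = lambda epoch: 0.95 ** epoch

for epoch in (0, 30, 60):
    print(epoch, lambda_lr([0.1, 0.1], [lambda1, lambda2], epoch))
```

Note that `lambda1` evaluates to 0 for the first 30 epochs, so the first group would not move during that period under this rule.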

-```
+```py
 class torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)
 ```

@@ -425,7 +425,7 @@ class torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoc

 Example:

-```
+```py
 >>> # Assuming optimizer uses lr = 0.05 for all groups
 >>> # lr = 0.05     if epoch < 30
 >>> # lr = 0.005    if 30 <= epoch < 60
@@ -438,7 +438,7 @@ class torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoc
 >>>     validate(...)
 ```
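
The step schedule listed in the comments above can be reproduced with a closed-form expression. This is a pure-Python sketch under the assumption that the rate is multiplied by `gamma` once every `step_size` epochs; `step_lr` is a hypothetical helper, not the library class.

```python
def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Learning rate after decaying by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Starting from 0.05 with step_size=30, the rate drops tenfold at epochs 30, 60, ...
for epoch in (0, 29, 30, 59, 60):
    print(epoch, step_lr(0.05, epoch, step_size=30))
```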

-```
+```py
 class torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)
 ```

@@ -453,7 +453,7 @@ class torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, las

 Example:

-```
+```py
 >>> # Assuming optimizer uses lr = 0.05 for all groups
 >>> # lr = 0.05     if epoch < 30
 >>> # lr = 0.005    if 30 <= epoch < 80
@@ -465,7 +465,7 @@ class torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, las
 >>>     validate(...)
 ```
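
The milestone schedule in the comments above amounts to counting how many milestones have been reached. A minimal pure-Python sketch, assuming `milestones` is sorted in increasing order; `multi_step_lr` is a hypothetical helper, not the library class.

```python
from bisect import bisect_right


def multi_step_lr(base_lr, epoch, milestones, gamma=0.1):
    """Decay `base_lr` by `gamma` once for every milestone <= epoch."""
    return base_lr * gamma ** bisect_right(milestones, epoch)


# With milestones [30, 80]: 0.05 before epoch 30, 0.005 up to epoch 79, 0.0005 afterwards.
for epoch in (0, 30, 79, 80):
    print(epoch, multi_step_lr(0.05, epoch, milestones=[30, 80]))
```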

-```
+```py
 class torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)
 ```

@@ -475,7 +475,7 @@ class torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)
 2.  gamma (float) – multiplicative factor of learning rate decay.
 3.  last_epoch (int) – the index of the last epoch. Default: -1.

-```
+```py
 class torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)
 ```

@@ -492,7 +492,7 @@ class torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0
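 9.  min_lr (float or list) – a scalar or a list of scalars. A lower bound on the learning rate of all parameter groups or of each group respectively. Default: 0.
 10.  eps (float) – minimal decay applied to lr. If the difference between the new and old lr is smaller than eps, the update is ignored. Default: 1e-8.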

-```
+```py
 >>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
 >>> scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
 >>> for epoch in range(10):

--- a/docs/0.4/2.md
+++ b/docs/0.4/2.md
@@ -19,7 +19,7 @@

 For example:

-```
+```py
 >>> x=torch.FloatTensor(5,7,3)
 >>> y=torch.FloatTensor(5,7,3)
 # same shapes are always broadcastable (i.e. the above rules always hold)
@@ -51,7 +51,7 @@

 For example:

-```
+```py
 # can line up trailing dimensions to make reading easier
 >>> x=torch.FloatTensor(5,1,4,1)
 >>> y=torch.FloatTensor(  3,1,1)
@@ -76,7 +76,7 @@ RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at no

 For example:

-```
+```py
 >>> x=torch.FloatTensor(5,3,4,1)
 >>> y=torch.FloatTensor(3,1,1)
 >>> (x.add_(y)).size()
@@ -97,7 +97,7 @@ RuntimeError: The expanded size of the tensor (1) must match the existing size (

 For example:

-```
+```py
 >>> torch.add(torch.ones(4,1), torch.randn(4))
 ```
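
The shape that such a call produces follows from the broadcasting rules, which can be sketched in pure Python. `broadcast_shape` is a hypothetical helper written for illustration, not a PyTorch function: it lines the two shapes up from the trailing dimension, treats a missing dimension as 1, and requires each pair of sizes to be equal or to contain a 1.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of shapes `a` and `b`, or raise ValueError."""
    result = []
    # Walk both shapes from the trailing dimension; a missing dimension counts as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        # The size-1 side is stretched to match the other side.
        result.append(x if y == 1 else y)
    return tuple(reversed(result))


print(broadcast_shape((4, 1), (4,)))             # the shapes used in the call above
print(broadcast_shape((5, 1, 4, 1), (3, 1, 1)))  # trailing dimensions line up
```

Under these rules the operands of shape `(4, 1)` and `(4,)` combine into a `(4, 4)` result rather than `(4, 1)`, which is the change in behavior that the backward-compatibility warning is about.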

@@ -105,7 +105,7 @@ RuntimeError: The expanded size of the tensor (1) must match the existing size (

 For example:

-```
+```py
 >>> torch.utils.backcompat.broadcast_warning.enabled=True
 >>> torch.add(torch.ones(4,1), torch.ones(4))
 __main__:1: UserWarning: self and other do not have the same shape, but are broadcastable, and have the same number of elements.

--- a/docs/0.4/20.md
+++ b/docs/0.4/20.md
@@ -2,7 +2,7 @@

 # torch.nn.init

-```
+```py
 torch.nn.init.calculate_gain(nonlinearity,param=None)
 ```

@@ -24,11 +24,11 @@ torch.nn.init.calculate_gain(nonlinearity,param=None)

 Example:

-```
+```py
 gain = nn.init.calculate_gain('leaky_relu')
 ```
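
For orientation, the recommended gain values can be written out as a small pure-Python table. This is a sketch that mirrors the documented values (1 for linear and sigmoid, 5/3 for tanh, √2 for ReLU, √(2 / (1 + negative_slope²)) for leaky ReLU); `recommended_gain` is a hypothetical name, and the default slope of 0.01 is an assumption matching the usual leaky ReLU default.

```python
import math


def recommended_gain(nonlinearity, param=None):
    """Return the recommended gain for a nonlinearity (illustrative re-implementation)."""
    if nonlinearity in ("linear", "sigmoid"):
        return 1.0
    if nonlinearity == "tanh":
        return 5.0 / 3.0
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        # `param` is the negative slope; 0.01 is assumed when it is not given.
        slope = 0.01 if param is None else param
        return math.sqrt(2.0 / (1.0 + slope ** 2))
    raise ValueError(f"unsupported nonlinearity: {nonlinearity}")


print(recommended_gain("relu"))
print(recommended_gain("leaky_relu", 0.2))
```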

-```
+```py
 torch.nn.init.uniform(tensor, a=0, b=1)
 ```

@@ -42,7 +42,7 @@ torch.nn.init.uniform(tensor, a=0, b=1)[source]

 Example:

-```
+```py
 w = torch.Tensor(3, 5)
 print(nn.init.uniform(w))
 # Output:
@@ -52,7 +52,7 @@ print nn.init.uniform(w)
 # [torch.FloatTensor of size 3x5]
 ```

-```
+```py
 torch.nn.init.normal(tensor, mean=0, std=1)
 ```

@@ -66,12 +66,12 @@ torch.nn.init.normal(tensor, mean=0, std=1)

 Example:

-```
+```py
 w = torch.Tensor(3, 5)
 print(torch.nn.init.normal(w))
 ```

-```
+```py
 torch.nn.init.constant(tensor, val)
 ```

@@ -84,12 +84,12 @@ torch.nn.init.constant(tensor, val)

 Example:

-```
+```py
 w = torch.Tensor(3, 5)
 print(torch.nn.init.constant(w, 0.3))
 ```

-```
+```py
 torch.nn.init.eye(tensor)
 ```

@@ -101,12 +101,12 @@ torch.nn.init.eye(tensor)

 Example:

-```
+```py
 w = torch.Tensor(3, 5)
 print(torch.nn.init.eye(w))
 ```

-```
+```py
 torch.nn.init.dirac(tensor)
 ```

@@ -118,12 +118,12 @@ torch.nn.init.dirac(tensor)

 Example:

-```
+```py
 w = torch.Tensor(3, 16, 5, 5)
 print(torch.nn.init.dirac(w))
 ```

-```
+```py
 torch.nn.init.xavier_uniform(tensor, gain=1)
 ```

@@ -136,12 +136,12 @@ torch.nn.init.xavier_uniform(tensor, gain=1)

 Example:

-```
+```py
 w = torch.Tensor(3, 5)
 print(torch.nn.init.xavier_uniform(w, gain=nn.init.calculate_gain('relu')))
 ```

-```
+```py
 torch.nn.init.xavier_normal(tensor, gain=1)
 ```

@@ -154,12 +154,12 @@ torch.nn.init.xavier_normal(tensor, gain=1)

 Example:

-```
+```py
 >>> w = torch.Tensor(3, 5)
 >>> nn.init.xavier_normal(w)
 ```

-```
+```py
 torch.nn.init.kaiming_uniform(tensor, a=0, mode='fan_in')
 ```

@@ -173,12 +173,12 @@ torch.nn.init.kaiming_uniform(tensor, a=0, mode='fan_in')

 Example:

-```
+```py
 w = torch.Tensor(3, 5)
 torch.nn.init.kaiming_uniform(w, mode='fan_in')
 ```

-```
+```py
 torch.nn.init.kaiming_normal(tensor, a=0, mode='fan_in')
 ```

@@ -190,12 +190,12 @@ torch.nn.init.kaiming_normal(tensor, a=0, mode='fan_in')
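 2.  a - the negative slope of the rectifier used after this layer (0 for ReLU by default)
 3.  mode - either "fan_in" (default) or "fan_out". "fan_in" preserves the magnitude of the variance of the weights in the forward pass; "fan_out" preserves the magnitudes in the backward pass.

-```
+```py
 w = torch.Tensor(3, 5)
 print(torch.nn.init.kaiming_normal(w, mode='fan_out'))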
 ```

-```
+```py
 torch.nn.init.orthogonal(tensor, gain=1)
 ```

@@ -208,12 +208,12 @@ torch.nn.init.orthogonal(tensor, gain=1)

 Example:

-```
+```py
 w = torch.Tensor(3, 5)
 print(torch.nn.init.orthogonal(w))
 ```

-```
+```py
 torch.nn.init.sparse(tensor, sparsity, std=0.01)
 ```

@@ -226,7 +226,7 @@ torch.nn.init.sparse(tensor, sparsity, std=0.01)
 3.  std - the standard deviation of the normal distribution used to generate the non-zero values

 Example:

-```
+```py
 w = torch.Tensor(3, 5)
 print(torch.nn.init.sparse(w, sparsity=0.1))
 ```
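
The `calculate_gain` and `xavier_uniform` calls in the hunks above combine a per-nonlinearity gain with a bound derived from the tensor shape. The following is a dependency-free sketch of that arithmetic, not the library's implementation. The helper names `gain_for` and `xavier_uniform_bound` are hypothetical, and the gain table assumes the values commonly documented for `calculate_gain`.

```python
import math


def gain_for(nonlinearity, param=None):
    """Recommended gain for a nonlinearity (hypothetical helper, assumed values)."""
    if nonlinearity in ("linear", "sigmoid"):
        return 1.0
    if nonlinearity == "tanh":
        return 5.0 / 3.0
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        # Assumed default negative slope of 0.01 when none is given.
        slope = 0.01 if param is None else param
        return math.sqrt(2.0 / (1.0 + slope ** 2))
    raise ValueError("unsupported nonlinearity: {}".format(nonlinearity))


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Half-width a of the U(-a, a) range used by Xavier (Glorot) uniform init."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


# A 3x5 weight matrix has fan_out=3 (rows) and fan_in=5 (columns).
bound = xavier_uniform_bound(fan_in=5, fan_out=3, gain=gain_for("relu"))
print(bound)
```

Under these assumptions, every value drawn for the 3x5 weight would lie in `[-bound, bound]`.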

--- a/docs/0.4/22.md
+++ b/docs/0.4/22.md
@@ -20,19 +20,19 @@

 ## Strategy management

-```
+```py
 torch.multiprocessing.get_all_sharing_strategies()
 ```

 Returns the set of sharing strategies supported on the current system.

-```
+```py
 torch.multiprocessing.get_sharing_strategy()
 ```

 Returns the current strategy for sharing CPU tensors.

-```
+```py
 torch.multiprocessing.set_sharing_strategy(new_strategy)
 ```


--- a/docs/0.4/23.md
+++ b/docs/0.4/23.md
@@ -76,7 +76,7 @@ Rank是分配给分布式组中每个进程的唯一标识符。它们总是连

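 Alternatively, the address has to be a valid IP multicast address, in which case ranks can be assigned automatically. Multicast initialization also supports a `group_name` argument, which allows the same address to be used for multiple jobs as long as they use different group names.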

-```
+```py
 import torch.distributed as dist

 # Use address of one of the machines
@@ -95,7 +95,7 @@ dist.init_process_group(init_method='tcp://[ff15:1e18:5d4c:4cf0:d02d:b659:53ba:b

 This method assumes that the file system supports locking using fcntl; most local file systems and NFS support it.

-```
+```py
 import torch.distributed as dist

 # Rank will be assigned automatically if unspecified

--- a/docs/0.4/24.md
+++ b/docs/0.4/24.md
@@ -4,13 +4,13 @@

 Run it on the command line with

-```
+```py
 python -m torch.utils.bottleneck /path/to/source/script.py [args]
 ```

 `[args]` are any number of arguments to `script.py`. You can also run the following for more usage instructions.

-```
+```py
 python -m torch.utils.bottleneck -h
 ```


--- a/docs/0.4/25.md
+++ b/docs/0.4/25.md
@@ -46,7 +46,7 @@

 Example:

-```
+```py
 >>> model = nn.Sequential(...)
 >>> input_var = checkpoint_sequential(model, chunks, input_var)
 ```

--- a/docs/0.4/26.md
+++ b/docs/0.4/26.md
@@ -12,7 +12,7 @@

 Example

-```
+```py
 >>> from setuptools import setup
 >>> from torch.utils.cpp_extension import BuildExtension, CppExtension
 >>> setup(
@@ -34,7 +34,7 @@

 Example

-```
+```py
 >>> from setuptools import setup
 >>> from torch.utils.cpp_extension import BuildExtension, CppExtension
 >>> setup(
@@ -85,7 +85,7 @@

 Example

-```
+```py
 >>> from torch.utils.cpp_extension import load
 >>> module = load(
        name='extension',
@@ -103,7 +103,7 @@

 For example:

-```
+```py
 from setuptools import setup
 from torch.utils.cpp_extension import BuildExtension, CppExtension


--- a/docs/0.4/27.md
+++ b/docs/0.4/27.md
@@ -2,7 +2,7 @@

 ## torch.utils.data

-```
+```py
 class torch.utils.data.Dataset
 ```

@@ -10,7 +10,7 @@ class torch.utils.data.Dataset

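 All other datasets should subclass it. All subclasses should override `__len__`, which provides the size of the dataset, and `__getitem__`, which supports integer indexing in the range from 0 to len(self), exclusive.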
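
The two-method contract described above can be sketched without any dependency. `SquaresDataset` is a hypothetical class used only for illustration. To stay self-contained it does not subclass `torch.utils.data.Dataset`, which a real dataset would.

```python
class SquaresDataset(object):
    """Hypothetical dataset whose i-th sample is the pair (i, i * i)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Size of the dataset.
        return self.n

    def __getitem__(self, index):
        # Integer indexing over 0 .. len(self) - 1.
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index, index * index


dataset = SquaresDataset(4)
samples = [dataset[i] for i in range(len(dataset))]
print(samples)
```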

-```
+```py
 class torch.utils.data.TensorDataset(data_tensor, target_tensor)
 ```

@@ -25,7 +25,7 @@ class torch.utils.data.TensorDataset(data_tensor, target_tensor)

 Example:

-```
+```py
 x = torch.linspace(1, 10, 10)       # x data (torch tensor)
 y = torch.linspace(10, 1, 10)       # y data (torch tensor)

@@ -33,7 +33,7 @@ y = torch.linspace(10, 1, 10)       # y data (torch tensor)
 torch_dataset = torch.utils.data.TensorDataset(data_tensor=x, target_tensor=y)
 ```

-```
+```py
 class torch.utils.data.ConcatDataset(datasets)
 ```

@@ -44,7 +44,7 @@ class torch.utils.data.ConcatDataset(datasets)
 *   datasets: the list of datasets to be concatenated
 *   type of datasets: iterable

-```
+```py
 class torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, num_workers=0, collate_fn=<function default_collate>, pin_memory=False, drop_last=False)
 ```

@@ -62,7 +62,7 @@ class torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, sampler=
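 8.  pin_memory (bool, optional) – If True, the data loader copies tensors into CUDA pinned memory before returning them.
 9.  drop_last (bool, optional) – Set to True to drop the last incomplete batch if the dataset size is not divisible by batch_size. If False and the dataset size is not divisible by batch_size, the last batch will be smaller. (default: False)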
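
The effect of `drop_last` on batch sizes is plain integer arithmetic. The sketch below uses a hypothetical helper, `batch_sizes`, to list the size of each batch that would be produced. It illustrates the rule stated in the parameter list and is not the loader's actual code.

```python
def batch_sizes(dataset_size, batch_size, drop_last=False):
    """Sizes of the batches produced for a dataset of the given size."""
    full, remainder = divmod(dataset_size, batch_size)
    sizes = [batch_size] * full
    # The incomplete trailing batch is kept only when drop_last is False.
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes


print(batch_sizes(10, 4))                  # last batch is smaller
print(batch_sizes(10, 4, drop_last=True))  # incomplete batch is dropped
```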

-```
+```py
 class torch.utils.data.sampler.Sampler(data_source)
 ```

@@ -70,7 +70,7 @@ class torch.utils.data.sampler.Sampler(data_source)

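 Every sampler subclass has to provide an `__iter__` method, which gives a way to iterate over the indices of dataset elements, and a `__len__` method that returns the length of the returned iterator.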
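
A minimal sketch of that sampler contract follows. `EveryOtherSampler` is a hypothetical class that yields every second index. It is written as a plain Python class so that it runs without PyTorch, whereas a real sampler would subclass `torch.utils.data.sampler.Sampler`.

```python
class EveryOtherSampler(object):
    """Hypothetical sampler that yields indices 0, 2, 4, ... of a data source."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # Iterate over indices of dataset elements, not the elements themselves.
        return iter(range(0, len(self.data_source), 2))

    def __len__(self):
        # Number of indices that __iter__ yields.
        return (len(self.data_source) + 1) // 2


sampler = EveryOtherSampler(["a", "b", "c", "d", "e"])
indices = list(sampler)
print(indices)
```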

-```
+```py
 class torch.utils.data.sampler.SequentialSampler(data_source)
 ```

@@ -80,7 +80,7 @@ class torch.utils.data.sampler.SequentialSampler(data_source)

 *   `data_source (Dataset)` – the dataset to sample from.

-```
+```py
 class torch.utils.data.sampler.RandomSampler(data_source)
 ```

@@ -88,7 +88,7 @@ class torch.utils.data.sampler.RandomSampler(data_source)

 Parameters: - `data_source (Dataset)` – the dataset to sample from.

-```
+```py
 class torch.utils.data.sampler.SubsetRandomSampler(indices)
 ```

@@ -96,7 +96,7 @@ class torch.utils.data.sampler.SubsetRandomSampler(indices)

 Parameters: - `indices (list)` – a list of indices

-```
+```py
 class torch.utils.data.sampler.WeightedRandomSampler(weights, num_samples, replacement=True)
 ```

@@ -107,7 +107,7 @@ class torch.utils.data.sampler.WeightedRandomSampler(weights, num_samples, repla
 *   `weights (list)` – a list of weights, which need not sum to 1
 *   `num_samples (int)` – the number of samples to draw
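
Weighted index sampling with replacement can be sketched with the standard library alone. The helper `weighted_indices` and its `seed` parameter are hypothetical. The sketch only illustrates that unnormalised weights are acceptable; it does not reproduce the sampler's implementation.

```python
import random


def weighted_indices(weights, num_samples, seed=None):
    """Draw num_samples indices with replacement, proportionally to weights."""
    rng = random.Random(seed)
    # random.choices normalises the weights internally, so they need not sum to 1.
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


drawn = weighted_indices([2.0, 0.0, 4.0], num_samples=20, seed=0)
print(drawn)
```

Index 1 has weight 0, so it should never appear in `drawn`.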

-```
+```py
 class torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None)
 ```


--- a/docs/0.4/28.md
+++ b/docs/0.4/28.md
@@ -2,7 +2,7 @@

 # torch.utils.ffi

-```
+```py
 torch.utils.ffi.create_extension(name, headers, sources, verbose=True, with_cuda=False, package=False, relative_to='.', **kwargs)
 ```


--- a/docs/0.4/29.md
+++ b/docs/0.4/29.md
@@ -2,7 +2,7 @@

 # torch.utils.model_zoo

-```
+```py
 torch.utils.model_zoo.load_url(url, model_dir=None)
 ```

@@ -20,7 +20,7 @@ torch.utils.model_zoo.load_url(url, model_dir=None)

 For example:

-```
+```py
 >>> state_dict = torch.utils.model_zoo.load_url('https://s3.amazonaws.com/pytorch/models/resnet18-5c106cde.pth')
 ```


--- a/docs/0.4/3.md
+++ b/docs/0.4/3.md
@@ -20,7 +20,7 @@

 Below is a small example that demonstrates this:

-```
+```py
 cuda = torch.device("cuda")  # Default CUDA device
 cuda0 = torch.device("cuda:0")
 cuda2 = torch.device("cuda:2") # GPU 2
@@ -63,7 +63,7 @@ CUDA 流是属于特定设备的线性执行序列。您通常不需要明确创

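 Operations inside each stream are serialized in the order they are created, but operations from different streams can execute concurrently in any relative order, unless explicit synchronization functions (such as `synchronize()` or `wait_stream()`) are used. For example, the following code is incorrect:

-```
+```py
 cuda = torch.device("cuda")
 s = torch.cuda.Stream()  # Create a new stream.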
 A = torch.empty((100,100), device = cuda).normal_(0.0, 1.0)
@@ -87,7 +87,7 @@ with torch.cuda.stream(s):

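 The first step is to determine whether the GPU should be used. A common pattern is to use Python's `argparse` module to read in user arguments, with a flag that can be used to disable CUDA, in combination with `is_available()`. In the following, `args.device` results in a `torch.device` object that can be used to move tensors to the CPU or to CUDA.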

-```
+```py
 import argparse
 import torch

@@ -103,14 +103,14 @@ else:

 Now that we have `args.device`, we can use it to create a tensor on the desired device.

-```
+```py
 x = torch.empty((8, 42), device = args.device)
 net = Network().to(device = args.device)
 ```

 This can be used in a number of cases to produce device-agnostic code. Below is an example using a `dataloader`:

-```
+```py
 cuda0 = torch.device('cuda:0')  # CUDA GPU 0
 for i, x in enumerate(train_loader):
    x = x.to(cuda0)
@@ -118,7 +118,7 @@ for i, x in enumerate(train_loader):

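 When working with multiple GPUs on a system, you can use the `CUDA_VISIBLE_DEVICES` environment flag to manage which GPUs are available to PyTorch. As mentioned above, to manually control which GPU a tensor is created on, the best practice is to use the `torch.cuda.device` context manager.

-```
+```py
 print("Outside device is 0")  # On device 0
 with torch.cuda.device(1):
     print("Inside device is 1")  # On device 1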
@@ -129,7 +129,7 @@ print("外部的设备仍是0")  # 设备0

 This is the recommended practice when creating modules in which new tensors need to be created internally during the forward pass.

-```
+```py
 cuda = torch.device("cuda")
 x_cpu = torch.empty(2)
 y_gpu = torch.empty(2, device = cuda)
@@ -153,7 +153,7 @@ print(y_cpu_long)

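 If you want to create a tensor of the same type and size as another tensor and fill it with ones or zeros, `ones_like()` and `zeros_like()` are provided as convenient helper functions (which also preserve the `torch.device` and `torch.dtype` of the tensor).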

-```
+```py
 x_cpu = torch.empty(2,3)
 x_gpu = torch.empty(2,3)


--- a/docs/0.4/30.md
+++ b/docs/0.4/30.md
@@ -6,7 +6,7 @@

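 Here is a simple script that exports the pretrained `AlexNet` defined in `torchvision` to `ONNX`. It runs a single round of inference and then saves the resulting traced model to `alexnet.proto`: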

-```
+```py
 from torch.autograd import Variable
 import torch.onnx
 import torchvision
@@ -18,7 +18,7 @@ torch.onnx.export(model, dummy_input, "alexnet.proto", verbose=True)

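 The resulting `alexnet.proto` is a binary `protobuf` file that contains both the network structure and the parameters of the model you exported (in this case, `AlexNet`). The keyword argument `verbose=True` causes the exporter to print a human-readable representation of the network: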

-```
+```py
 # All parameters are encoded explicitly as inputs.  By convention,
 # learned parameters (ala nn.Module.state_dict) are first, and the
 # actual inputs are last.
@@ -50,13 +50,13 @@ graph(%1 : Float(64, 3, 11, 11)

 You can also verify the `protobuf` using the [onnx](https://github.com/onnx/onnx/) library. You can install `onnx` with `conda`:

-```
+```py
 conda install -c conda-forge onnx
 ```

 Then, you can run:

-```
+```py
 import onnx

 # Load the ONNX model
@@ -75,13 +75,13 @@ onnx.helper.printable_graph(model.graph)

 *   2. You need `onnx-caffe2`, a pure-Python library that provides a `Caffe2` backend for `ONNX`. You can install `onnx-caffe2` with `pip`:

-```
+```py
 pip install onnx-caffe2
 ```

 Once it is installed, you can use the `Caffe2` backend:

-```
+```py
 # ...continuing from above
 import onnx_caffe2.backend as backend
 import numpy as np
@@ -114,7 +114,7 @@ print(outputs[0])

 ### torch.onnx functions

-```
+```py
 torch.onnx.export(model, args, f, export_params=True, verbose=False, training=False)
 ```


--- a/docs/0.4/33.md
+++ b/docs/0.4/33.md
@@ -6,13 +6,13 @@

 The following gets the name of the package used to load images.

-```
+```py
 torchvision.get_image_backend()
 ```

 Specifies the package used to load images.

-```
+```py
 torchvision.set_image_backend(backend)
 ```


--- a/docs/0.4/34.md
+++ b/docs/0.4/34.md
@@ -18,7 +18,7 @@

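 All datasets are subclasses of `torch.utils.data.Dataset`, i.e. they implement the `__getitem__` and `__len__` methods. Hence, they can all be passed to a `torch.utils.data.DataLoader`, which can load multiple samples in parallel using `torch.multiprocessing` workers. For example: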

-```
+```py
 imagenet_data = torchvision.datasets.ImageFolder('path/to/imagenet_root/')
 data_loader = torch.utils.data.DataLoader(imagenet_data,
                                          batch_size=4,
@@ -30,7 +30,7 @@ data_loader = torch.utils.data.DataLoader(imagenet_data,

 #### MNIST

-```
+```py
 dset.MNIST(root, train=True, transform=None, target_transform=None, download=False)
 ```

@@ -46,7 +46,7 @@ dset.MNIST(root, train=True, transform=None, target_transform=None, download=Fal

 Requires the [COCO API](https://github.com/pdollar/coco/tree/master/PythonAPI) to be installed.

-```
+```py
 dset.CocoCaptions(root="dir where images are", annFile="json annotation file", [transform, target_transform])
 ```

@@ -59,7 +59,7 @@ dset.CocoCaptions(root="dir where images are", annFile="json annotation file", [

 Example:

-```
+```py
 import torchvision.datasets as dset
 import torchvision.transforms as transforms
 cap = dset.CocoCaptions(root = 'dir where images are',
@@ -75,7 +75,7 @@ print(target)

 Output:

-```
+```py
 Number of samples: 82783
 Image Size: (3L, 427L, 640L)
 [u'A plane emitting smoke stream flying over a mountain.',
@@ -89,7 +89,7 @@ u'A mountain view with a plume of smoke in the background']

 Detection:

-```
+```py
 dset.CocoDetection(root="dir where images are", annFile="json annotation file", [transform, target_transform])
 ```

@@ -104,7 +104,7 @@ dset.CocoDetection(root="dir where images are", annFile="json annotation file",

 #### LSUN

-```
+```py
 dset.LSUN(db_path, classes='train', [transform, target_transform])
 ```

@@ -119,7 +119,7 @@ dset.LSUN(db_path, classes='train', [transform, target_transform])

 A generic data loader where the data is arranged in this way:

-```
+```py
 root/dog/xxx.png
 root/dog/xxy.png
 root/dog/xxz.png
@@ -143,7 +143,7 @@ dset.ImageFolder(root="root folder path", [transform, target_transform])

 #### CIFAR

-```
+```py
 dset.CIFAR10(root, train=True, transform=None, target_transform=None, download=False)

 dset.CIFAR100(root, train=True, transform=None, target_transform=None, download=False)
@@ -159,7 +159,7 @@ dset.CIFAR100(root, train=True, transform=None, target_transform=None, download=

 #### STL10

-```
+```py
 dset.STL10(root, split='train', transform=None, target_transform=None, download=False)
 ```

@@ -173,7 +173,7 @@ dset.STL10(root, split='train', transform=None, target_transform=None, download=

 #### SVHN

-```
+```py
 class torchvision.datasets.SVHN(root, split='train', transform=None, target_transform=None, download=False)
 ```

@@ -187,7 +187,7 @@ class torchvision.datasets.SVHN(root, split='train', transform=None, target_tran

 #### PhotoTour

-```
+```py
 class torchvision.datasets.PhotoTour(root, name, train=True, transform=None, download=False)
 ```


--- a/docs/0.4/35.md
+++ b/docs/0.4/35.md
@@ -12,7 +12,7 @@

 You can construct a model with random weights by calling its constructor:

-```
+```py
 import torchvision.models as models
 resnet18 = models.resnet18()
 alexnet = models.alexnet()
@@ -22,7 +22,7 @@ densenet = models.densenet_161()

 We provide pre-trained models for the ResNet variants and AlexNet, using PyTorch's `torch.utils.model_zoo`. These can be constructed by passing `pretrained=True`:

-```
+```py
 import torchvision.models as models
 resnet18 = models.resnet18(pretrained=True)
 alexnet = models.alexnet(pretrained=True)
@@ -30,7 +30,7 @@ alexnet = models.alexnet(pretrained=True)

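 All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. You can use the following transform to normalize: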

-```
+```py
 normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225])
 ```
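
To make the normalization concrete, here is a dependency-free sketch of the per-channel arithmetic, `(value - mean) / std`, applied to a single RGB pixel already scaled to [0, 1]. The helper `normalize_pixel` is hypothetical; the real transform operates on whole tensors.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Normalize one (r, g, b) pixel channel by channel."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(MEAN))
# A pure white pixel maps to a positive value in every channel.
print(normalize_pixel((1.0, 1.0, 1.0)))
```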

@@ -56,7 +56,7 @@ normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0
 | Densenet-201 | 22.80 | 6.43 |
 | Densenet-161 | 22.35 | 6.20 |

-```
+```py
 torchvision.models.alexnet(pretrained=False, **kwargs)
 ```

@@ -64,19 +64,19 @@ AlexNet 模型结构 paper地址

 pretrained (bool) – If True, returns a model pre-trained on ImageNet.

-```
+```py
 torchvision.models.resnet18(pretrained=False, **kwargs)
 ```

 Constructs a ResNet-18 model. pretrained (bool) – If True, returns a model pre-trained on ImageNet.

-```
+```py
 torchvision.models.resnet34(pretrained=False, **kwargs)
 ```

 Constructs a ResNet-34 model. Parameters: pretrained (bool) – If True, returns a model pre-trained on ImageNet.

-```
+```py
 torchvision.models.resnet50(pretrained=False, **kwargs)
 ```

@@ -84,7 +84,7 @@ torchvision.models.resnet50(pretrained=False, ** kwargs)

 pretrained (bool) – If True, returns a model pre-trained on ImageNet.

-```
+```py
 torchvision.models.resnet101(pretrained=False, **kwargs)
 ```

@@ -92,7 +92,7 @@ torchvision.models.resnet101(pretrained=False, ** kwargs)

 pretrained (bool) – If True, returns a model pre-trained on ImageNet.

-```
+```py
 torchvision.models.resnet152(pretrained=False, **kwargs)
 ```

@@ -100,7 +100,7 @@ torchvision.models.resnet152(pretrained=False, ** kwargs)

 pretrained (bool) – If True, returns a model pre-trained on ImageNet.

-```
+```py
 torchvision.models.vgg11(pretrained=False, **kwargs)
 ```

@@ -108,13 +108,13 @@ VGG 11-layer model (configuration “A”) -

 pretrained (bool) – If True, returns a model pre-trained on ImageNet.

-```
+```py
 torchvision.models.vgg11_bn(**kwargs)
 ```

 VGG 11-layer model (configuration “A”) with batch normalization

-```
+```py
 torchvision.models.vgg13(pretrained=False, **kwargs)
 ```

@@ -122,13 +122,13 @@ VGG 13-layer model (configuration “B”)

 pretrained (bool) – If True, returns a model pre-trained on ImageNet.

-```
+```py
 torchvision.models.vgg13_bn(**kwargs)
 ```

 VGG 13-layer model (configuration “B”) with batch normalization

-```
+```py
 torchvision.models.vgg16(pretrained=False, **kwargs)
 ```

@@ -136,13 +136,13 @@ VGG 16-layer model (configuration “D”)

 Parameters: pretrained (bool) – If True, returns a model pre-trained on ImageNet

-```
+```py
 torchvision.models.vgg16_bn(**kwargs)
 ```

 VGG 16-layer model (configuration “D”) with batch normalization

-```
+```py
 torchvision.models.vgg19(pretrained=False, **kwargs)
 ```

@@ -150,7 +150,7 @@ VGG 19-layer model (configuration “E”)

 pretrained (bool) – If True, returns a model pre-trained on ImageNet.

-```
+```py
 torchvision.models.vgg19_bn(**kwargs)
 ```


--- a/docs/0.4/36.md
+++ b/docs/0.4/36.md
@@ -11,7 +11,7 @@

 Transforms are common image transformations. They can be chained together using `Compose`.

-```
+```py
 class torchvision.transforms.Compose(transforms)
 ```

@@ -19,7 +19,7 @@ class torchvision.transforms.Compose(transforms)

 transforms: a list of transforms to compose. Example:

-```
+```py
 transforms.Compose([
     transforms.CenterCrop(10),
     transforms.ToTensor(),
@@ -30,7 +30,7 @@ transforms.Compose([

 * * *

-```
+```py
 class torchvision.transforms.Scale(size, interpolation=2)
 ```

@@ -41,31 +41,31 @@ class torchvision.transforms.Scale(size, interpolation=2)
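 1.  size (sequence or int) - the desired output size. If size is a sequence like (w, h), the output size is matched to it. If size is an int, the smaller edge of the image is matched to this number. For example, if `height > width`, the image is rescaled to `(size*height/width, size)`.
 2.  interpolation (int, optional) - the desired interpolation. The default is `PIL.Image.BILINEAR`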
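
The size rule in item 1 can be sketched as follows. `scaled_size` is a hypothetical helper that returns the output size as `(width, height)`. It assumes integer truncation of the scaled edge and the `(w, h)` order stated above for sequences, and it is an illustration rather than the transform's actual code.

```python
def scaled_size(width, height, size):
    """Output (width, height) for an image rescaled with the given size."""
    if isinstance(size, int):
        # Match the smaller edge to `size` and keep the aspect ratio.
        if width <= height:
            return size, int(size * height / width)
        return int(size * width / height), size
    # A (w, h) sequence is used as the output size directly.
    target_w, target_h = size
    return target_w, target_h


print(scaled_size(400, 600, 200))  # portrait image: width is the smaller edge
print(scaled_size(640, 480, 240))  # landscape image: height is the smaller edge
```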

-```
+```py
 class torchvision.transforms.CenterCrop(size)
 ```

 Crops the given PIL.Image at the center to the given size. size can be a tuple (target_height, target_width) or an integer, in which case the crop is a square.

-```
+```py
 class torchvision.transforms.RandomCrop(size, padding=0)
 ```

 Crops the given PIL.Image at a randomly chosen location. size can be a tuple or an integer.

-```
+```py
 class torchvision.transforms.RandomHorizontalFlip
 ```

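 Randomly flips the given PIL.Image horizontally with a probability of 0.5, i.e. the image is flipped half of the time and left unchanged the other half.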

-```
+```py
 class torchvision.transforms.RandomSizedCrop(size, interpolation=2)
 ```

 Randomly crops the given PIL.Image and then resizes the crop to the given size.

-```
+```py
 class torchvision.transforms.Pad(padding, fill=0)
 ```

@@ -75,7 +75,7 @@ class torchvision.transforms.Pad(padding, fill=0)

 * * *

-```
+```py
 class torchvision.transforms.Normalize(mean, std)
 ```

@@ -92,7 +92,7 @@ class torchvision.transforms.Normalize(mean, std)

 * * *

-```
+```py
 class torchvision.transforms.ToTensor
 ```

@@ -104,7 +104,7 @@ class torchvision.transforms.ToTensor
 2.  Returns: the converted image.
 3.  Return type: Tensor

-```
+```py
 class torchvision.transforms.ToPILImage
 ```

@@ -120,7 +120,7 @@ class torchvision.transforms.ToPILImage

 * * *

-```
+```py
 class torchvision.transforms.Lambda(lambd)
 ```


--- a/docs/0.4/37.md
+++ b/docs/0.4/37.md
@@ -2,7 +2,7 @@

 # torchvision.utils

-```
+```py
 torchvision.utils.make_grid(tensor, nrow=8, padding=2, normalize=False, range=None, scale_each=False, pad_value=0)
 ```

@@ -21,7 +21,7 @@ torchvision.utils.make_grid(tensor, nrow=8, padding=2, normalize=False, range=No

 See the example below:

-```
+```py
 torchvision.utils.save_image(tensor, filename, nrow=8, padding=2, normalize=False, range=None, scale_each=False, pad_value=0)
 ```


--- a/docs/0.4/4.md
+++ b/docs/0.4/4.md
@@ -19,7 +19,7 @@

 Below you can find the code for the `Linear` function from `torch.nn`, with additional comments:

-```
+```py
 # Inherit from Function
 class LinearFunction(Function):

@@ -57,13 +57,13 @@ class Linear(Function):

 Now, to make these custom operations easier to use, we recommend aliasing their `apply` method:

-```
+```py
 linear = LinearFunction.apply
 ```

 Below is an example of a function that is parametrized by non-Variable arguments:

-```
+```py
 class MulConstant(Function):
    @staticmethod
    def forward(ctx, tensor, constant):
@@ -81,7 +81,7 @@ class MulConstant(Function):

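 You probably want to check whether the `backward` method you just implemented computes the gradients correctly. You can do so by comparing against numerical estimates obtained with small finite differences.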
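
The idea behind such a check can be sketched in pure Python with a central finite difference on a scalar function. The names `numerical_derivative`, `forward` and `backward` are hypothetical stand-ins for a multiply-by-constant function such as `MulConstant`. This illustrates the principle only and does not show how `gradcheck` is implemented.

```python
def numerical_derivative(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


constant = 3.0


def forward(x):
    # Scalar analogue of multiplying a tensor by a constant.
    return x * constant


def backward(grad_output):
    # Analytic gradient of forward with respect to x.
    return grad_output * constant


numeric = numerical_derivative(forward, 1.5)
analytic = backward(1.0)
print(abs(numeric - analytic) < 1e-6)
```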

-```
+```py
 from torch.autograd import gradcheck

 # gradcheck takes a tuple of tensors as input, check if your gradient
@@ -107,7 +107,7 @@ print(test)

 This is how a `Linear` module can be implemented:

-```
+```py
 class Linear(nn.Module):
    def __init__(self, input_features, output_features, bias=True):
        super(Linear, self).__init__()

--- a/docs/0.4/5.md
+++ b/docs/0.4/5.md
@@ -10,7 +10,7 @@

 Sometimes it can be non-obvious when differentiable variables occur. Consider the following training loop (abridged from the [source](https://discuss.pytorch.org/t/high-memory-usage-while-training/162)):

-```
+```py
 total_loss = 0
 for i in range(10000):
    optimizer.zero_grad()
@@ -29,7 +29,7 @@ for i in range(10000):

 The scope of a local variable can be larger than you expect. For example:

-```
+```py
 for i in range(5):
     intermediate = f(input[i])
    result += g(intermediate)
@@ -61,7 +61,7 @@ PyTorch 使用缓存内存分配器来加速内存分配。因此，`nvidia-smi`

 sequences. For example, you can write:

-```
+```py
 from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

 class MyModule(nn.Module):

--- a/docs/0.4/6.md
+++ b/docs/0.4/6.md
@@ -53,7 +53,7 @@

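 A concrete Hogwild implementation can be found in the [examples repository](https://github.com/pytorch/examples/tree/master/mnist_hogwild), but to showcase the overall structure of the code, there is also a minimal example below: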

-```
+```py
 import torch.multiprocessing as mp
 from model import MyModel


--- a/docs/0.4/7.md
+++ b/docs/0.4/7.md
@@ -10,26 +10,26 @@

 The first (recommended) approach saves and loads only the model parameters:

-```
+```py
 torch.save(the_model.state_dict(), PATH)
 ```

 Then later:

-```
+```py
 the_model = TheModelClass(*args, **kwargs)
 the_model.load_state_dict(torch.load(PATH))
 ```

 The second approach saves and loads the entire model:

-```
+```py
 torch.save(the_model, PATH)
 ```

 Then later:

-```
+```py
 the_model = torch.load(PATH)
 ```


--- a/docs/0.4/8.md
+++ b/docs/0.4/8.md