Unverified commit cf668ab3, authored by Ligoml and committed by GitHub

[cherry-pick2.4]docs fix (#47669)

* #46165

* #45752

* fix some doc bug test=document_fix (#45488)

* fix some doc bug test=document_fix

* fix some docs issues, test=document_fix

* beta -> \beta in softplus

* threshold -> \varepsilon in softplus

* parameter name

* delta -> \delta in smooth_l1_loss

* fix some docs test=document_fix

* fix docs test=document_fix

* fix docs && add blank lines test=document_fix

* Update python/paddle/nn/functional/activation.py, test=document_fix

* Update python/paddle/nn/layer/activation.py, test=document_fix
Co-authored-by: SigureMo <sigure.qaq@gmail.com>

* [docs] add ipustrategy Hyperlink (#46422)

* [docs] add ipustrategy Hyperlink

* fix ipu_shard_guard docs; test=document_fix

* [docs] add set_ipu_shard note

* [docs] fix hyperlink

* update framework.py

* fix mlu_places docs; test=document_fix

* fix put_along_axis docs; test=document_fix

* fix flake8 W293 error, test=document_fix

* fix typo in typing, test=document_fix
Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com>
Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com>

* #46659

* Update README_cn.md (#46927)

Fixed typos.

* #46738

* fix paddle.get_default_dtype (#47040)

The Chinese and English docs described inconsistent return values.

* fix bug
Co-authored-by: 张春乔 <83450930+Liyulingyue@users.noreply.github.com>
Co-authored-by: Infinity_lee <luhputu0815@gmail.com>
Co-authored-by: mrcangye <chenloong@88.com>
Co-authored-by: SigureMo <sigure.qaq@gmail.com>
Co-authored-by: gouzil <66515297+gouzil@users.noreply.github.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: Sqhttwl <61459740+Sqhttwl@users.noreply.github.com>
Co-authored-by: OccupyMars2025 <31559413+OccupyMars2025@users.noreply.github.com>
Co-authored-by: 超级码牛 <54444805+SuperCodebull@users.noreply.github.com>
Co-authored-by: jzhang533 <jzhang533@gmail.com>
Parent 3a014783
......@@ -89,8 +89,8 @@ We provide [English](https://www.paddlepaddle.org.cn/documentation/docs/en/guide
## Courses
- [Server Deployments](https://aistudio.baidu.com/aistudio/course/introduce/19084): Courses intorducing high performance server deployments via local and remote services.
- [Edge Deployments](https://aistudio.baidu.com/aistudio/course/introduce/22690): Courses intorducing edge deployments from mobile, IoT to web and applets.
- [Server Deployments](https://aistudio.baidu.com/aistudio/course/introduce/19084): Courses introducing high performance server deployments via local and remote services.
- [Edge Deployments](https://aistudio.baidu.com/aistudio/course/introduce/22690): Courses introducing edge deployments from mobile, IoT to web and applets.
## Copyright and License
PaddlePaddle is provided under the [Apache-2.0 license](LICENSE).
......@@ -88,7 +88,7 @@ PaddlePaddle用户可领取**免费Tesla V100在线算力资源**,训练模型
## 课程
- [服务器部署](https://aistudio.baidu.com/aistudio/course/introduce/19084): 详细介绍高性能服务器端部署实操,包含本地端及服务化Serving部署等
- [端侧部署](https://aistudio.baidu.com/aistudio/course/introduce/22690): 详细介绍端侧多场景部署实操,从移端设备、IoT、网页到小程序部署
- [端侧部署](https://aistudio.baidu.com/aistudio/course/introduce/22690): 详细介绍端侧多场景部署实操,从移端设备、IoT、网页到小程序部署
## 版权和许可证
PaddlePaddle由[Apache-2.0 license](LICENSE)提供
......@@ -172,9 +172,9 @@ class ActivationOpGrad : public framework::OperatorWithKernel {
};
UNUSED constexpr char SigmoidDoc[] = R"DOC(
Sigmoid Activation Operator
Sigmoid Activation
$$out = \\frac{1}{1 + e^{-x}}$$
$$out = \frac{1}{1 + e^{-x}}$$
)DOC";
......
......@@ -55,7 +55,7 @@ class LegacyPyLayerContext(object):
"""
Saves given tensors that backward need. Use ``saved_tensor`` in the `backward` to get the saved tensors.
.. note::
Note:
This API should be called at most once, and only inside `forward`.
Args:
......@@ -341,7 +341,7 @@ class EagerPyLayerContext(object):
"""
Saves given tensors that backward need. Use ``saved_tensor`` in the `backward` to get the saved tensors.
.. note::
Note:
This API should be called at most once, and only inside `forward`.
Args:
......
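A minimal sketch of the ``save_for_backward`` / ``saved_tensor`` pair these docstrings describe, assuming the ``paddle.autograd.PyLayer`` pattern:

.. code-block:: python

    import paddle
    from paddle.autograd import PyLayer

    class CusTanh(PyLayer):
        @staticmethod
        def forward(ctx, x):
            y = paddle.tanh(x)
            # called once, inside forward, so backward can reuse y
            ctx.save_for_backward(y)
            return y

        @staticmethod
        def backward(ctx, dy):
            y, = ctx.saved_tensor()
            return dy * (1 - paddle.square(y))

    x = paddle.rand([2, 3], dtype='float32')
    x.stop_gradient = False
    out = CusTanh.apply(x)
    out.sum().backward()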
......@@ -203,7 +203,7 @@ def max_memory_allocated(device=None):
'''
Return the peak size of gpu memory that is allocated to tensor of the given device.
.. note::
Note:
The size of GPU memory allocated to tensor is 256-byte aligned in Paddle, which may larger than the memory size that tensor actually need.
For instance, a float32 tensor with shape [1] in GPU will take up 256 bytes memory, even though storing a float32 data requires only 4 bytes.
......@@ -269,7 +269,7 @@ def memory_allocated(device=None):
'''
Return the current size of gpu memory that is allocated to tensor of the given device.
.. note::
Note:
The size of GPU memory allocated to tensor is 256-byte aligned in Paddle, which may be larger than the memory size that tensor actually need.
For instance, a float32 tensor with shape [1] in GPU will take up 256 bytes memory, even though storing a float32 data requires only 4 bytes.
......
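A minimal usage sketch of these two queries, assuming a CUDA build of Paddle:

.. code-block:: python

    import paddle

    if paddle.is_compiled_with_cuda():
        paddle.set_device("gpu:0")
        x = paddle.rand([1000, 1000], dtype="float32")
        # current and peak bytes allocated to tensors (256-byte aligned, see note above)
        print(paddle.device.cuda.memory_allocated())
        print(paddle.device.cuda.max_memory_allocated())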
......@@ -1349,7 +1349,7 @@ def alltoall_single(
"""
Scatter a single input tensor to all participators and gather the received tensors in out_tensor.
.. note::
Note:
``alltoall_single`` is only supported in eager mode.
Args:
......
......@@ -30,9 +30,9 @@ def wait_server_ready(endpoints):
["127.0.0.1:8080", "127.0.0.1:8081"]
Examples:
.. code-block:: python
.. code-block:: python
wait_server_ready(["127.0.0.1:8080", "127.0.0.1:8081"])
wait_server_ready(["127.0.0.1:8080", "127.0.0.1:8081"])
"""
assert not isinstance(endpoints, str)
while True:
......
......@@ -105,7 +105,7 @@ def init_parallel_env():
"""
Initialize parallel training environment in dynamic graph mode.
.. note::
Note:
Now initialize both `NCCL` and `GLOO` contexts for communication.
Args:
......
......@@ -209,7 +209,7 @@ def save_group_sharded_model(model, output, optimizer=None):
"""
Group sharded encapsulated model and optimizer state saving module.
.. note::
Note:
If using save_group_sharded_model saves the model. When loading again, you need to set the model or optimizer state before using group_sharded_parallel.
Args:
......
......@@ -140,7 +140,7 @@ class Distribution(object):
def probs(self, value):
"""Probability density/mass function.
.. note::
Note:
This method will be deprecated in the future, please use `prob`
instead.
......
......@@ -38,11 +38,11 @@ def kl_divergence(p, q):
KL(p||q) = \int p(x)log\frac{p(x)}{q(x)} \mathrm{d}x
Args:
p (Distribution): ``Distribution`` object.
q (Distribution): ``Distribution`` object.
p (Distribution): ``Distribution`` object. Inherits from the Distribution Base class.
q (Distribution): ``Distribution`` object. Inherits from the Distribution Base class.
Returns:
Tensor: Batchwise KL-divergence between distribution p and q.
Tensor, Batchwise KL-divergence between distribution p and q.
Examples:
......@@ -71,8 +71,8 @@ def register_kl(cls_p, cls_q):
implemention funciton by the decorator.
Args:
cls_p(Distribution): Subclass derived from ``Distribution``.
cls_q(Distribution): Subclass derived from ``Distribution``.
cls_p (Distribution): The Distribution type of Instance p. Subclass derived from ``Distribution``.
cls_q (Distribution): The Distribution type of Instance q. Subclass derived from ``Distribution``.
Examples:
.. code-block:: python
......
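A minimal sketch of calling ``kl_divergence`` on two distributions of a registered pair (Beta here), as described above:

.. code-block:: python

    import paddle

    p = paddle.distribution.Beta(alpha=0.5, beta=0.5)
    q = paddle.distribution.Beta(alpha=0.3, beta=0.7)
    # batchwise KL(p || q), returned as a Tensor
    print(paddle.distribution.kl_divergence(p, q))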
......@@ -47,7 +47,7 @@ class Normal(distribution.Distribution):
.. math::
pdf(x; \mu, \sigma) = \\frac{1}{Z}e^{\\frac {-0.5 (x - \mu)^2} {\sigma^2} }
pdf(x; \mu, \sigma) = \frac{1}{Z}e^{\frac {-0.5 (x - \mu)^2} {\sigma^2} }
.. math::
......@@ -60,43 +60,43 @@ class Normal(distribution.Distribution):
* :math:`Z`: is the normalization constant.
Args:
loc(int|float|list|tuple|numpy.ndarray|Tensor): The mean of normal distribution.The data type is int, float, list, numpy.ndarray or Tensor.
scale(int|float|list|tuple|numpy.ndarray|Tensor): The std of normal distribution.The data type is int, float, list, numpy.ndarray or Tensor.
loc(int|float|list|tuple|numpy.ndarray|Tensor): The mean of normal distribution.The data type is float32 and float64.
scale(int|float|list|tuple|numpy.ndarray|Tensor): The std of normal distribution.The data type is float32 and float64.
name(str, optional): Name for the operation (optional, default is None). For more information, please refer to :ref:`api_guide_Name`.
Examples:
.. code-block:: python
import paddle
from paddle.distribution import Normal
# Define a single scalar Normal distribution.
dist = Normal(loc=0., scale=3.)
# Define a batch of two scalar valued Normals.
# The first has mean 1 and standard deviation 11, the second 2 and 22.
dist = Normal(loc=[1., 2.], scale=[11., 22.])
# Get 3 samples, returning a 3 x 2 tensor.
dist.sample([3])
# Define a batch of two scalar valued Normals.
# Both have mean 1, but different standard deviations.
dist = Normal(loc=1., scale=[11., 22.])
# Complete example
value_tensor = paddle.to_tensor([0.8], dtype="float32")
normal_a = Normal([0.], [1.])
normal_b = Normal([0.5], [2.])
sample = normal_a.sample([2])
# a random tensor created by normal distribution with shape: [2, 1]
entropy = normal_a.entropy()
# [1.4189385] with shape: [1]
lp = normal_a.log_prob(value_tensor)
# [-1.2389386] with shape: [1]
p = normal_a.probs(value_tensor)
# [0.28969154] with shape: [1]
kl = normal_a.kl_divergence(normal_b)
# [0.34939718] with shape: [1]
import paddle
from paddle.distribution import Normal
# Define a single scalar Normal distribution.
dist = Normal(loc=0., scale=3.)
# Define a batch of two scalar valued Normals.
# The first has mean 1 and standard deviation 11, the second 2 and 22.
dist = Normal(loc=[1., 2.], scale=[11., 22.])
# Get 3 samples, returning a 3 x 2 tensor.
dist.sample([3])
# Define a batch of two scalar valued Normals.
# Both have mean 1, but different standard deviations.
dist = Normal(loc=1., scale=[11., 22.])
# Complete example
value_tensor = paddle.to_tensor([0.8], dtype="float32")
normal_a = Normal([0.], [1.])
normal_b = Normal([0.5], [2.])
sample = normal_a.sample([2])
# a random tensor created by normal distribution with shape: [2, 1]
entropy = normal_a.entropy()
# [1.4189385] with shape: [1]
lp = normal_a.log_prob(value_tensor)
# [-1.2389386] with shape: [1]
p = normal_a.probs(value_tensor)
# [0.28969154] with shape: [1]
kl = normal_a.kl_divergence(normal_b)
# [0.34939718] with shape: [1]
"""
def __init__(self, loc, scale, name=None):
......@@ -153,11 +153,11 @@ class Normal(distribution.Distribution):
"""Generate samples of the specified shape.
Args:
shape (list): 1D `int32`. Shape of the generated samples.
seed (int): Python integer number.
shape (list): 1D `int32`. Shape of the generated samples.
seed (int): Python integer number.
Returns:
Tensor: A tensor with prepended dimensions shape.The data type is float32.
Tensor, A tensor with prepended dimensions shape.The data type is float32.
"""
if not _non_static_mode():
......@@ -198,14 +198,14 @@ class Normal(distribution.Distribution):
.. math::
entropy(\sigma) = 0.5 \\log (2 \pi e \sigma^2)
entropy(\sigma) = 0.5 \log (2 \pi e \sigma^2)
In the above equation:
* :math:`scale = \sigma`: is the std.
Returns:
Tensor: Shannon entropy of normal distribution.The data type is float32.
Tensor, Shannon entropy of normal distribution.The data type is float32.
"""
name = self.name + '_entropy'
......@@ -244,10 +244,10 @@ class Normal(distribution.Distribution):
"""Probability density/mass function.
Args:
value (Tensor): The input tensor.
value (Tensor): The input tensor.
Returns:
Tensor: probability.The data type is same with value.
Tensor, probability. The data type is same with value.
"""
name = self.name + '_probs'
......@@ -269,11 +269,11 @@ class Normal(distribution.Distribution):
.. math::
KL\_divergence(\mu_0, \sigma_0; \mu_1, \sigma_1) = 0.5 (ratio^2 + (\\frac{diff}{\sigma_1})^2 - 1 - 2 \\ln {ratio})
KL\_divergence(\mu_0, \sigma_0; \mu_1, \sigma_1) = 0.5 (ratio^2 + (\frac{diff}{\sigma_1})^2 - 1 - 2 \ln {ratio})
.. math::
ratio = \\frac{\sigma_0}{\sigma_1}
ratio = \frac{\sigma_0}{\sigma_1}
.. math::
......@@ -292,7 +292,7 @@ class Normal(distribution.Distribution):
other (Normal): instance of Normal.
Returns:
Tensor: kl-divergence between two normal distributions.The data type is float32.
Tensor, kl-divergence between two normal distributions.The data type is float32.
"""
if not _non_static_mode():
......
......@@ -67,11 +67,11 @@ class Transform(object):
used for transforming a random sample generated by ``Distribution``
instance.
Suppose :math:`X` is a K-dimensional random variable with probability
density function :math:`p_X(x)`. A new random variable :math:`Y = f(X)` may
be defined by transforming :math:`X` with a suitably well-behaved funciton
:math:`f`. It suffices for what follows to note that if f is one-to-one and
its inverse :math:`f^{-1}` have a well-defined Jacobian, then the density of
Suppose :math:`X` is a K-dimensional random variable with probability
density function :math:`p_X(x)`. A new random variable :math:`Y = f(X)` may
be defined by transforming :math:`X` with a suitably well-behaved funciton
:math:`f`. It suffices for what follows to note that if `f` is one-to-one and
its inverse :math:`f^{-1}` have a well-defined Jacobian, then the density of
:math:`Y` is
.. math::
......@@ -1049,8 +1049,9 @@ class StackTransform(Transform):
specific axis.
Args:
transforms(Sequence[Transform]): The sequence of transformations.
axis(int): The axis along which will be transformed.
transforms (Sequence[Transform]): The sequence of transformations.
axis (int, optional): The axis along which will be transformed. default
value is 0.
Examples:
......@@ -1058,7 +1059,6 @@ class StackTransform(Transform):
import paddle
x = paddle.stack(
(paddle.to_tensor([1., 2., 3.]), paddle.to_tensor([1, 2., 3.])), 1)
t = paddle.distribution.StackTransform(
......@@ -1071,11 +1071,13 @@ class StackTransform(Transform):
# [[2.71828175 , 1. ],
# [7.38905621 , 4. ],
# [20.08553696, 9. ]])
print(t.inverse(t.forward(x)))
# Tensor(shape=[3, 2], dtype=float32, place=Place(gpu:0), stop_gradient=True,
# [[1., 1.],
# [2., 2.],
# [3., 3.]])
print(t.forward_log_det_jacobian(x))
# Tensor(shape=[3, 2], dtype=float32, place=Place(gpu:0), stop_gradient=True,
# [[1. , 0.69314718],
......
......@@ -52,7 +52,7 @@ class Uniform(distribution.Distribution):
.. math::
pdf(x; a, b) = \\frac{1}{Z}, \ a <=x <b
pdf(x; a, b) = \frac{1}{Z}, \ a <=x <b
.. math::
......@@ -65,43 +65,45 @@ class Uniform(distribution.Distribution):
* :math:`Z`: is the normalizing constant.
The parameters `low` and `high` must be shaped in a way that supports
[broadcasting](https://www.paddlepaddle.org.cn/documentation/docs/en/develop/beginners_guide/basic_concept/broadcasting_en.html) (e.g., `high - low` is a valid operation).
:ref:`user_guide_broadcasting` (e.g., `high - low` is a valid operation).
Args:
low(int|float|list|tuple|numpy.ndarray|Tensor): The lower boundary of uniform distribution.The data type is int, float, list, numpy.ndarray or Tensor
high(int|float|list|tuple|numpy.ndarray|Tensor): The higher boundary of uniform distribution.The data type is int, float, list, numpy.ndarray or Tensor
name(str, optional): Name for the operation (optional, default is None). For more information, please refer to :ref:`api_guide_Name`.
low(int|float|list|tuple|numpy.ndarray|Tensor): The lower boundary of
uniform distribution.The data type is float32 and float64.
high(int|float|list|tuple|numpy.ndarray|Tensor): The higher boundary
of uniform distribution.The data type is float32 and float64.
name (str, optional): For details, please refer to :ref:`api_guide_Name`. Generally, no setting is required. Default: None.
Examples:
.. code-block:: python
import paddle
from paddle.distribution import Uniform
# Without broadcasting, a single uniform distribution [3, 4]:
u1 = Uniform(low=3.0, high=4.0)
# 2 distributions [1, 3], [2, 4]
u2 = Uniform(low=[1.0, 2.0], high=[3.0, 4.0])
# 4 distributions
u3 = Uniform(low=[[1.0, 2.0], [3.0, 4.0]],
high=[[1.5, 2.5], [3.5, 4.5]])
# With broadcasting:
u4 = Uniform(low=3.0, high=[5.0, 6.0, 7.0])
# Complete example
value_tensor = paddle.to_tensor([0.8], dtype="float32")
uniform = Uniform([0.], [2.])
sample = uniform.sample([2])
# a random tensor created by uniform distribution with shape: [2, 1]
entropy = uniform.entropy()
# [0.6931472] with shape: [1]
lp = uniform.log_prob(value_tensor)
# [-0.6931472] with shape: [1]
p = uniform.probs(value_tensor)
# [0.5] with shape: [1]
import paddle
from paddle.distribution import Uniform
# Without broadcasting, a single uniform distribution [3, 4]:
u1 = Uniform(low=3.0, high=4.0)
# 2 distributions [1, 3], [2, 4]
u2 = Uniform(low=[1.0, 2.0], high=[3.0, 4.0])
# 4 distributions
u3 = Uniform(low=[[1.0, 2.0], [3.0, 4.0]],
high=[[1.5, 2.5], [3.5, 4.5]])
# With broadcasting:
u4 = Uniform(low=3.0, high=[5.0, 6.0, 7.0])
# Complete example
value_tensor = paddle.to_tensor([0.8], dtype="float32")
uniform = Uniform([0.], [2.])
sample = uniform.sample([2])
# a random tensor created by uniform distribution with shape: [2, 1]
entropy = uniform.entropy()
# [0.6931472] with shape: [1]
lp = uniform.log_prob(value_tensor)
# [-0.6931472] with shape: [1]
p = uniform.probs(value_tensor)
# [0.5] with shape: [1]
"""
def __init__(self, low, high, name=None):
......@@ -157,11 +159,11 @@ class Uniform(distribution.Distribution):
"""Generate samples of the specified shape.
Args:
shape (list): 1D `int32`. Shape of the generated samples.
seed (int): Python integer number.
shape (list): 1D `int32`. Shape of the generated samples.
seed (int): Python integer number.
Returns:
Tensor: A tensor with prepended dimensions shape.The data type is float32.
Tensor, A tensor with prepended dimensions shape. The data type is float32.
"""
if not _non_static_mode():
......@@ -210,10 +212,10 @@ class Uniform(distribution.Distribution):
"""Log probability density/mass function.
Args:
value (Tensor): The input tensor.
value (Tensor): The input tensor.
Returns:
Tensor: log probability.The data type is same with value.
Tensor, log probability.The data type is same with value.
"""
value = self._check_values_dtype_in_probs(self.low, value)
......@@ -249,10 +251,10 @@ class Uniform(distribution.Distribution):
"""Probability density/mass function.
Args:
value (Tensor): The input tensor.
value (Tensor): The input tensor.
Returns:
Tensor: probability.The data type is same with value.
Tensor, probability. The data type is same with value.
"""
value = self._check_values_dtype_in_probs(self.low, value)
......@@ -291,7 +293,7 @@ class Uniform(distribution.Distribution):
entropy(low, high) = \\log (high - low)
Returns:
Tensor: Shannon entropy of uniform distribution.The data type is float32.
Tensor, Shannon entropy of uniform distribution.The data type is float32.
"""
name = self.name + '_entropy'
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -293,13 +293,13 @@ def ipu_shard_guard(index=-1, stage=-1):
The sharded model will be computed from small to large. The default value is -1,
which means no pipelining computation order and run Ops in terms of graph.
**Note**:
Only if the enable_manual_shard=True, the 'index' is able to be set not -1. Please refer
to :code:`paddle.static.IpuStrategy` .
Only if the enable_pipelining=True, the 'stage' is able to be set not -1. Please refer
to :code:`paddle.static.IpuStrategy` .
A index is allowed to match none stage or a stage. A stage is only allowed to match a new or
duplicated index.
Note:
Only if the enable_manual_shard=True, the 'index' is able to be set not -1. Please refer
to :ref:`api_paddle_static_IpuStrategy`.
Only if the enable_pipelining=True, the 'stage' is able to be set not -1. Please refer
to :ref:`api_paddle_static_IpuStrategy`.
A index is allowed to match none stage or a stage. A stage is only allowed to match a new or
duplicated index.
Examples:
.. code-block:: python
......@@ -338,6 +338,11 @@ def set_ipu_shard(call_func, index=-1, stage=-1):
"""
Shard the ipu with the given call function. Set every ops in call function to the given ipu sharding.
Note:
Only when enable_manual_shard=True to set the index to a value other than -1. please refer to :ref:`api_paddle_static_IpuStrategy` .
Only when enable_pipelining=True to set stage to a value other than -1. please refer to :ref:`api_paddle_static_IpuStrategy` .
An index supports a corresponding None stage or a stage, and a stage only supports a new index or a duplicate index.
Args:
call_func(Layer|function): Specify the call function to be wrapped.
index(int, optional): Specify which ipu the Tensor is computed on, (such as ‘0, 1, 2, 3’).
......@@ -349,7 +354,6 @@ def set_ipu_shard(call_func, index=-1, stage=-1):
Returns:
The wrapped call function.
Examples:
.. code-block:: python
......@@ -1038,19 +1042,20 @@ def cuda_pinned_places(device_count=None):
def mlu_places(device_ids=None):
"""
**Note**:
This function creates a list of :code:`paddle.device.MLUPlace` objects.
If :code:`device_ids` is None, environment variable of
:code:`FLAGS_selected_mlus` would be checked first. For example, if
:code:`FLAGS_selected_mlus=0,1,2`, the returned list would
be [paddle.device.MLUPlace(0), paddle.device.MLUPlace(1), paddle.device.MLUPlace(2)].
If :code:`FLAGS_selected_mlus` is not set, all visible
mlu places would be returned.
If :code:`device_ids` is not None, it should be the device
ids of MLUs. For example, if :code:`device_ids=[0,1,2]`,
the returned list would be
[paddle.device.MLUPlace(0), paddle.device.MLUPlace(1), paddle.device.MLUPlace(2)].
Note:
For multi-card tasks, please use `FLAGS_selected_mlus` environment variable to set the visible MLU device.
This function creates a list of :code:`paddle.device.MLUPlace` objects.
If :code:`device_ids` is None, environment variable of
:code:`FLAGS_selected_mlus` would be checked first. For example, if
:code:`FLAGS_selected_mlus=0,1,2`, the returned list would
be [paddle.device.MLUPlace(0), paddle.device.MLUPlace(1), paddle.device.MLUPlace(2)].
If :code:`FLAGS_selected_mlus` is not set, all visible
mlu places would be returned.
If :code:`device_ids` is not None, it should be the device
ids of MLUs. For example, if :code:`device_ids=[0,1,2]`,
the returned list would be
[paddle.device.MLUPlace(0), paddle.device.MLUPlace(1), paddle.device.MLUPlace(2)].
Parameters:
device_ids (list or tuple of int, optional): list of MLU device ids.
......
......@@ -79,7 +79,7 @@ def get_default_dtype():
Args:
None.
Returns:
The default dtype.
String, this global dtype only supports float16, float32, float64.
Examples:
.. code-block:: python
......
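A minimal sketch of the getter/setter pair, showing the string return value described above:

.. code-block:: python

    import paddle

    paddle.set_default_dtype("float64")
    print(paddle.get_default_dtype())   # 'float64'
    paddle.set_default_dtype("float32")
    print(paddle.get_default_dtype())   # 'float32'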
......@@ -647,10 +647,10 @@ def save(obj, path, protocol=4, **configs):
'''
Save an object to the specified path.
.. note::
Note:
Now supports saving ``state_dict`` of Layer/Optimizer, Tensor and nested structure containing Tensor, Program.
.. note::
Note:
Different from ``paddle.jit.save``, since the save result of ``paddle.save`` is a single file,
there is no need to distinguish multiple saved files by adding a suffix. The argument ``path``
of ``paddle.save`` will be directly used as the saved file name instead of a prefix.
......@@ -877,10 +877,10 @@ def load(path, **configs):
'''
Load an object can be used in paddle from specified path.
.. note::
Note:
Now supports loading ``state_dict`` of Layer/Optimizer, Tensor and nested structure containing Tensor, Program.
.. note::
Note:
In order to use the model parameters saved by paddle more efficiently,
``paddle.load`` supports loading ``state_dict`` of Layer from the result of
other save APIs except ``paddle.save`` , but the argument ``path`` format is
......@@ -896,7 +896,7 @@ def load(path, **configs):
``paddle.fluid.io.save_params/save_persistables`` , ``path`` need to be a
directory, such as ``model`` and model is a directory.
.. note::
Note:
If you load ``state_dict`` from the saved result of static mode API such as
``paddle.static.save`` or ``paddle.static.save_inference_model`` ,
the structured variable name in dynamic mode will cannot be restored.
......
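A minimal sketch of the ``paddle.save`` / ``paddle.load`` round trip for a ``state_dict``, assuming the single-file behaviour noted above:

.. code-block:: python

    import paddle

    layer = paddle.nn.Linear(3, 4)
    # ``path`` is used directly as the file name, not as a prefix
    paddle.save(layer.state_dict(), "linear.pdparams")

    state_dict = paddle.load("linear.pdparams")
    layer.set_state_dict(state_dict)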
......@@ -22,7 +22,7 @@ from paddle.incubate.autograd import primx, utils
def forward_grad(outputs, inputs, grad_inputs=None):
"""Forward mode of automatic differentiation.
.. note::
Note:
**ONLY available in the static mode and primitive operators.**
Args:
......@@ -105,7 +105,7 @@ def forward_grad(outputs, inputs, grad_inputs=None):
def grad(outputs, inputs, grad_outputs=None):
"""Reverse mode of automatic differentiation.
.. note::
Note:
**ONLY available in the static mode and primitive operators**
Args:
......
......@@ -547,7 +547,7 @@ def _lower(block, reverse, blacklist):
@framework.static_only
def orig2prim(block=None):
"""
.. note::
Note:
**This API is ONLY available in the static mode.**
**Args block must be None or current block of main program.**
......@@ -572,7 +572,7 @@ def orig2prim(block=None):
@framework.static_only
def prim2orig(block=None, blacklist=None):
"""
.. note::
Note:
**ONLY available in the static mode.**
**Args block must be None or current block of main program.**
......
......@@ -34,7 +34,7 @@ prim_option = PrimOption()
@framework.static_only
def prim_enabled():
"""
.. note::
Note:
**ONLY available in the static mode.**
Shows whether the automatic differentiation mechanism based on
......@@ -65,7 +65,7 @@ def prim_enabled():
@framework.static_only
def enable_prim():
"""
.. note::
Note:
**ONLY available in the static mode.**
Turns ON automatic differentiation mechanism based on automatic
......@@ -89,7 +89,7 @@ def enable_prim():
@framework.static_only
def disable_prim():
"""
.. note::
Note:
**ONLY available in the static mode.**
Turns OFF automatic differentiation mechanism based on automatic
......
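A minimal sketch of toggling the primitive-operator machinery in static mode, assuming these helpers are importable from ``paddle.incubate.autograd``:

.. code-block:: python

    import paddle
    from paddle.incubate.autograd import enable_prim, disable_prim, prim_enabled

    paddle.enable_static()
    enable_prim()
    print(prim_enabled())   # True
    disable_prim()
    print(prim_enabled())   # False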
......@@ -193,6 +193,7 @@ def interpolate(
"""
This API resizes a batch of images.
The input must be a 3-D Tensor of the shape (num_batches, channels, in_w)
or 4-D (num_batches, channels, in_h, in_w), or a 5-D Tensor of the shape
(num_batches, channels, in_d, in_h, in_w) or (num_batches, in_d, in_h, in_w, channels),
......@@ -201,12 +202,13 @@ def interpolate(
and the resizing only applies on the three dimensions(depth, height and width).
Supporting resample methods:
'linear' : Linear interpolation
'bilinear' : Bilinear interpolation
'trilinear' : Trilinear interpolation
'nearest' : Nearest neighbor interpolation
'bicubic' : Bicubic interpolation
'area': Area interpolation
- 'linear' : Linear interpolation
- 'bilinear' : Bilinear interpolation
- 'trilinear' : Trilinear interpolation
- 'nearest' : Nearest neighbor interpolation
- 'bicubic' : Bicubic interpolation
- 'area': Area interpolation
Linear interpolation is the method of using a line connecting two known quantities
to determine the value of an unknown quantity between the two known quantities.
......@@ -243,13 +245,13 @@ def interpolate(
.. code-block:: text
For scale_factor:
# For scale_factor:
if align_corners = True && out_size > 1 :
scale_factor = (in_size-1.0)/(out_size-1.0)
else:
scale_factor = float(in_size/out_size)
Linear interpolation:
# Linear interpolation:
if:
align_corners = False , align_mode = 0
input : (N,C,W_in)
......@@ -260,7 +262,7 @@ def interpolate(
output: (N,C,W_out) where:
W_out = W_{in} * scale_{factor}
Nearest neighbor interpolation:
# Nearest neighbor interpolation:
align_corners = False
input : (N,C,H_in,W_in)
......@@ -268,7 +270,7 @@ def interpolate(
H_out = floor (H_{in} * scale_{factor})
W_out = floor (W_{in} * scale_{factor})
Bilinear interpolation:
# Bilinear interpolation:
if:
align_corners = False , align_mode = 0
input : (N,C,H_in,W_in)
......@@ -281,7 +283,7 @@ def interpolate(
H_out = H_{in} * scale_{factor}
W_out = W_{in} * scale_{factor}
Bicubic interpolation:
# Bicubic interpolation:
if:
align_corners = False
input : (N,C,H_in,W_in)
......@@ -294,7 +296,7 @@ def interpolate(
H_out = H_{in} * scale_{factor}
W_out = W_{in} * scale_{factor}
Trilinear interpolation:
# Trilinear interpolation:
if:
align_corners = False , align_mode = 0
input : (N,C,D_in,H_in,W_in)
......@@ -969,15 +971,16 @@ def dropout(
training (bool, optional): A flag indicating whether it is in train phrase or not. Default True.
mode(str, optional): ['upscale_in_train'(default) | 'downscale_in_infer'].
1. upscale_in_train(default), upscale the output at training time
1. upscale_in_train(default), upscale the output at training time
- train: out = input * mask / ( 1.0 - dropout_prob )
- inference: out = input
- train: out = input * mask / ( 1.0 - dropout_prob )
- inference: out = input
2. downscale_in_infer, downscale the output at inference
2. downscale_in_infer, downscale the output at inference
- train: out = input * mask
- inference: out = input * (1.0 - dropout_prob)
- train: out = input * mask
- inference: out = input * (1.0 - dropout_prob)
name (str, optional): Name for the operation (optional, default is None). For more information, please refer to :ref:`api_guide_Name`.
Returns:
......@@ -1923,12 +1926,12 @@ def linear(x, weight, bias=None, name=None):
def label_smooth(label, prior_dist=None, epsilon=0.1, name=None):
r"""
Label smoothing is a mechanism to regularize the classifier layer and is called
label-smoothing regularization (LSR).
label-smoothing regularization (LSR).Label smoothing is proposed to encourage
the model to be less confident, since optimizing the log-likelihood of the
correct label directly may cause overfitting and reduce the ability of the
model to adapt.
Label smoothing is proposed to encourage the model to be less confident,
since optimizing the log-likelihood of the correct label directly may
cause overfitting and reduce the ability of the model to adapt. Label
smoothing replaces the ground-truth label :math:`y` with the weighted sum
Label smoothing replaces the ground-truth label :math:`y` with the weighted sum
of itself and some fixed distribution :math:`\mu`. For class :math:`k`,
i.e.
......
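A minimal sketch of ``label_smooth`` applied to one-hot labels, following the LSR formula above:

.. code-block:: python

    import paddle
    import paddle.nn.functional as F

    # one-hot labels for 2 samples over 3 classes
    label = paddle.to_tensor([[0., 1., 0.],
                              [1., 0., 0.]])
    # out = (1 - epsilon) * label + epsilon / num_classes
    smoothed = F.label_smooth(label, epsilon=0.1)
    print(smoothed)   # ones become ~0.9333, zeros become ~0.0333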
......@@ -379,18 +379,6 @@ def conv1d(
A tensor representing the conv1d, whose data type is the
same with input.
Raises:
ValueError: If the channel dimension of the input is less than or equal to zero.
ValueError: If `data_format` is not "NCL" or "NLC".
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is a list/tuple, but the element corresponding to the input's batch size is not 0
or the element corresponding to the input's channel is not 0.
ShapeError: If the input is not 3-D Tensor.
ShapeError: If the input's dimension size and filter's dimension size not equal.
ShapeError: If the dimension size of input minus the size of `stride` is not 1.
ShapeError: If the number of input channels is not equal to filter's channels * groups.
ShapeError: If the number of output channels is not be divided by groups.
Examples:
.. code-block:: python
......@@ -672,18 +660,6 @@ def conv2d(
Returns:
A Tensor representing the conv2d result, whose data type is the same with input.
Raises:
ValueError: If `data_format` is not "NCHW" or "NHWC".
ValueError: If the channel dimension of the input is less than or equal to zero.
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is a list/tuple, but the element corresponding to the input's batch size is not 0
or the element corresponding to the input's channel is not 0.
ShapeError: If the input is not 4-D Tensor.
ShapeError: If the input's dimension size and filter's dimension size not equal.
ShapeError: If the dimension size of input minus the size of `stride` is not 2.
ShapeError: If the number of input channels is not equal to filter's channels * groups.
ShapeError: If the number of output channels is not be divided by groups.
Examples:
.. code-block:: python
......@@ -929,19 +905,6 @@ def conv1d_transpose(
when data_format is `"NCL"` and (num_batches, length, channels) when data_format is
`"NLC"`.
Raises:
ValueError: If `data_format` is a string, but not "NCL" or "NLC".
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is a list/tuple, but the element corresponding to the input's batch size is not 0
or the element corresponding to the input's channel is not 0.
ValueError: If `output_size` and filter_size are None at the same time.
ValueError: If `output_padding` is greater than `stride`.
ShapeError: If the input is not 3-D Tensor.
ShapeError: If the input's dimension size and filter's dimension size not equal.
ShapeError: If the dimension size of input minus the size of `stride` is not 1.
ShapeError: If the number of input channels is not equal to filter's channels.
ShapeError: If the size of `output_size` is not equal to that of `stride`.
Examples:
.. code-block:: python
......@@ -1255,18 +1218,6 @@ def conv2d_transpose(
out_w) or (num_batches, out_h, out_w, channels). The tensor variable storing
transposed convolution result.
Raises:
ValueError: If `data_format` is not "NCHW" or "NHWC".
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is a list/tuple, but the element corresponding to the input's batch size is not 0
or the element corresponding to the input's channel is not 0.
ValueError: If `output_size` and kernel_size are None at the same time.
ShapeError: If the input is not 4-D Tensor.
ShapeError: If the input's dimension size and filter's dimension size not equal.
ShapeError: If the dimension size of input minus the size of `stride` is not 2.
ShapeError: If the number of input channels is not equal to filter's channels.
ShapeError: If the size of `output_size` is not equal to that of `stride`.
Examples:
.. code-block:: python
......@@ -1771,18 +1722,6 @@ def conv3d_transpose(
variable storing the transposed convolution result, and if act is not None, the tensor
variable storing transposed convolution and non-linearity activation result.
Raises:
ValueError: If `data_format` is not "NCDHW" or "NDHWC".
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is a list/tuple, but the element corresponding to the input's batch size is not 0
or the element corresponding to the input's channel is not 0.
ValueError: If `output_size` and kernel_size are None at the same time.
ShapeError: If the input is not 5-D Tensor.
ShapeError: If the input's dimension size and filter's dimension size not equal.
ShapeError: If the dimension size of input minus the size of `stride` is not 2.
ShapeError: If the number of input channels is not equal to filter's channels.
ShapeError: If the size of `output_size` is not equal to that of `stride`.
Examples:
.. code-block:: python
......
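A minimal sketch of the functional conv API whose ``Raises`` sections are removed above, shown here for ``conv2d``:

.. code-block:: python

    import paddle
    import paddle.nn.functional as F

    x = paddle.rand([1, 3, 8, 8])    # NCHW input
    w = paddle.rand([6, 3, 3, 3])    # 6 filters of size 3x3 over 3 input channels
    y = F.conv2d(x, w, padding=1)
    print(y.shape)                   # [1, 6, 8, 8]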
......@@ -366,9 +366,6 @@ def temporal_shift(x, seg_num, shift_ratio=0.25, name=None, data_format="NCHW"):
out(Tensor): The temporal shifting result is a tensor with the
same shape and same data type as the input.
Raises:
TypeError: seg_num must be int type.
Examples:
.. code-block:: python
......
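A minimal sketch of ``temporal_shift`` on an [N*T, C, H, W] input, matching the docstring above:

.. code-block:: python

    import paddle
    import paddle.nn.functional as F

    # N=2 clips of T=2 frames each, so the batch axis holds N*T=4
    x = paddle.rand([4, 4, 8, 8])
    out = F.temporal_shift(x, seg_num=2, shift_ratio=0.25)
    print(out.shape)   # [4, 4, 8, 8], same shape as the input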
......@@ -939,15 +939,18 @@ def hsigmoid_loss(
"""
The hierarchical sigmoid organizes the classes into a complete binary tree to reduce the computational complexity
and speed up the model training, especially the training of language model.
Each leaf node of the complete binary tree represents a class(word) and each non-leaf node acts as a binary classifier.
For each class(word), there's a unique path from root to itself, hsigmoid calculate the cost for each non-leaf node on
the path, and sum them to get a total cost.
Comparing to softmax, the OP can reduce the computational complexity from :math:`O(N)` to :math:`O(logN)`, where :math:`N`
Comparing to softmax, hsigmoid can reduce the computational complexity from :math:`O(N)` to :math:`O(logN)`, where :math:`N`
represents the number of classes or the size of word dict.
The OP supports default tree and custom tree. For the default tree, you can refer to `Hierarchical Probabilistic Neural
Network Language Model <http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf>`_. For the custom
tree, you need to set :attr:`is_custom` to True, and do the following steps (take the language model as an example):
The API supports default tree and custom tree. For the default tree, you can refer to `Hierarchical Probabilistic Neural
Network Language Model <http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf>`_.
For the custom tree, you need to set :attr:`is_custom` to True, and do the following steps (take the language model as an example):
1. Using a custom word dict to build a binary tree, each leaf node should be an word in the word dict.
2. Creating a dict map word_id -> path that from the word to the root node, we call it path_table.
......@@ -1102,17 +1105,17 @@ def smooth_l1_loss(input, label, reduction='mean', delta=1.0, name=None):
.. math::
loss(x,y) = \frac{1}{n}\sum_{i}z_i
loss(x,y) = \frac{1}{n}\sum_{i}z_i
where z_i is given by:
where :math:`z_i` is given by:
.. math::
\mathop{z_i} = \left\{\begin{array}{rcl}
0.5(x_i - y_i)^2 & & {if |x_i - y_i| < delta} \\
delta * |x_i - y_i| - 0.5 * delta^2 & & {otherwise}
\end{array} \right.
0.5(x_i - y_i)^2 & & {if |x_i - y_i| < \delta} \\
\delta * |x_i - y_i| - 0.5 * \delta^2 & & {otherwise}
\end{array} \right.
Parameters:
input (Tensor): Input tensor, the data type is float32 or float64. Shape is
......@@ -1126,12 +1129,11 @@ def smooth_l1_loss(input, label, reduction='mean', delta=1.0, name=None):
If :attr:`reduction` is ``'sum'``, the reduced sum loss is returned.
If :attr:`reduction` is ``'none'``, the unreduced loss is returned.
Default is ``'mean'``.
delta (float, optional): Specifies the hyperparameter delta to be used.
delta (float, optional): Specifies the hyperparameter :math:`\delta` to be used.
The value determines how large the errors need to be to use L1. Errors
smaller than delta are minimized with L2. Parameter is ignored for
negative/zero values. Default = 1.0
name (str, optional): Name for the operation (optional, default is
None). For more information, please refer to :ref:`api_guide_Name`.
name (str, optional): For details, please refer to :ref:`api_guide_Name`. Generally, no setting is required. Default: None.
Returns:
Tensor, The tensor variable storing the smooth_l1_loss of input and label.
......@@ -1140,14 +1142,12 @@ def smooth_l1_loss(input, label, reduction='mean', delta=1.0, name=None):
.. code-block:: python
import paddle
import numpy as np
input_data = np.random.rand(3,3).astype("float32")
label_data = np.random.rand(3,3).astype("float32")
input = paddle.to_tensor(input_data)
label = paddle.to_tensor(label_data)
input = paddle.rand([3, 3]).astype('float32')
label = paddle.rand([3, 3]).astype('float32')
output = paddle.nn.functional.smooth_l1_loss(input, label)
print(output)
# [0.068004]
"""
check_variable_and_dtype(
input, 'input', ['float32', 'float64'], 'smooth_l1_loss'
......@@ -1310,7 +1310,7 @@ def margin_ranking_loss(
def l1_loss(input, label, reduction='mean', name=None):
r"""
This operator computes the L1 Loss of Tensor ``input`` and ``label`` as follows.
Computes the L1 Loss of Tensor ``input`` and ``label`` as follows.
If `reduction` set to ``'none'``, the loss is:
......@@ -1341,8 +1341,8 @@ def l1_loss(input, label, reduction='mean', name=None):
Returns:
Tensor, the L1 Loss of Tensor ``input`` and ``label``.
If `reduction` is ``'none'``, the shape of output loss is [N, *], the same as ``input`` .
If `reduction` is ``'mean'`` or ``'sum'``, the shape of output loss is [1].
If `reduction` is ``'none'``, the shape of output loss is [N, *], the same as ``input`` .
If `reduction` is ``'mean'`` or ``'sum'``, the shape of output loss is [1].
Examples:
.. code-block:: python
......@@ -1536,7 +1536,7 @@ def nll_loss(
def kl_div(input, label, reduction='mean', name=None):
r"""
This operator calculates the Kullback-Leibler divergence loss
Calculate the Kullback-Leibler divergence loss
between Input(X) and Input(Target). Notes that Input(X) is the
log-probability and Input(Target) is the probability.
......@@ -1581,31 +1581,26 @@ def kl_div(input, label, reduction='mean', name=None):
.. code-block:: python
import paddle
import numpy as np
import paddle.nn.functional as F
shape = (5, 20)
input = np.random.uniform(-10, 10, shape).astype('float32')
target = np.random.uniform(-10, 10, shape).astype('float32')
x = paddle.uniform(shape, min=-10, max=10).astype('float32')
target = paddle.uniform(shape, min=-10, max=10).astype('float32')
# 'batchmean' reduction, loss shape will be [1]
pred_loss = F.kl_div(paddle.to_tensor(input),
paddle.to_tensor(target), reduction='batchmean')
pred_loss = F.kl_div(x, target, reduction='batchmean')
# shape=[1]
# 'mean' reduction, loss shape will be [1]
pred_loss = F.kl_div(paddle.to_tensor(input),
paddle.to_tensor(target), reduction='mean')
pred_loss = F.kl_div(x, target, reduction='mean')
# shape=[1]
# 'sum' reduction, loss shape will be [1]
pred_loss = F.kl_div(paddle.to_tensor(input),
paddle.to_tensor(target), reduction='sum')
pred_loss = F.kl_div(x, target, reduction='sum')
# shape=[1]
# 'none' reduction, loss shape is same with input shape
pred_loss = F.kl_div(paddle.to_tensor(input),
paddle.to_tensor(target), reduction='none')
pred_loss = F.kl_div(x, target, reduction='none')
# shape=[5, 20]
"""
......@@ -1862,9 +1857,7 @@ def margin_cross_entropy(
.. hint::
The API supports single GPU and multi GPU, and don't supports CPU.
For data parallel mode, set ``group=False``.
For model parallel mode, set ``group=None`` or the group instance return by paddle.distributed.new_group.
And logits.shape[-1] can be different at each rank.
......@@ -1876,7 +1869,7 @@ def margin_cross_entropy(
margin2 (float, optional): m2 of margin loss, default value is `0.5`.
margin3 (float, optional): m3 of margin loss, default value is `0.0`.
scale (float, optional): s of margin loss, default value is `64.0`.
group (Group, optional): The group instance return by paddle.distributed.new_group
group (Group, optional): The group instance return by paddle.distributed.new_group
or ``None`` for global default group or ``False`` for data parallel (do not communication cross ranks).
Default is ``None``.
return_softmax (bool, optional): Whether return softmax probability. Default value is `False`.
......@@ -1887,12 +1880,12 @@ def margin_cross_entropy(
Default value is `'mean'`.
Returns:
``Tensor`` or Tuple of two ``Tensor`` : Return the cross entropy loss if \
`return_softmax` is False, otherwise the tuple \
(loss, softmax), softmax is shard_softmax when \
using model parallel, otherwise softmax is in \
the same shape with input logits. If ``reduction == None``, \
the shape of loss is ``[N, 1]``, otherwise the shape is ``[1]``.
Tensor|tuple[Tensor, Tensor], return the cross entropy loss if
`return_softmax` is False, otherwise the tuple (loss, softmax),
softmax is shard_softmax when using model parallel, otherwise
softmax is in the same shape with input logits. If
``reduction == None``, the shape of loss is ``[N, 1]``, otherwise
the shape is ``[1]``.
Examples:
......@@ -1932,7 +1925,7 @@ def margin_cross_entropy(
print(label)
print(loss)
print(softmax)
#Tensor(shape=[2, 4], dtype=float64, place=CUDAPlace(0), stop_gradient=True,
# [[ 0.85204151, -0.55557678, 0.04994566, 0.71986042],
# [-0.20198586, -0.35270476, -0.55182702, 0.09749021]])
......@@ -1993,7 +1986,7 @@ def margin_cross_entropy(
print(loss)
print(softmax)
# python -m paddle.distributed.launch --gpus=0,1 test_margin_cross_entropy.py
# python -m paddle.distributed.launch --gpus=0,1 test_margin_cross_entropy.py
## for rank0 input
#Tensor(shape=[4, 4], dtype=float64, place=CUDAPlace(0), stop_gradient=True,
# [[ 0.32888934, 0.02408748, -0.02763289, 0.18173063],
......@@ -3245,6 +3238,19 @@ def multi_label_soft_margin_loss(
input, label, weight=None, reduction="mean", name=None
):
r"""
Calculate a multi-class multi-classification
hinge loss (margin-based loss) between input :math:`x` (a 2D mini-batch `Tensor`)
and output :math:`y` (which is a 2D `Tensor` of target class indices).
For each sample in the mini-batch:
.. math::
\text{loss}(x, y) = \sum_{ij}\frac{\max(0, 1 - (x[y[j]] - x[i]))}{\text{x.size}(0)}
where :math:`x \in \left\{0, \; \cdots , \; \text{x.size}(0) - 1\right\}`, \
:math:`y \in \left\{0, \; \cdots , \; \text{y.size}(0) - 1\right\}`, \
:math:`0 \leq y[j] \leq \text{x.size}(0)-1`, \
and :math:`i \neq y[j]` for all :math:`i` and :math:`j`.
:math:`y` and :math:`x` must have the same size.
Parameters:
input (Tensor): Input tensor, the data type is float32 or float64. Shape is (N, C), where C is number of classes, and if shape is more than 2D, this is (N, C, D1, D2,..., Dk), k >= 1.
......@@ -3338,7 +3344,7 @@ def multi_label_soft_margin_loss(
def hinge_embedding_loss(input, label, margin=1.0, reduction='mean', name=None):
r"""
This operator calculates hinge_embedding_loss. Measures the loss given an input tensor :math:`x` and a labels tensor :math:`y`(containing 1 or -1).
Calculates hinge_embedding_loss. Measures the loss given an input tensor :math:`x` and a labels tensor :math:`y`(containing 1 or -1).
This is usually used for measuring whether two inputs are similar or dissimilar, e.g. using the L1 pairwise distance as :math:`x`,
and is typically used for learning nonlinear embeddings or semi-supervised learning.
......
......@@ -36,7 +36,7 @@ __all__ = []
def normalize(x, p=2, axis=1, epsilon=1e-12, name=None):
r"""
This op normalizes ``x`` along dimension ``axis`` using :math:`L_p` norm. This layer computes
Normalize ``x`` along dimension ``axis`` using :math:`L_p` norm. This layer computes
.. math::
......@@ -50,7 +50,7 @@ def normalize(x, p=2, axis=1, epsilon=1e-12, name=None):
Parameters:
x (Tensor): The input tensor could be N-D tensor, and the input data type could be float32 or float64.
p (float|int, optional): The exponent value in the norm formulation. Default: 2
p (float|int, optional): The exponent value in the norm formulation. Default: 2.
axis (int, optional): The axis on which to apply normalization. If `axis < 0`, the dimension to normalization is `x.ndim + axis`. -1 is the last dimension.
epsilon (float, optional): Small float added to denominator to avoid dividing by zero. Default is 1e-12.
name (str, optional): Name for the operation (optional, default is None). For more information, please refer to :ref:`api_guide_Name`.
......@@ -357,11 +357,8 @@ def layer_norm(
.. code-block:: python
import paddle
import numpy as np
np.random.seed(123)
x_data = np.random.random(size=(2, 2, 2, 3)).astype('float32')
x = paddle.to_tensor(x_data)
x = paddle.rand((2, 2, 2, 3))
layer_norm_out = paddle.nn.functional.layer_norm(x, x.shape[1:])
print(layer_norm_out)
"""
......@@ -468,14 +465,14 @@ def instance_norm(
Parameters:
x(Tensor): Input Tensor. It's data type should be float32, float64.
running_mean(Tensor): running mean. Default None.
running_var(Tensor): running variance. Default None.
running_mean(Tensor, optional): running mean. Default None.
running_var(Tensor, optional): running variance. Default None.
weight(Tensor, optional): The weight tensor of instance_norm. Default: None.
bias(Tensor, optional): The bias tensor of instance_norm. Default: None.
eps(float, optional): A value added to the denominator for numerical stability. Default is 1e-5.
momentum(float, optional): The value used for the moving_mean and moving_var computation. Default: 0.9.
use_input_stats(bool): Default True.
data_format(str, optional): Specify the input data format, may be "NC", "NCL", "NCHW" or "NCDHW". Default "NCHW".
use_input_stats(bool, optional): Default True.
data_format(str, optional): Specify the input data format, may be "NC", "NCL", "NCHW" or "NCDHW". Defalut "NCHW".
name(str, optional): Name for the InstanceNorm, default is None. For more information, please refer to :ref:`api_guide_Name`..
Returns:
......@@ -486,11 +483,8 @@ def instance_norm(
.. code-block:: python
import paddle
import numpy as np
np.random.seed(123)
x_data = np.random.random(size=(2, 2, 2, 3)).astype('float32')
x = paddle.to_tensor(x_data)
x = paddle.rand((2, 2, 2, 3))
instance_norm_out = paddle.nn.functional.instance_norm(x)
print(instance_norm_out)
......
......@@ -665,12 +665,6 @@ def max_pool1d(
Returns:
Tensor: The output tensor of pooling result. The data type is same as input tensor.
Raises:
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is "VALID", but `ceil_mode` is True.
ShapeError: If the input is not a 3-D tensor.
ShapeError: If the output's shape calculated is not greater than 0.
Examples:
.. code-block:: python
......@@ -1313,11 +1307,6 @@ def max_pool2d(
Returns:
Tensor: The output tensor of pooling result. The data type is same as input tensor.
Raises:
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is "VALID", but `ceil_mode` is True.
ShapeError: If the output's shape calculated is not greater than 0.
Examples:
.. code-block:: python
......@@ -1507,11 +1496,6 @@ def max_pool3d(
Returns:
Tensor: The output tensor of pooling result. The data type is same as input tensor.
Raises:
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is "VALID", but `ceil_mode` is True.
ShapeError: If the output's shape calculated is not greater than 0.
Examples:
.. code-block:: python
......@@ -2053,8 +2037,7 @@ def adaptive_max_pool1d(x, output_size, return_mask=False, name=None):
Returns:
Tensor: The output tensor of adaptive pooling result. The data type is same
as input tensor.
Raises:
ValueError: 'output_size' should be an integer.
Examples:
.. code-block:: python
......
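A minimal sketch of the pooling functions whose ``Raises`` sections are removed above, shown here for ``max_pool2d``:

.. code-block:: python

    import paddle
    import paddle.nn.functional as F

    x = paddle.rand([1, 3, 32, 32])
    out = F.max_pool2d(x, kernel_size=2, stride=2)
    print(out.shape)   # [1, 3, 16, 16]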
......@@ -37,7 +37,7 @@ def affine_grid(theta, out_shape, align_corners=True, name=None):
Args:
theta (Tensor) - A tensor with shape [N, 2, 3] or [N, 3, 4]. It contains a batch of affine transform parameters.
The data type can be float32 or float64.
out_shape (Tensor | list | tuple): Type can be a 1-D Tensor, list, or tuple. It is used to represent the shape of the output in an affine transformation, in the format ``[N, C, H, W]`` or ``[N, C, D, H, W]``.
out_shape (Tensor | list | tuple): Type can be a 1-D Tensor, list, or tuple. It is used to represent the shape of the output in an affine transformation, in the format ``[N, C, H, W]`` or ``[N, C, D, H, W]``.
When the format is ``[N, C, H, W]``, it represents the batch size, number of channels, height and width. When the format is ``[N, C, D, H, W]``, it represents the batch size, number of channels, depth, height and width.
The data type must be int32.
align_corners(bool, optional): if True, aligns the centers of the 4 (4D) or 8 (5D) corner pixels of the input and output tensors, and preserves the value of the corner pixels. Default: True
......@@ -60,7 +60,7 @@ def affine_grid(theta, out_shape, align_corners=True, name=None):
[1, 2, 3, 3],
align_corners=False)
print(y_t)
#[[[[ 1.0333333 0.76666665]
# [ 0.76666665 1.0999999 ]
# [ 0.5 1.4333333 ]]
......@@ -84,62 +84,82 @@ def affine_grid(theta, out_shape, align_corners=True, name=None):
if theta.shape[1] == 3:
use_cudnn = False
if is_compiled_with_rocm():
use_cudnn = False # ROCM platform do not have MIOPEN kernel for affine_grid
use_cudnn = (
False # ROCM platform do not have MIOPEN kernel for affine_grid
)
if in_dygraph_mode():
_out_shape = out_shape.numpy().tolist() if isinstance(
out_shape, Variable) else out_shape
_out_shape = (
out_shape.numpy().tolist()
if isinstance(out_shape, Variable)
else out_shape
)
return _C_ops.affine_grid(theta, _out_shape, use_cudnn, align_corners)
elif in_dynamic_mode():
_out_shape = out_shape.numpy().tolist() if isinstance(
out_shape, Variable) else out_shape
return _legacy_C_ops.affine_grid(theta, "output_shape", _out_shape,
"align_corners", align_corners,
"use_cudnn", use_cudnn)
_out_shape = (
out_shape.numpy().tolist()
if isinstance(out_shape, Variable)
else out_shape
)
return _legacy_C_ops.affine_grid(
theta,
"output_shape",
_out_shape,
"align_corners",
align_corners,
"use_cudnn",
use_cudnn,
)
helper = LayerHelper('affine_grid')
check_variable_and_dtype(theta, 'theta', ['float32', 'float64'],
'affine_grid')
check_variable_and_dtype(
theta, 'theta', ['float32', 'float64'], 'affine_grid'
)
out = helper.create_variable_for_type_inference(theta.dtype)
ipts = {'Theta': theta}
attrs = {"align_corners": align_corners, "use_cudnn": use_cudnn}
if isinstance(out_shape, Variable):
ipts['OutputShape'] = out_shape
check_variable_and_dtype(out_shape, 'out_shape', ['int32'],
'affine_grid')
check_variable_and_dtype(
out_shape, 'out_shape', ['int32'], 'affine_grid'
)
else:
attrs['output_shape'] = out_shape
helper.append_op(type='affine_grid',
inputs=ipts,
outputs={'Output': out},
attrs=None if len(attrs) == 0 else attrs)
helper.append_op(
type='affine_grid',
inputs=ipts,
outputs={'Output': out},
attrs=None if len(attrs) == 0 else attrs,
)
return out
def grid_sample(x,
grid,
mode='bilinear',
padding_mode='zeros',
align_corners=True,
name=None):
def grid_sample(
x,
grid,
mode='bilinear',
padding_mode='zeros',
align_corners=True,
name=None,
):
"""
This operation samples input X by using bilinear interpolation or
Sample input X by using bilinear interpolation or
nearest interpolation based on flow field grid, which is usually
generated by :code:`affine_grid` . When the input X is 4-D Tensor,
the grid of shape [N, H, W, 2] is the concatenation of (x, y)
coordinates with shape [N, H, W] each, where x is indexing the 4th
dimension (in width dimension) of input data x and y is indexing
the 3rd dimension (in height dimension), finally results is the
generated by :code:`affine_grid` . When the input X is 4-D Tensor,
the grid of shape [N, H, W, 2] is the concatenation of (x, y)
coordinates with shape [N, H, W] each, where x is indexing the 4th
dimension (in width dimension) of input data x and y is indexing
the 3rd dimension (in height dimension), finally results is the
bilinear interpolation or nearest value of 4 nearest corner
points. The output tensor shape will be [N, C, H, W]. When the input X
is 5-D Tensor, the grid of shape [N, D, H, W, 3] is the concatenation
of (x, y, z) coordinates with shape [N, D, H, W] each, where x is
indexing the 5th dimension (in width dimension) of input data x, y is
indexing the 4th dimension (in height dimension) and z is indexing the
3rd dimension (in depth dimension) finally results is the bilinear
interpolation or nearest value of 8 nearest cornerpoints. The output
tensor shape will be [N, C, D, H, W].
points. The output tensor shape will be [N, C, H, W]. When the input X
is 5-D Tensor, the grid of shape [N, D, H, W, 3] is the concatenation
of (x, y, z) coordinates with shape [N, D, H, W] each, where x is
indexing the 5th dimension (in width dimension) of input data x, y is
indexing the 4th dimension (in height dimension) and z is indexing the
3rd dimension (in depth dimension) finally results is the bilinear
interpolation or nearest value of 8 nearest cornerpoints. The output
tensor shape will be [N, C, D, H, W].
......@@ -153,7 +173,7 @@ def grid_sample(x,
grid_y = 0.5 * (grid[:, :, :, 1] + 1) * (H - 1)
Step 2:
Indices input data X with grid (x, y) in each [H, W] area, and bilinear
interpolate point value by 4 nearest points or nearest interpolate point value
by nearest point.
......@@ -189,12 +209,12 @@ def grid_sample(x,
Args:
x(Tensor): The input tensor, which is a 4-d tensor with shape
[N, C, H, W] or a 5-d tensor with shape [N, C, D, H, W],
N is the batch size, C is the channel number,
[N, C, H, W] or a 5-d tensor with shape [N, C, D, H, W],
N is the batch size, C is the channel number,
D, H and W is the feature depth, height and width.
The data type is float32 or float64.
grid(Tensor): Input grid tensor, which is a 4-d tensor with shape [N, grid_H,
grid_W, 2] or a 5-d tensor with shape [N, grid_D, grid_H,
grid(Tensor): Input grid tensor, which is a 4-d tensor with shape [N, grid_H,
grid_W, 2] or a 5-d tensor with shape [N, grid_D, grid_H,
grid_W, 3]. The data type is float32 or float64.
mode(str, optional): The interpolation method which can be 'bilinear' or 'nearest'.
Default: 'bilinear'.
......@@ -209,17 +229,18 @@ def grid_sample(x,
None by default.
Returns:
Tensor, The shape of output is [N, C, grid_H, grid_W] or [N, C, grid_D, grid_H, grid_W] in which `grid_D` is the depth of grid,
Tensor, The shape of output is [N, C, grid_H, grid_W] or [N, C, grid_D, grid_H, grid_W] in which `grid_D` is the depth of grid,
`grid_H` is the height of grid and `grid_W` is the width of grid. The data type is same as input tensor.
Examples:
.. code-block:: python
import paddle
import paddle.nn.functional as F
# x shape=[1, 1, 3, 3]
# x shape=[1, 1, 3, 3]
x = paddle.to_tensor([[[[-0.6, 0.8, -0.5],
[-0.5, 0.2, 1.2],
[ 1.4, 0.3, -0.2]]]],dtype='float64')
......@@ -243,7 +264,7 @@ def grid_sample(x,
padding_mode='border',
align_corners=True)
print(y_t)
# output shape = [1, 1, 3, 4]
# [[[[ 0.34 0.016 0.086 -0.448]
# [ 0.55 -0.076 0.35 0.59 ]
......@@ -254,22 +275,33 @@ def grid_sample(x,
_padding_modes = ['zeros', 'reflection', 'border']
if mode not in _modes:
raise ValueError(
"The mode of grid sample function should be in {}, but got: {}".
format(_modes, mode))
"The mode of grid sample function should be in {}, but got: {}".format(
_modes, mode
)
)
if padding_mode not in _padding_modes:
raise ValueError(
"The padding mode of grid sample function should be in {}, but got: {}"
.format(_padding_modes, padding_mode))
"The padding mode of grid sample function should be in {}, but got: {}".format(
_padding_modes, padding_mode
)
)
if not isinstance(align_corners, bool):
raise ValueError("The align corners should be bool, but got: {}".format(
align_corners))
raise ValueError(
"The align corners should be bool, but got: {}".format(
align_corners
)
)
cudnn_version = get_cudnn_version()
use_cudnn = False
if not is_compiled_with_rocm() and (
cudnn_version is not None
) and align_corners and mode == 'bilinear' and padding_mode == 'zeros':
if (
not is_compiled_with_rocm()
and (cudnn_version is not None)
and align_corners
and mode == 'bilinear'
and padding_mode == 'zeros'
):
use_cudnn = True
# CUDNN always computes gradients for all inputs
x.stop_gradient = False
......@@ -281,26 +313,37 @@ def grid_sample(x,
if in_dygraph_mode():
return _C_ops.grid_sample(x, grid, mode, padding_mode, align_corners)
elif in_dynamic_mode():
attrs = ('mode', mode, 'padding_mode', padding_mode, 'align_corners',
align_corners, 'use_cudnn', use_cudnn)
attrs = (
'mode',
mode,
'padding_mode',
padding_mode,
'align_corners',
align_corners,
'use_cudnn',
use_cudnn,
)
out = getattr(_legacy_C_ops, 'grid_sampler')(x, grid, *attrs)
else:
helper = LayerHelper("grid_sample", **locals())
check_variable_and_dtype(x, 'x', ['float32', 'float64'], 'grid_sample')
check_variable_and_dtype(grid, 'grid', ['float32', 'float64'],
'grid_sample')
check_variable_and_dtype(
grid, 'grid', ['float32', 'float64'], 'grid_sample'
)
ipts = {'X': x, 'Grid': grid}
attrs = {
'mode': mode,
'padding_mode': padding_mode,
'align_corners': align_corners,
'use_cudnn': use_cudnn
'use_cudnn': use_cudnn,
}
out = helper.create_variable_for_type_inference(x.dtype)
helper.append_op(type='grid_sampler',
inputs=ipts,
attrs=attrs,
outputs={'Output': out})
helper.append_op(
type='grid_sampler',
inputs=ipts,
attrs=attrs,
outputs={'Output': out},
)
return out
......@@ -337,24 +380,25 @@ def pixel_shuffle(x, upscale_factor, data_format="NCHW", name=None):
if data_format not in ["NCHW", "NHWC"]:
raise ValueError(
"Attr(data_format) should be 'NCHW' or 'NHWC'."
"But recevie Attr(data_format): {} ".format(data_format))
"But recevie Attr(data_format): {} ".format(data_format)
)
if in_dygraph_mode():
return _C_ops.pixel_shuffle(x, upscale_factor, data_format)
if _in_legacy_dygraph():
return _legacy_C_ops.pixel_shuffle(x, "upscale_factor", upscale_factor,
"data_format", data_format)
return _legacy_C_ops.pixel_shuffle(
x, "upscale_factor", upscale_factor, "data_format", data_format
)
helper = LayerHelper("pixel_shuffle", **locals())
check_variable_and_dtype(x, 'x', ['float32', 'float64'], 'pixel_shuffle')
out = helper.create_variable_for_type_inference(dtype=x.dtype)
helper.append_op(type="pixel_shuffle",
inputs={"X": x},
outputs={"Out": out},
attrs={
"upscale_factor": upscale_factor,
"data_format": data_format
})
helper.append_op(
type="pixel_shuffle",
inputs={"X": x},
outputs={"Out": out},
attrs={"upscale_factor": upscale_factor, "data_format": data_format},
)
return out
......@@ -384,8 +428,10 @@ def pixel_unshuffle(x, downscale_factor, data_format="NCHW", name=None):
"""
if len(x.shape) != 4:
raise ValueError(
"Input x should be 4D tensor, but received x with the shape of {}".
format(x.shape))
"Input x should be 4D tensor, but received x with the shape of {}".format(
x.shape
)
)
if not isinstance(downscale_factor, int):
raise TypeError("Downscale factor must be int type")
......@@ -396,23 +442,26 @@ def pixel_unshuffle(x, downscale_factor, data_format="NCHW", name=None):
if data_format not in ["NCHW", "NHWC"]:
raise ValueError(
"Attr(data_format) should be 'NCHW' or 'NHWC'."
"But recevie Attr(data_format): {} ".format(data_format))
"But recevie Attr(data_format): {} ".format(data_format)
)
if _non_static_mode():
return _legacy_C_ops.pixel_unshuffle(x, "downscale_factor",
downscale_factor, "data_format",
data_format)
return _legacy_C_ops.pixel_unshuffle(
x, "downscale_factor", downscale_factor, "data_format", data_format
)
helper = LayerHelper("pixel_unshuffle", **locals())
check_variable_and_dtype(x, 'x', ['float32', 'float64'], 'pixel_unshuffle')
out = helper.create_variable_for_type_inference(dtype=x.dtype)
helper.append_op(type="pixel_unshuffle",
inputs={"X": x},
outputs={"Out": out},
attrs={
"downscale_factor": downscale_factor,
"data_format": data_format
})
helper.append_op(
type="pixel_unshuffle",
inputs={"X": x},
outputs={"Out": out},
attrs={
"downscale_factor": downscale_factor,
"data_format": data_format,
},
)
return out
......@@ -453,8 +502,10 @@ def channel_shuffle(x, groups, data_format="NCHW", name=None):
"""
if len(x.shape) != 4:
raise ValueError(
"Input x should be 4D tensor, but received x with the shape of {}".
format(x.shape))
"Input x should be 4D tensor, but received x with the shape of {}".format(
x.shape
)
)
if not isinstance(groups, int):
raise TypeError("groups must be int type")
......@@ -465,20 +516,21 @@ def channel_shuffle(x, groups, data_format="NCHW", name=None):
if data_format not in ["NCHW", "NHWC"]:
raise ValueError(
"Attr(data_format) should be 'NCHW' or 'NHWC'."
"But recevie Attr(data_format): {} ".format(data_format))
"But recevie Attr(data_format): {} ".format(data_format)
)
if _non_static_mode():
return _legacy_C_ops.channel_shuffle(x, "groups", groups, "data_format",
data_format)
return _legacy_C_ops.channel_shuffle(
x, "groups", groups, "data_format", data_format
)
helper = LayerHelper("channel_shuffle", **locals())
check_variable_and_dtype(x, 'x', ['float32', 'float64'], 'channel_shuffle')
out = helper.create_variable_for_type_inference(dtype=x.dtype)
helper.append_op(type="channel_shuffle",
inputs={"X": x},
outputs={"Out": out},
attrs={
"groups": groups,
"data_format": data_format
})
helper.append_op(
type="channel_shuffle",
inputs={"X": x},
outputs={"Out": out},
attrs={"groups": groups, "data_format": data_format},
)
return out
......@@ -215,9 +215,8 @@ class Hardshrink(Layer):
class Hardswish(Layer):
r"""
Hardswish activation
Hardswish is proposed in MobileNetV3, and performs better in computational stability
Hardswish activation. Create a callable object of `Hardswish`. Hardswish
is proposed in MobileNetV3, and performs better in computational stability
and efficiency compared to swish function. For more details please refer
to: https://arxiv.org/pdf/1905.02244.pdf
......@@ -307,7 +306,7 @@ class Tanh(Layer):
class Hardtanh(Layer):
r"""
Hardtanh Activation
Hardtanh Activation. Create a callable object of `Hardtanh`.
.. math::
......@@ -669,7 +668,8 @@ class SELU(Layer):
class LeakyReLU(Layer):
r"""
Leaky ReLU Activation.
Leaky ReLU Activation. Create a callable object of `LeakyReLU` to calculate
the `LeakyReLU` of input `x`.
.. math::
......@@ -696,10 +696,9 @@ class LeakyReLU(Layer):
.. code-block:: python
import paddle
import numpy as np
m = paddle.nn.LeakyReLU()
x = paddle.to_tensor(np.array([-2, 0, 1], 'float32'))
x = paddle.to_tensor([-2.0, 0, 1])
out = m(x) # [-0.02, 0., 1.]
"""
......@@ -717,15 +716,15 @@ class LeakyReLU(Layer):
class Sigmoid(Layer):
"""
r"""
This interface is used to construct a callable object of the ``Sigmoid`` class. This layer calculates the `sigmoid` of input x.
.. math::
Sigmoid(x) = \\frac{1}{1 + e^{-x}}
sigmoid(x) = \frac{1}{1 + e^{-x}}
Parameters:
name (str, optional): Name for the operation (optional, default is None). For more information, please refer to :ref:`api_guide_Name`.
name (str, optional): For details, please refer to :ref:`api_guide_Name`. Generally, no setting is required. Default: None.
Shape:
x: N-D tensor, available dtype is float16, float32, float64.
......@@ -737,11 +736,11 @@ class Sigmoid(Layer):
.. code-block:: python
import paddle
import paddle
m = paddle.nn.Sigmoid()
x = paddle.to_tensor([1.0, 2.0, 3.0, 4.0])
out = m(x) # [0.7310586, 0.880797, 0.95257413, 0.98201376]
m = paddle.nn.Sigmoid()
x = paddle.to_tensor([1.0, 2.0, 3.0, 4.0])
out = m(x) # [0.7310586, 0.880797, 0.95257413, 0.98201376]
"""
def __init__(self, name=None):
......@@ -758,8 +757,8 @@ class Sigmoid(Layer):
class Hardsigmoid(Layer):
r"""
This interface is used to construct a callable object of the ``Hardsigmoid`` class.
This layer calculates the `hardsigmoid` of input x.
``Hardsigmoid`` Activation Layers. Construct a callable object of
the ``Hardsigmoid`` class. This layer calculates the `hardsigmoid` of input x.
A 3-part piecewise linear approximation of sigmoid (https://arxiv.org/abs/1603.00391),
which is much faster than sigmoid.
......@@ -775,7 +774,6 @@ class Hardsigmoid(Layer):
\end{array}
\right.
Parameters:
name (str, optional): Name for the operation (optional, default is None). For more information, please refer to :ref:`api_guide_Name`.
......@@ -813,15 +811,15 @@ class Softplus(Layer):
Softplus Activation
.. math::
Softplus(x) = \frac{1}{beta} * \log(1 + e^{beta * x}) \\
\text{For numerical stability, the implementation reverts to the linear function when: beta * x > threshold.}
softplus(x)=\begin{cases}
\frac{1}{\beta} * \log(1 + e^{\beta * x}),&x\leqslant\frac{\varepsilon}{\beta};\\
x,&x>\frac{\varepsilon}{\beta}.
\end{cases}
Parameters:
beta (float, optional): The value of beta for Softplus. Default is 1
threshold (float, optional): The value of threshold for Softplus. Default is 20
name (str, optional): Name for the operation (optional, default is None).
For more information, please refer to :ref:`api_guide_Name`.
beta (float, optional): The value of :math:`\beta` for Softplus. Default is 1
threshold (float, optional): The value of :math:`\varepsilon` for Softplus. Default is 20
name (str, optional): For details, please refer to :ref:`api_guide_Name`. Generally, no setting is required. Default: None.
Shape:
- input: Tensor with any shape.
......@@ -831,9 +829,8 @@ class Softplus(Layer):
.. code-block:: python
import paddle
import numpy as np
x = paddle.to_tensor(np.array([-0.4, -0.2, 0.1, 0.3]))
x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3], dtype='float32')
m = paddle.nn.Softplus()
out = m(x) # [0.513015, 0.598139, 0.744397, 0.854355]
"""
......@@ -1124,16 +1121,17 @@ class ThresholdedReLU(Layer):
class Silu(Layer):
"""
Silu Activation.
r"""
Silu Activation
.. math::
Silu(x) = \frac{x}{1 + e^{-x}}
silu(x) = \frac{x}{1 + \mathrm{e}^{-x}}
Where :math:`x` is the input Tensor.
Parameters:
x (Tensor): The input Tensor with data type float32, or float64.
name (str, optional): Name for the operation (optional, default is None).
For more information, please refer to :ref:`api_guide_Name`.
name (str, optional): For details, please refer to :ref:`api_guide_Name`. Generally, no setting is required. Default: None.
Shape:
- input: Tensor with any shape.
......@@ -1294,15 +1292,13 @@ class Softmax(Layer):
.. code-block:: python
import paddle
import numpy as np
x = np.array([[[2.0, 3.0, 4.0, 5.0],
x = paddle.to_tensor([[[2.0, 3.0, 4.0, 5.0],
[3.0, 4.0, 5.0, 6.0],
[7.0, 8.0, 8.0, 9.0]],
[[1.0, 2.0, 3.0, 4.0],
[5.0, 6.0, 7.0, 8.0],
[6.0, 7.0, 8.0, 9.0]]], 'float32')
x = paddle.to_tensor(x)
[6.0, 7.0, 8.0, 9.0]]], dtype='float32')
m = paddle.nn.Softmax()
out = m(x)
# [[[0.0320586 , 0.08714432, 0.23688282, 0.64391426],
......@@ -1387,7 +1383,7 @@ class LogSoftmax(Layer):
class Maxout(Layer):
r"""
Maxout Activation.
Maxout Activation. Create a callable object of `Maxout`.
Assumed the input shape is (N, Ci, H, W).
The output shape is (N, Co, H, W).
......
......@@ -360,22 +360,6 @@ class Upsample(Layer):
A 3-D Tensor of the shape (num_batches, channels, out_w) or (num_batches, out_w, channels),
A 4-D Tensor of the shape (num_batches, channels, out_h, out_w) or (num_batches, out_h, out_w, channels),
or 5-D Tensor of the shape (num_batches, channels, out_d, out_h, out_w) or (num_batches, out_d, out_h, out_w, channels).
Raises:
TypeError: size should be a list or tuple or Tensor.
ValueError: The 'mode' of image_resize can only be 'linear', 'bilinear',
'trilinear', 'bicubic', or 'nearest' currently.
ValueError: 'linear' only support 3-D tensor.
ValueError: 'bilinear' and 'bicubic' only support 4-D tensor.
ValueError: 'trilinear' only support 5-D tensor.
ValueError: 'nearest' only support 4-D or 5-D tensor.
ValueError: One of size and scale_factor must not be None.
ValueError: size length should be 1 for input 3-D tensor.
ValueError: size length should be 2 for input 4-D tensor.
ValueError: size length should be 3 for input 5-D tensor.
ValueError: scale_factor should be greater than zero.
TypeError: align_corners should be a bool value
ValueError: align_mode can only be '0' or '1'
ValueError: data_format can only be 'NCW', 'NWC', 'NCHW', 'NHWC', 'NCDHW' or 'NDHWC'.
Examples:
.. code-block:: python
......
......@@ -305,9 +305,6 @@ class Conv1D(_ConvNd):
- weight: 3-D tensor with shape: (out_channels, in_channels, kernel_size)
- bias: 1-D tensor with shape: (out_channels)
- output: 3-D tensor with same shape as input x.
Raises:
None
Examples:
.. code-block:: python
......@@ -986,10 +983,6 @@ class Conv3D(_ConvNd):
W_{out}&= \frac{(W_{in} + 2 * paddings[2] - (dilations[2] * (kernel\_size[2] - 1) + 1))}{strides[2]} + 1
Raises:
ValueError: If the shapes of input, filter_size, stride, padding and
groups mismatch.
Examples:
.. code-block:: python
......@@ -1171,10 +1164,6 @@ class Conv3DTranspose(_ConvNd):
H^\prime_{out} &= (H_{in} - 1) * strides[1] - 2 * paddings[1] + dilations[1] * (kernel\_size[1] - 1) + 1
W^\prime_{out} &= (W_{in} - 1) * strides[2] - 2 * paddings[2] + dilations[2] * (kernel\_size[2] - 1) + 1
Raises:
ValueError: If the shapes of input, filter_size, stride, padding and
groups mismatch.
Examples:
.. code-block:: python
......
......@@ -590,15 +590,11 @@ class MSELoss(Layer):
Examples:
.. code-block:: python
import numpy as np
import paddle
input_data = np.array([1.5]).astype("float32")
label_data = np.array([1.7]).astype("float32")
mse_loss = paddle.nn.loss.MSELoss()
input = paddle.to_tensor(input_data)
label = paddle.to_tensor(label_data)
input = paddle.to_tensor([1.5])
label = paddle.to_tensor([1.7])
output = mse_loss(input, label)
print(output)
# [0.04000002]
......@@ -638,10 +634,10 @@ class MSELoss(Layer):
class L1Loss(Layer):
r"""
This interface is used to construct a callable object of the ``L1Loss`` class.
Construct a callable object of the ``L1Loss`` class.
The L1Loss layer calculates the L1 Loss of ``input`` and ``label`` as follows.
If `reduction` set to ``'none'``, the loss is:
If `reduction` set to ``'none'``, the loss is:
.. math::
Out = \lvert input - label\rvert
......@@ -677,12 +673,9 @@ class L1Loss(Layer):
.. code-block:: python
import paddle
import numpy as np
input_data = np.array([[1.5, 0.8], [0.2, 1.3]]).astype("float32")
label_data = np.array([[1.7, 1], [0.4, 0.5]]).astype("float32")
input = paddle.to_tensor(input_data)
label = paddle.to_tensor(label_data)
input = paddle.to_tensor([[1.5, 0.8], [0.2, 1.3]])
label = paddle.to_tensor([[1.7, 1], [0.4, 0.5]])
l1_loss = paddle.nn.L1Loss()
output = l1_loss(input, label)
......@@ -921,9 +914,10 @@ class NLLLoss(Layer):
class KLDivLoss(Layer):
r"""
This interface calculates the Kullback-Leibler divergence loss
between Input(X) and Input(Target). Note that Input(X) is the
log-probability and Input(Target) is the probability.
Generate a callable object of 'KLDivLoss' to calculate the
Kullback-Leibler divergence loss between Input(X) and
Input(Target). Note that Input(X) is the
and Input(Target) is the probability.
KL divergence loss is calculated as follows:
......@@ -951,35 +945,30 @@ class KLDivLoss(Layer):
.. code-block:: python
import paddle
import numpy as np
import paddle.nn as nn
shape = (5, 20)
x = np.random.uniform(-10, 10, shape).astype('float32')
target = np.random.uniform(-10, 10, shape).astype('float32')
x = paddle.uniform(shape, min=-10, max=10).astype('float32')
target = paddle.uniform(shape, min=-10, max=10).astype('float32')
# 'batchmean' reduction, loss shape will be [1]
kldiv_criterion = nn.KLDivLoss(reduction='batchmean')
pred_loss = kldiv_criterion(paddle.to_tensor(x),
paddle.to_tensor(target))
pred_loss = kldiv_criterion(x, target)
# shape=[1]
# 'mean' reduction, loss shape will be [1]
kldiv_criterion = nn.KLDivLoss(reduction='mean')
pred_loss = kldiv_criterion(paddle.to_tensor(x),
paddle.to_tensor(target))
pred_loss = kldiv_criterion(x, target)
# shape=[1]
# 'sum' reduction, loss shape will be [1]
kldiv_criterion = nn.KLDivLoss(reduction='sum')
pred_loss = kldiv_criterion(paddle.to_tensor(x),
paddle.to_tensor(target))
pred_loss = kldiv_criterion(x, target)
# shape=[1]
# 'none' reduction, loss shape is same with X shape
kldiv_criterion = nn.KLDivLoss(reduction='none')
pred_loss = kldiv_criterion(paddle.to_tensor(x),
paddle.to_tensor(target))
pred_loss = kldiv_criterion(x, target)
# shape=[5, 20]
"""
......@@ -1171,16 +1160,16 @@ class SmoothL1Loss(Layer):
.. math::
loss(x,y) = \frac{1}{n}\sum_{i}z_i
loss(x, y) = \frac{1}{n}\sum_{i}z_i
where z_i is given by:
where :math:`z_i` is given by:
.. math::
\mathop{z_i} = \left\{\begin{array}{rcl}
0.5(x_i - y_i)^2 & & {if |x_i - y_i| < delta} \\
delta * |x_i - y_i| - 0.5 * delta^2 & & {otherwise}
\end{array} \right.
0.5(x_i - y_i)^2 & & {if |x_i - y_i| < \delta} \\
\delta * |x_i - y_i| - 0.5 * \delta^2 & & {otherwise}
\end{array} \right.
Parameters:
reduction (str, optional): Indicate how to average the loss by batch_size,
......@@ -1189,12 +1178,11 @@ class SmoothL1Loss(Layer):
If :attr:`reduction` is ``'sum'``, the reduced sum loss is returned.
If :attr:`reduction` is ``'none'``, the unreduced loss is returned.
Default is ``'mean'``.
delta (float, optional): Specifies the hyperparameter delta to be used.
delta (float, optional): Specifies the hyperparameter :math:`\delta` to be used.
The value determines how large the errors need to be to use L1. Errors
smaller than delta are minimized with L2. Parameter is ignored for
negative/zero values. Default = 1.0
name (str, optional): Name for the operation (optional, default is
None). For more information, please refer to :ref:`api_guide_Name`.
negative/zero values. Default value is :math:`1.0`.
name (str, optional): For details, please refer to :ref:`api_guide_Name`. Generally, no setting is required. Default: None.
Call Parameters:
......@@ -1212,14 +1200,12 @@ class SmoothL1Loss(Layer):
.. code-block:: python
import paddle
import numpy as np
input_data = np.random.rand(3,3).astype("float32")
label_data = np.random.rand(3,3).astype("float32")
input = paddle.to_tensor(input_data)
label = paddle.to_tensor(label_data)
input = paddle.rand([3, 3]).astype("float32")
label = paddle.rand([3, 3]).astype("float32")
loss = paddle.nn.SmoothL1Loss()
output = loss(input, label)
print(output)
# [0.049606]
"""
def __init__(self, reduction='mean', delta=1.0, name=None):
......@@ -1321,7 +1307,7 @@ class MultiLabelSoftMarginLoss(Layer):
class HingeEmbeddingLoss(Layer):
r"""
This operator calculates hinge_embedding_loss. Measures the loss given an input tensor :math:`x` and a labels tensor :math:`y`(containing 1 or -1).
Create a callable object of `HingeEmbeddingLoss` to calculate hinge_embedding_loss. Measures the loss given an input tensor :math:`x` and a label tensor :math:`y` (containing 1 or -1).
This is usually used for measuring whether two inputs are similar or dissimilar, e.g. using the L1 pairwise distance as :math:`x`,
and is typically used for learning nonlinear embeddings or semi-supervised learning.
......
......@@ -117,7 +117,7 @@ class _InstanceNormBase(Layer):
class InstanceNorm1D(_InstanceNormBase):
r"""
Applies Instance Normalization over a 3D input (a mini-batch of 1D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization .
Create a callable object of `InstanceNorm1D`. Applies Instance Normalization over a 3D input (a mini-batch of 1D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization .
DataLayout: NCL `[batch, in_channels, length]`
......@@ -133,8 +133,7 @@ class InstanceNorm1D(_InstanceNormBase):
\sigma_{\beta}^{2} + \epsilon}} \qquad &//\ normalize \\
y_i &\gets \gamma \hat{x_i} + \beta \qquad &//\ scale\ and\ shift
Note:
`H` means height of feature map, `W` means width of feature map.
Where `H` means height of feature map, `W` means width of feature map.
Parameters:
num_features(int): Indicate the number of channels of the input ``Tensor``.
......@@ -168,11 +167,8 @@ class InstanceNorm1D(_InstanceNormBase):
.. code-block:: python
import paddle
import numpy as np
np.random.seed(123)
x_data = np.random.random(size=(2, 2, 3)).astype('float32')
x = paddle.to_tensor(x_data)
x = paddle.rand((2, 2, 3))
instance_norm = paddle.nn.InstanceNorm1D(2)
instance_norm_out = instance_norm(x)
......@@ -191,7 +187,7 @@ class InstanceNorm1D(_InstanceNormBase):
class InstanceNorm2D(_InstanceNormBase):
r"""
Applies Instance Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization .
Create a callable object of `InstanceNorm2D`. Applies Instance Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization .
DataLayout: NCHW `[batch, in_channels, in_height, in_width]`
......@@ -208,8 +204,7 @@ class InstanceNorm2D(_InstanceNormBase):
\sigma_{\beta}^{2} + \epsilon}} \qquad &//\ normalize \\
y_i &\gets \gamma \hat{x_i} + \beta \qquad &//\ scale\ and\ shift
Note:
`H` means height of feature map, `W` means width of feature map.
Where `H` means height of feature map, `W` means width of feature map.
Parameters:
num_features(int): Indicate the number of channels of the input ``Tensor``.
......@@ -242,11 +237,8 @@ class InstanceNorm2D(_InstanceNormBase):
.. code-block:: python
import paddle
import numpy as np
np.random.seed(123)
x_data = np.random.random(size=(2, 2, 2, 3)).astype('float32')
x = paddle.to_tensor(x_data)
x = paddle.rand((2, 2, 2, 3))
instance_norm = paddle.nn.InstanceNorm2D(2)
instance_norm_out = instance_norm(x)
......@@ -262,7 +254,7 @@ class InstanceNorm2D(_InstanceNormBase):
class InstanceNorm3D(_InstanceNormBase):
r"""
Applies Instance Normalization over a 5D input (a mini-batch of 3D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization .
Create a callable object of `InstanceNorm3D`. Applies Instance Normalization over a 5D input (a mini-batch of 3D inputs with additional channel dimension) as described in the paper Instance Normalization: The Missing Ingredient for Fast Stylization .
DataLayout: NCHW `[batch, in_channels, D, in_height, in_width]`
......@@ -279,8 +271,7 @@ class InstanceNorm3D(_InstanceNormBase):
\sigma_{\beta}^{2} + \epsilon}} \qquad &//\ normalize \\
y_i &\gets \gamma \hat{x_i} + \beta \qquad &//\ scale\ and\ shift
Note:
`H` means height of feature map, `W` means width of feature map.
Where `H` means height of feature map, `W` means width of feature map.
Parameters:
num_features(int): Indicate the number of channels of the input ``Tensor``.
......@@ -313,11 +304,8 @@ class InstanceNorm3D(_InstanceNormBase):
.. code-block:: python
import paddle
import numpy as np
np.random.seed(123)
x_data = np.random.random(size=(2, 2, 2, 2, 3)).astype('float32')
x = paddle.to_tensor(x_data)
x = paddle.rand((2, 2, 2, 2, 3))
instance_norm = paddle.nn.InstanceNorm3D(2)
instance_norm_out = instance_norm(x)
......@@ -494,11 +482,7 @@ class GroupNorm(Layer):
class LayerNorm(Layer):
r"""
:alias_main: paddle.nn.LayerNorm
:alias: paddle.nn.LayerNorm,paddle.nn.layer.LayerNorm,paddle.nn.layer.norm.LayerNorm
:old_api: paddle.fluid.dygraph.LayerNorm
This interface is used to construct a callable object of the ``LayerNorm`` class.
Construct a callable object of the ``LayerNorm`` class.
For more details, refer to code examples.
It implements the function of the Layer Normalization Layer and can be applied to mini-batch input data.
Refer to `Layer Normalization <https://arxiv.org/pdf/1607.06450v1.pdf>`_
......@@ -546,12 +530,9 @@ class LayerNorm(Layer):
.. code-block:: python
import paddle
import numpy as np
np.random.seed(123)
x_data = np.random.random(size=(2, 2, 2, 3)).astype('float32')
x = paddle.to_tensor(x_data)
layer_norm = paddle.nn.LayerNorm(x_data.shape[1:])
x = paddle.rand((2, 2, 2, 3))
layer_norm = paddle.nn.LayerNorm(x.shape[1:])
layer_norm_out = layer_norm(x)
print(layer_norm_out)
......@@ -819,11 +800,8 @@ class BatchNorm1D(_BatchNormBase):
.. code-block:: python
import paddle
import numpy as np
np.random.seed(123)
x_data = np.random.random(size=(2, 1, 3)).astype('float32')
x = paddle.to_tensor(x_data)
x = paddle.rand((2, 1, 3))
batch_norm = paddle.nn.BatchNorm1D(1)
batch_norm_out = batch_norm(x)
......@@ -934,11 +912,8 @@ class BatchNorm2D(_BatchNormBase):
.. code-block:: python
import paddle
import numpy as np
np.random.seed(123)
x_data = np.random.random(size=(2, 1, 2, 3)).astype('float32')
x = paddle.to_tensor(x_data)
x = paddle.rand((2, 1, 2, 3))
batch_norm = paddle.nn.BatchNorm2D(1)
batch_norm_out = batch_norm(x)
......@@ -1023,11 +998,8 @@ class BatchNorm3D(_BatchNormBase):
.. code-block:: python
import paddle
import numpy as np
np.random.seed(123)
x_data = np.random.random(size=(2, 1, 2, 2, 3)).astype('float32')
x = paddle.to_tensor(x_data)
x = paddle.rand((2, 1, 2, 2, 3))
batch_norm = paddle.nn.BatchNorm3D(1)
batch_norm_out = batch_norm(x)
......
......@@ -355,14 +355,6 @@ class MaxPool1D(Layer):
Returns:
A callable object of MaxPool1D.
Raises:
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is "VALID", but `ceil_mode` is True.
ValueError: If `padding` is a list or tuple but its length greater than 1.
ShapeError: If the input is not a 3-D.
ShapeError: If the output's shape calculated is not greater than 0.
Shape:
- x(Tensor): The input tensor of max pool1d operator, which is a 3-D tensor.
The data type can be float32, float64.
......@@ -468,10 +460,6 @@ class MaxPool2D(Layer):
Returns:
A callable object of MaxPool2D.
Raises:
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is "VALID", but `ceil_mode` is True.
ShapeError: If the output's shape calculated is not greater than 0.
Shape:
- x(Tensor): The input tensor of max pool2d operator, which is a 4-D tensor.
......@@ -569,10 +557,6 @@ class MaxPool3D(Layer):
Returns:
A callable object of MaxPool3D.
Raises:
ValueError: If `padding` is a string, but not "SAME" or "VALID".
ValueError: If `padding` is "VALID", but `ceil_mode` is True.
ShapeError: If the output's shape calculated is not greater than 0.
Shape:
- x(Tensor): The input tensor of max pool3d operator, which is a 5-D tensor.
......@@ -904,9 +888,6 @@ class AdaptiveMaxPool1D(Layer):
Returns:
A callable object of AdaptiveMaxPool1D.
Raises:
ValueError: 'pool_size' should be a integer or list or tuple with length as 1.
Shape:
- x(Tensor): The input tensor of adaptive max pool1d operator, which is a 3-D tensor.
The data type can be float32, float64.
......
......@@ -141,7 +141,7 @@ def spectral_norm(
layer, name='weight', n_power_iterations=1, eps=1e-12, dim=None
):
r"""
This spectral_norm layer applies spectral normalization to a parameter according to the
Applies spectral normalization to a parameter according to the
following calculation:
Step 1:
......@@ -179,7 +179,7 @@ def spectral_norm(
dim(int, optional): The index of dimension which should be permuted to the first before reshaping Input(Weight) to matrix, it should be set as 0 if Input(Weight) is the weight of fc layer, and should be set as 1 if Input(Weight) is the weight of conv layer. Default: None.
Returns:
The original layer with the spectral norm hook
Layer, the original layer with the spectral norm hook.
Examples:
.. code-block:: python
......
......@@ -164,7 +164,7 @@ class WeightNorm(object):
def weight_norm(layer, name='weight', dim=0):
r"""
This weight_norm layer applies weight normalization to a parameter according to the
Applies weight normalization to a parameter according to the
following formula:
.. math::
......@@ -193,11 +193,9 @@ def weight_norm(layer, name='weight', dim=0):
Examples:
.. code-block:: python
import numpy as np
from paddle.nn import Conv2D
from paddle.nn.utils import weight_norm
x = np.array([[[[0.3, 0.4], [0.3, 0.07]], [[0.83, 0.37], [0.18, 0.93]]]]).astype('float32')
conv = Conv2D(3, 5, 3)
wn = weight_norm(conv)
print(conv.weight_g.shape)
......@@ -218,7 +216,7 @@ def remove_weight_norm(layer, name='weight'):
name(str, optional): Name of the weight parameter. Default: 'weight'.
Returns:
Origin layer without weight norm
Layer, the origin layer without weight norm
Examples:
.. code-block:: python
......
......@@ -1622,7 +1622,6 @@ class MultiplicativeDecay(LRScheduler):
.. code-block:: python
import paddle
import numpy as np
# train on default dynamic graph mode
linear = paddle.nn.Linear(10, 10)
......@@ -1937,7 +1936,7 @@ class CyclicLR(LRScheduler):
verbose: (bool, optional): If ``True``, prints a message to stdout for each update. Default: ``False`` .
Returns:
``CyclicLR`` instance to schedule learning rate.
``CyclicLR`` instance to schedule learning rate.
Examples:
.. code-block:: python
......
......@@ -71,12 +71,12 @@ class RMSProp(Optimizer):
Parameters:
learning_rate (float|LRScheduler): The learning rate used to update ``Parameter``.
It can be a float value or a LRScheduler.
rho(float): rho is :math:`\rho` in equation, default is 0.95.
epsilon(float): :math:`\epsilon` in equation is smoothing term to
rho(float, optional): rho is :math:`\rho` in equation, default is 0.95.
epsilon(float, optional): :math:`\epsilon` in equation is smoothing term to
avoid division by zero, default is 1e-6.
momentum(float): :math:`\beta` in equation is the momentum term,
momentum(float, optional): :math:`\beta` in equation is the momentum term,
default is 0.0.
centered(bool): If True, gradients are normalized by the estimated variance of
centered(bool, optional): If True, gradients are normalized by the estimated variance of
the gradient; if False, by the uncentered second moment. Setting this to
True may help with training, but is slightly more expensive in terms of
computation and memory. Defaults to False.
......@@ -100,9 +100,6 @@ class RMSProp(Optimizer):
name (str, optional): This parameter is used by developers to print debugging information.
For details, please refer to :ref:`api_guide_Name`. Default is None.
Raises:
ValueError: If learning_rate, rho, epsilon, momentum are None.
Examples:
.. code-block:: python
......
此差异已折叠。
......@@ -18,16 +18,19 @@ import functools
from contextlib import ContextDecorator
from paddle.fluid import core
from paddle.fluid.core import (_RecordEvent, TracerEventType)
from paddle.fluid.core import _RecordEvent, TracerEventType
_is_profiler_used = False
_has_optimizer_wrapped = False
_AllowedEventTypeList = [
TracerEventType.Dataloader, TracerEventType.ProfileStep,
TracerEventType.Forward, TracerEventType.Backward,
TracerEventType.Optimization, TracerEventType.PythonOp,
TracerEventType.PythonUserDefined
TracerEventType.Dataloader,
TracerEventType.ProfileStep,
TracerEventType.Forward,
TracerEventType.Backward,
TracerEventType.Optimization,
TracerEventType.PythonOp,
TracerEventType.PythonUserDefined,
]
......@@ -36,8 +39,10 @@ class RecordEvent(ContextDecorator):
Interface for recording a user-defined time range.
Args:
name(str): Name of the record event
event_type(TracerEventType, optional): Optional, default value is TracerEventType.PythonUserDefined. It is reserved for internal purpose, and it is better not to specify this parameter.
name (str): Name of the record event.
event_type (TracerEventType, optional): Optional, default value is
`TracerEventType.PythonUserDefined`. It is reserved for internal
purpose, and it is better not to specify this parameter.
Examples:
.. code-block:: python
......@@ -59,13 +64,14 @@ class RecordEvent(ContextDecorator):
record_event.end()
**Note**:
RecordEvent will take effect only when :ref:`Profiler <api_paddle_profiler_Profiler>` is on and at the state of RECORD.
RecordEvent will take effect only when :ref:`Profiler <api_paddle_profiler_Profiler>` is on and at the state of `RECORD`.
"""
def __init__(
self,
name: str,
event_type: TracerEventType = TracerEventType.PythonUserDefined):
self,
name: str,
event_type: TracerEventType = TracerEventType.PythonUserDefined,
):
self.name = name
self.event_type = event_type
self.event = None
......@@ -98,8 +104,12 @@ class RecordEvent(ContextDecorator):
if not _is_profiler_used:
return
if self.event_type not in _AllowedEventTypeList:
warn("Only TracerEvent Type in [{}, {}, {}, {}, {}, {},{}]\
can be recorded.".format(*_AllowedEventTypeList))
warn(
"Only TracerEvent Type in [{}, {}, {}, {}, {}, {},{}]\
can be recorded.".format(
*_AllowedEventTypeList
)
)
self.event = None
else:
self.event = _RecordEvent(self.name, self.event_type)
......@@ -134,7 +144,7 @@ def load_profiler_result(filename: str):
filename(str): Name of the exported protobuf file of profiler data.
Returns:
ProfilerResult object, which stores profiling data.
``ProfilerResult`` object, which stores profiling data.
Examples:
.. code-block:: python
......@@ -158,14 +168,13 @@ def in_profiler_mode():
def wrap_optimizers():
def optimizer_warpper(func):
@functools.wraps(func)
def warpper(*args, **kwargs):
if in_profiler_mode():
with RecordEvent('Optimization Step',
event_type=TracerEventType.Optimization):
with RecordEvent(
'Optimization Step', event_type=TracerEventType.Optimization
):
return func(*args, **kwargs)
else:
return func(*args, **kwargs)
......@@ -176,6 +185,7 @@ def wrap_optimizers():
if _has_optimizer_wrapped == True:
return
import paddle.optimizer as optimizer
for classname in optimizer.__all__:
if classname != 'Optimizer':
classobject = getattr(optimizer, classname)
......
......@@ -264,9 +264,6 @@ def compose(*readers, **kwargs):
Returns:
the new data reader (Reader).
Raises:
ComposeNotAligned: outputs of readers are not aligned. This will not raise if check_alignment is set to False.
Examples:
.. code-block:: python
......
......@@ -98,12 +98,6 @@ def sparse_coo_tensor(
Returns:
Tensor: A Tensor constructed from ``indices`` and ``values`` .
Raises:
TypeError: If the data type of ``values`` is not list, tuple, numpy.ndarray, paddle.Tensor
ValueError: If ``values`` is tuple|list, it can't contain nested tuple|list with different lengths , such as: [[1, 2], [3, 4, 5]]. If the ``indices`` is not a 2-D.
TypeError: If ``dtype`` is not bool, float16, float32, float64, int8, int16, int32, int64, uint8, complex64, complex128
ValueError: If ``place`` is not paddle.CPUPlace, paddle.CUDAPinnedPlace, paddle.CUDAPlace or specified pattern string.
Examples:
.. code-block:: python
......@@ -222,12 +216,6 @@ def sparse_csr_tensor(
Returns:
Tensor: A Tensor constructed from ``crows``, ``cols`` and ``values`` .
Raises:
TypeError: If the data type of ``values`` is not list, tuple, numpy.ndarray, paddle.Tensor
ValueError: If ``values`` is tuple|list, it can't contain nested tuple|list with different lengths , such as: [[1, 2], [3, 4, 5]]. If the ``crow``, ``cols`` and ``values`` is not a 2-D.
TypeError: If ``dtype`` is not bool, float16, float32, float64, int8, int16, int32, int64, uint8, complex64, complex128
ValueError: If ``place`` is not paddle.CPUPlace, paddle.CUDAPinnedPlace, paddle.CUDAPlace or specified pattern string.
Examples:
.. code-block:: python
......
......@@ -143,11 +143,6 @@ def normalize_program(program, feed_vars, fetch_vars):
Returns:
Program: Normalized/Optimized program.
Raises:
TypeError: If `program` is not a Program, an exception is thrown.
TypeError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
TypeError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
Examples:
.. code-block:: python
......@@ -285,10 +280,6 @@ def serialize_program(feed_vars, fetch_vars, **kwargs):
Returns:
bytes: serialized program.
Raises:
ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
Examples:
.. code-block:: python
......@@ -348,10 +339,6 @@ def serialize_persistables(feed_vars, fetch_vars, executor, **kwargs):
Returns:
bytes: serialized program.
Raises:
ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
Examples:
.. code-block:: python
......@@ -497,10 +484,6 @@ def save_inference_model(
Returns:
None
Raises:
ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
Examples:
.. code-block:: python
......@@ -783,9 +766,6 @@ def load_inference_model(path_prefix, executor, **kwargs):
``Variable`` (refer to :ref:`api_guide_Program_en`). It contains variables from which
we can get inference results.
Raises:
ValueError: If `path_prefix.pdmodel` or `path_prefix.pdiparams` doesn't exist.
Examples:
.. code-block:: python
......
......@@ -124,9 +124,6 @@ def fc(
Returns:
Tensor, its shape is :math:`[batch\_size, *, size]` , and the data type is same with input.
Raises:
ValueError: If dimensions of the input tensor is less than 2.
Examples:
.. code-block:: python
......@@ -281,9 +278,7 @@ def deform_conv2d(
Returns:
Tensor: The tensor storing the deformable convolution \
result. A Tensor with type float32, float64.
Raises:
ValueError: If the shapes of input, filter_size, stride, padding and
groups mismatch.
Examples:
.. code-block:: python
......
......@@ -282,7 +282,7 @@ def norm(x, p='fro', axis=None, keepdim=False, name=None):
Returns the matrix norm (Frobenius) or vector norm (the 1-norm, the Euclidean
or 2-norm, and in general the p-norm for p > 0) of a given tensor.
.. note::
Note:
This norm API is different from `numpy.linalg.norm`.
This API supports high-order input tensors (rank >= 3), and a certain axis needs to be specified to calculate the norm.
But `numpy.linalg.norm` only supports 1-D vector or 2-D matrix as input tensor.
......@@ -1170,7 +1170,7 @@ def dot(x, y, name=None):
"""
This operator calculates inner product for vectors.
.. note::
Note:
Support 1-d and 2-d Tensor. When it is 2d, the first dimension of this matrix
is the batch dimension, which means that the vectors of multiple batches are dotted.
......@@ -1516,10 +1516,12 @@ def cholesky(x, upper=False, name=None):
Its data type should be float32 or float64.
upper (bool): The flag indicating whether to return upper or lower
triangular matrices. Default: False.
name (str, optional): Name for the operation (optional, default is None).
For more information, please refer to :ref:`api_guide_Name`.
Returns:
Tensor: A Tensor with same shape and data type as `x`. It represents \
triangular matrices generated by Cholesky decomposition.
Tensor, A Tensor with same shape and data type as `x`. It represents
triangular matrices generated by Cholesky decomposition.
Examples:
.. code-block:: python
......@@ -1911,24 +1913,27 @@ def mv(x, vec, name=None):
def det(x, name=None):
"""
Calculates determinant value of a square matrix or batches of square matrices.
Args:
x (Tensor): input (Tensor): the input matrix of size `(n, n)` or the batch of matrices of size
`(*, n, n)` where `*` is one or more batch dimensions.
x (Tensor): input (Tensor): the input matrix of size `(n, n)` or the
batch of matrices of size `(*, n, n)` where `*` is one or more
batch dimensions.
Returns:
y (Tensor):the determinant value of a square matrix or batches of square matrices.
Tensor, the determinant value of a square matrix or batches of square matrices.
Examples:
.. code-block:: python
import paddle
import paddle
x = paddle.randn([3,3,3])
x = paddle.randn([3,3,3])
A = paddle.linalg.det(x)
A = paddle.linalg.det(x)
print(A)
print(A)
# [ 0.02547996, 2.52317095, -6.15900707])
# [ 0.02547996, 2.52317095, -6.15900707])
"""
......@@ -1978,18 +1983,18 @@ def slogdet(x, name=None):
of the absolute value of determinant, respectively.
Examples:
.. code-block:: python
.. code-block:: python
import paddle
import paddle
x = paddle.randn([3,3,3])
x = paddle.randn([3,3,3])
A = paddle.linalg.slogdet(x)
A = paddle.linalg.slogdet(x)
print(A)
print(A)
# [[ 1. , 1. , -1. ],
# [-0.98610914, -0.43010661, -0.10872950]])
# [[ 1. , 1. , -1. ],
# [-0.98610914, -0.43010661, -0.10872950]])
"""
if in_dygraph_mode():
......@@ -2102,13 +2107,11 @@ def matrix_power(x, n, name=None):
Specifically,
- If `n > 0`, it returns the matrix or a batch of matrices raised to the power
of `n`.
- If `n > 0`, it returns the matrix or a batch of matrices raised to the power of `n`.
- If `n = 0`, it returns the identity matrix or a batch of identity matrices.
- If `n < 0`, it returns the inverse of each matrix (if invertible) raised to
the power of `abs(n)`.
- If `n < 0`, it returns the inverse of each matrix (if invertible) raised to the power of `abs(n)`.
Args:
x (Tensor): A square matrix or a batch of square matrices to be raised
......@@ -2243,10 +2246,12 @@ def lu(x, pivot=True, get_infos=False, name=None):
Pivoting is done if pivot is set to True.
The permutation matrix P can be obtained from pivots:
# ones = eye(rows) #eye matrix of rank rows
# for i in range(cols):
# swap(ones[i], ones[pivots[i]])
# return ones
.. code-block:: text
ones = eye(rows) #eye matrix of rank rows
for i in range(cols):
swap(ones[i], ones[pivots[i]])
return ones
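A runnable sketch of the same reconstruction (illustrative only; it follows the pseudocode above
verbatim, so `pivots` is assumed to be a Python sequence using the same 0-based indexing as the
loop above -- subtract 1 first if your pivots are 1-based):
.. code-block:: python
    import paddle
    def pivots_to_permutation(pivots, rows):
        # apply the row swaps described by `pivots` to a list of row indices,
        # then materialize P by reordering the rows of an identity matrix
        order = list(range(rows))
        for i, p in enumerate(pivots):
            order[i], order[p] = order[p], order[i]
        eye = paddle.eye(rows)
        return paddle.index_select(eye, paddle.to_tensor(order), axis=0)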
Args:
......@@ -2260,15 +2265,15 @@ def lu(x, pivot=True, get_infos=False, name=None):
For more information, please refer to :ref:`api_guide_Name`.
Returns:
factorization (Tensor): LU matrix, the factorization of input X.
factorization (Tensor), LU matrix, the factorization of input X.
pivots (IntTensor): the pivots of size(∗(N-2), min(m,n)). `pivots` stores all the
intermediate transpositions of rows. The final permutation `perm` could be
reconstructed by this, details refer to upper example.
pivots (IntTensor), the pivots of size(∗(N-2), min(m,n)). `pivots` stores all the
intermediate transpositions of rows. The final permutation `perm` could be
reconstructed by this, details refer to upper example.
infos (IntTensor, optional): if `get_infos` is `True`, this is a tensor of size (∗(N-2))
where non-zero values indicate whether factorization for the matrix or each minibatch
has succeeded or failed.
infos (IntTensor, optional), if `get_infos` is `True`, this is a tensor of size (∗(N-2))
where non-zero values indicate whether factorization for the matrix or each minibatch
has succeeded or failed.
Examples:
......@@ -2342,9 +2347,11 @@ def lu_unpack(x, y, unpack_ludata=True, unpack_pivots=True, name=None):
unpack L and U matrix from LU, unpack permutation matrix P from Pivtos .
P mat can be get by pivots:
# ones = eye(rows) #eye matrix of rank rows
# for i in range(cols):
# swap(ones[i], ones[pivots[i]])
.. code-block:: text
ones = eye(rows) #eye matrix of rank rows
for i in range(cols):
swap(ones[i], ones[pivots[i]])
Args:
......@@ -2360,11 +2367,11 @@ def lu_unpack(x, y, unpack_ludata=True, unpack_pivots=True, name=None):
For more information, please refer to :ref:`api_guide_Name`.
Returns:
P (Tensor): Permutation matrix P of lu factorization.
P (Tensor), Permutation matrix P of lu factorization.
L (Tensor): The lower triangular matrix tensor of lu factorization.
L (Tensor), The lower triangular matrix tensor of lu factorization.
U (Tensor): The upper triangular matrix tensor of lu factorization.
U (Tensor), The upper triangular matrix tensor of lu factorization.
Examples:
......@@ -2437,14 +2444,14 @@ def lu_unpack(x, y, unpack_ludata=True, unpack_pivots=True, name=None):
def eig(x, name=None):
"""
This API performs the eigenvalue decomposition of a square matrix or a batch of square matrices.
Performs the eigenvalue decomposition of a square matrix or a batch of square matrices.
.. note::
If the matrix is a Hermitian or a real symmetric matrix, please use :ref:`paddle.linalg.eigh` instead, which is much faster.
If only eigenvalues is needed, please use :ref:`paddle.linalg.eigvals` instead.
If the matrix is of any shape, please use :ref:`paddle.linalg.svd`.
This API is only supported on CPU device.
The output datatype is always complex for both real and complex input.
Note:
- If the matrix is a Hermitian or a real symmetric matrix, please use :ref:`paddle.linalg.eigh` instead, which is much faster.
- If only eigenvalues is needed, please use :ref:`paddle.linalg.eigvals` instead.
- If the matrix is of any shape, please use :ref:`paddle.linalg.svd`.
- This API is only supported on CPU device.
- The output datatype is always complex for both real and complex input.
Args:
x (Tensor): A tensor with shape math:`[*, N, N]`, The data type of the x should be one of ``float32``,
......@@ -2460,16 +2467,14 @@ def eig(x, name=None):
.. code-block:: python
import paddle
import numpy as np
paddle.device.set_device("cpu")
x_data = np.array([[1.6707249, 7.2249975, 6.5045543],
x = paddle.to_tensor([[1.6707249, 7.2249975, 6.5045543],
[9.956216, 8.749598, 6.066444 ],
[4.4251957, 1.7983172, 0.370647 ]]).astype("float32")
x = paddle.to_tensor(x_data)
[4.4251957, 1.7983172, 0.370647 ]])
w, v = paddle.linalg.eig(x)
print(w)
print(v)
# Tensor(shape=[3, 3], dtype=complex128, place=CPUPlace, stop_gradient=False,
# [[(-0.5061363550800655+0j) , (-0.7971760990842826+0j) ,
# (0.18518077798279986+0j)],
......@@ -2478,7 +2483,7 @@ def eig(x, name=None):
# [(-0.23142567697893396+0j), (0.4944999840400175+0j) ,
# (0.7058765252952796+0j) ]])
print(v)
print(w)
# Tensor(shape=[3], dtype=complex128, place=CPUPlace, stop_gradient=False,
# [ (16.50471283351188+0j) , (-5.5034820550763515+0j) ,
# (-0.21026087843552282+0j)])
......@@ -2520,8 +2525,8 @@ def eigvals(x, name=None):
For more information, please refer to :ref:`api_guide_Name`.
Returns:
Tensor: A tensor containing the unsorted eigenvalues which has the same batch dimensions with `x`.
The eigenvalues are complex-valued even when `x` is real.
Tensor, A tensor containing the unsorted eigenvalues which has the same batch
dimensions with `x`. The eigenvalues are complex-valued even when `x` is real.
Examples:
.. code-block:: python
......@@ -2662,18 +2667,17 @@ def eigh(x, UPLO='L', name=None):
property. For more information, please refer to :ref:`api_guide_Name`.
Returns:
out_value(Tensor): A Tensor with shape [*, N] and data type of float32 and float64. The eigenvalues of eigh op.
out_vector(Tensor): A Tensor with shape [*, N, N] and data type of float32,float64,complex64 and complex128. The eigenvectors of eigh op.
- out_value(Tensor): A Tensor with shape [*, N] and data type of float32 and float64.
The eigenvalues of eigh op.
- out_vector(Tensor): A Tensor with shape [*, N, N] and data type of float32,float64,
complex64 and complex128. The eigenvectors of eigh op.
Examples:
.. code-block:: python
import numpy as np
import paddle
x_data = np.array([[1, -2j], [2j, 5]])
x = paddle.to_tensor(x_data)
x = paddle.to_tensor([[1, -2j], [2j, 5]])
out_value, out_vector = paddle.linalg.eigh(x, UPLO='L')
print(out_value)
#[0.17157288, 5.82842712]
......@@ -3060,8 +3064,8 @@ def solve(x, y, name=None):
.. math::
Out = X^-1 * Y
Specifically,
- This system of linear equations has one solution if and only if input 'X' is invertible.
Specifically, this system of linear equations has one solution if and only if input 'X' is invertible.
Args:
x (Tensor): A square matrix or a batch of square matrices. Its shape should be `[*, M, M]`, where `*` is zero or
......@@ -3076,23 +3080,21 @@ def solve(x, y, name=None):
Its data type should be the same as that of `x`.
Examples:
.. code-block:: python
# a square system of linear equations:
# 2*X0 + X1 = 9
# X0 + 2*X1 = 8
.. code-block:: python
import paddle
import numpy as np
# a square system of linear equations:
# 2*X0 + X1 = 9
# X0 + 2*X1 = 8
import paddle
np_x = np.array([[3, 1],[1, 2]])
np_y = np.array([9, 8])
x = paddle.to_tensor(np_x, dtype="float64")
y = paddle.to_tensor(np_y, dtype="float64")
out = paddle.linalg.solve(x, y)
x = paddle.to_tensor([[3, 1],[1, 2]], dtype="float64")
y = paddle.to_tensor([9, 8], dtype="float64")
out = paddle.linalg.solve(x, y)
print(out)
# [2., 3.])
print(out)
# [2., 3.])
"""
if in_dygraph_mode():
return _C_ops.solve(x, y)
......@@ -3122,23 +3124,23 @@ def triangular_solve(
Input `x` and `y` are 2D matrices or batches of 2D matrices. If the inputs are batches, the outputs
are also batches.
Args:
x (Tensor): The input triangular coefficient matrix. Its shape should be `[*, M, M]`, where `*` is zero or
more batch dimensions. Its data type should be float32 or float64.
y (Tensor): Multiple right-hand sides of system of equations. Its shape should be `[*, M, K]`, where `*` is
zero or more batch dimensions. Its data type should be float32 or float64.
upper (bool, optional): Whether to solve the upper-triangular system of equations (default) or the lower-triangular
system of equations. Default: True.
transpose (bool, optional): whether `x` should be transposed before calculation. Default: False.
unitriangular (bool, optional): whether `x` is unit triangular. If True, the diagonal elements of `x` are assumed
to be 1 and not referenced from `x` . Default: False.
name(str, optional): Name for the operation (optional, default is None).
For more information, please refer to :ref:`api_guide_Name`.
Returns:
Tensor: The solution of the system of equations. Its data type should be the same as that of `x`.
Examples:
Args:
x (Tensor): The input triangular coefficient matrix. Its shape should be `[*, M, M]`, where `*` is zero or
more batch dimensions. Its data type should be float32 or float64.
y (Tensor): Multiple right-hand sides of system of equations. Its shape should be `[*, M, K]`, where `*` is
zero or more batch dimensions. Its data type should be float32 or float64.
upper (bool, optional): Whether to solve the upper-triangular system of equations (default) or the lower-triangular
system of equations. Default: True.
transpose (bool, optional): whether `x` should be transposed before calculation. Default: False.
unitriangular (bool, optional): whether `x` is unit triangular. If True, the diagonal elements of `x` are assumed
to be 1 and not referenced from `x` . Default: False.
name(str, optional): Name for the operation (optional, default is None).
For more information, please refer to :ref:`api_guide_Name`.
Returns:
Tensor: The solution of the system of equations. Its data type should be the same as that of `x`.
Examples:
.. code-block:: python
# a square system of linear equations:
......@@ -3146,12 +3148,7 @@ def triangular_solve(
# 2*x2 + x3 = -9
# -x3 = 5
import paddle
x = paddle.to_tensor([[1, 1, 1],
[0, 2, 1],
......@@ -3216,18 +3213,18 @@ def cholesky_solve(x, y, upper=False, name=None):
Tensor: The solution of the system of equations. Its data type is the same as that of `x`.
Examples:
.. code-block:: python
.. code-block:: python
import paddle
import paddle
u = paddle.to_tensor([[1, 1, 1],
[0, 2, 1],
[0, 0,-1]], dtype="float64")
b = paddle.to_tensor([[0], [-9], [5]], dtype="float64")
out = paddle.linalg.cholesky_solve(b, u, upper=True)
u = paddle.to_tensor([[1, 1, 1],
[0, 2, 1],
[0, 0,-1]], dtype="float64")
b = paddle.to_tensor([[0], [-9], [5]], dtype="float64")
out = paddle.linalg.cholesky_solve(b, u, upper=True)
print(out)
# [-2.5, -7, 9.5]
print(out)
# [-2.5, -7, 9.5]
"""
if in_dygraph_mode():
return _C_ops.cholesky_solve(x, y, upper)
......
......@@ -96,7 +96,7 @@ def logical_and(x, y, out=None, name=None):
out = x \&\& y
.. note::
Note:
``paddle.logical_and`` supports broadcasting. If you want to know more about broadcasting, please refer to :ref:`user_guide_broadcasting`.
Args:
......@@ -136,7 +136,7 @@ def logical_or(x, y, out=None, name=None):
out = x || y
.. note::
Note:
``paddle.logical_or`` supports broadcasting. If you want to know more about broadcasting, please refer to :ref:`user_guide_broadcasting`.
Args:
......@@ -178,7 +178,7 @@ def logical_xor(x, y, out=None, name=None):
out = (x || y) \&\& !(x \&\& y)
.. note::
Note:
``paddle.logical_xor`` supports broadcasting. If you want to know more about broadcasting, please refer to :ref:`user_guide_broadcasting`.
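For instance, a minimal sketch (broadcasting works the same way as for the other logical ops):
.. code-block:: python
    import paddle
    x = paddle.to_tensor([True, True, False, False])
    y = paddle.to_tensor([True, False, True, False])
    print(paddle.logical_xor(x, y))  # [False, True, True, False]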
Args:
......@@ -974,13 +974,6 @@ def isclose(x, y, rtol=1e-05, atol=1e-08, equal_nan=False, name=None):
Returns:
Tensor: ${out_comment}.
Raises:
TypeError: The data type of ``x`` must be one of float32, float64.
TypeError: The data type of ``y`` must be one of float32, float64.
TypeError: The type of ``rtol`` must be float.
TypeError: The type of ``atol`` must be float.
TypeError: The type of ``equal_nan`` must be bool.
Examples:
.. code-block:: python
......
......@@ -177,10 +177,6 @@ def slice(input, axes, starts, ends):
Returns:
Tensor: A ``Tensor``. The data type is same as ``input``.
Raises:
TypeError: The type of ``starts`` must be list, tuple or Tensor.
TypeError: The type of ``ends`` must be list, tuple or Tensor.
Examples:
.. code-block:: python
......@@ -510,9 +506,6 @@ def unstack(x, axis=0, num=None):
Returns:
list(Tensor): The unstacked Tensors list. The list elements are N-D Tensors of data types float32, float64, int32, int64.
Raises:
ValueError: If x.shape[axis] <= 0 or axis is not in range [-D, D).
Examples:
.. code-block:: python
......@@ -1229,8 +1222,10 @@ def broadcast_tensors(input, name=None):
"""
This OP broadcasts a list of tensors following broadcast semantics.
.. note::
If you want to know more about broadcasting, please refer to :ref:`user_guide_broadcasting`.
Note:
If you want to know more about broadcasting, please refer to `Introduction to Tensor`_ .
.. _Introduction to Tensor: ../../guides/beginner/tensor_en.html#chapter5-broadcasting-of-tensor
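A small sketch of this broadcasting behavior (illustrative only):
.. code-block:: python
    import paddle
    x1 = paddle.rand([1, 4])
    x2 = paddle.rand([3, 1])
    # both outputs are broadcast to the common shape [3, 4]
    out1, out2 = paddle.broadcast_tensors(input=[x1, x2])
    print(out1.shape, out2.shape)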
Args:
input (list|tuple): ``input`` is a Tensor list or Tensor tuple which is with data type bool,
......@@ -1545,10 +1540,6 @@ def flatten(x, start_axis=0, stop_axis=-1, name=None):
axes flattened by indicated start axis and end axis. \
A Tensor with data type same as input x.
Raises:
ValueError: If x is not a Tensor.
ValueError: If start_axis or stop_axis is illegal.
Examples:
.. code-block:: python
......@@ -2250,7 +2241,8 @@ def unique_consecutive(
r"""
Eliminates all but the first element from every consecutive group of equivalent elements.
.. note:: This function is different from :func:`paddle.unique` in the sense that this function
Note:
This function is different from :func:`paddle.unique` in the sense that this function
only eliminates consecutive duplicate values. Its semantics are similar to `std::unique` in C++.
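A quick sketch of the behavior described above (only adjacent duplicates are removed):
.. code-block:: python
    import paddle
    x = paddle.to_tensor([1, 1, 2, 2, 3, 1, 1, 2])
    print(paddle.unique_consecutive(x))  # [1, 2, 3, 1, 2]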
Args:
......@@ -4626,8 +4618,9 @@ def put_along_axis(arr, indices, values, axis, reduce='assign'):
indices (Tensor) : Indices to put along each 1d slice of arr. This must match the dimension of arr,
and need to broadcast against arr. Supported data type are int and int64.
axis (int) : The axis to put 1d slices along.
reduce (string | optinal) : The reduce operation, default is 'assign', support 'add', 'assign', 'mul' and 'multiply'.
Returns :
reduce (str, optional): The reduce operation, default is 'assign', support 'add', 'assign', 'mul' and 'multiply'.
Returns:
Tensor: The indexed element, same dtype with arr
Examples:
......
......@@ -4146,9 +4146,8 @@ def lerp_(x, y, weight, name=None):
def erfinv(x, name=None):
r"""
The inverse error function of x.
The inverse error function of x. Please refer to :ref:`api_paddle_erf`
Equation:
.. math::
erfinv(erf(x)) = x.
......@@ -4158,7 +4157,7 @@ def erfinv(x, name=None):
name (str, optional): Name for the operation (optional, default is None). For more information, please refer to :ref:`api_guide_Name`.
Returns:
out (Tensor): An N-D Tensor, the shape and data type is the same with input.
out (Tensor), an N-D Tensor, the shape and data type is the same with input.
Example:
.. code-block:: python
......@@ -4260,8 +4259,6 @@ def rad2deg(x, name=None):
def deg2rad(x, name=None):
r"""
Convert each of the elements of input x from degrees to angles in radians.
Equation:
.. math::
deg2rad(x)=\pi * x / 180
......@@ -4277,8 +4274,6 @@ def deg2rad(x, name=None):
.. code-block:: python
import paddle
import numpy as np
x1 = paddle.to_tensor([180.0, -180.0, 360.0, -360.0, 90.0, -90.0])
result1 = paddle.deg2rad(x1)
print(result1)
......@@ -4705,18 +4700,18 @@ def angle(x, name=None):
return out
def heaviside(x, y, name=None):
"""
r"""
Computes the Heaviside step function determined by the corresponding element in y for each element in x. The equation is
.. math::
heaviside(x, y)=
\left\{
\\begin{array}{lcl}
0,& &\\text{if} \ x < 0, \\\\
y,& &\\text{if} \ x = 0, \\\\
1,& &\\text{if} \ x > 0.
\begin{array}{lcl}
0,& &\text{if} \ x < 0, \\
y,& &\text{if} \ x = 0, \\
1,& &\text{if} \ x > 0.
\end{array}
\\right.
\right.
Note:
``paddle.heaviside`` supports broadcasting. If you want to know more about broadcasting, please refer to :ref:`user_guide_broadcasting`.
......@@ -4742,7 +4737,7 @@ def heaviside(x, y, name=None):
paddle.heaviside(x, y)
# [[0. , 0.20000000, 1. ],
# [0. , 1. , 0.30000001]]
"""
"""
op_type = 'elementwise_heaviside'
axis = -1
act = None
......
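A self-contained, hedged sketch of the broadcasting behaviour of ``paddle.heaviside`` described above; the inputs and the expected output are illustrative assumptions.

.. code-block:: python

    import paddle

    x = paddle.to_tensor([-0.5, 0.0, 0.5])
    y = paddle.to_tensor([0.1])  # broadcast against every element of x
    print(paddle.heaviside(x, y))
    # expected: [0.0, 0.1, 1.0], following the piecewise definition above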
......@@ -106,7 +106,7 @@ def setup(**attr):
If the above conditions are not met, the corresponding warning will be printed, and a fatal error may
occur because of ABI incompatibility.
.. note::
Note:
1. Currently we support Linux, MacOS and Windows platforms.
2. On Linux platform, we recommend using GCC 8.2 as the soft linking candidate of ``/usr/bin/cc``.
......@@ -266,7 +266,7 @@ def CppExtension(sources, *args, **kwargs):
)
.. note::
Note:
It is mainly used in ``setup`` and the name of the built shared library keeps the same
as the ``name`` argument specified in the ``setup`` interface.
......@@ -318,7 +318,7 @@ def CUDAExtension(sources, *args, **kwargs):
)
.. note::
Note:
It is mainly used in ``setup`` and the name of the built shared library keeps the same
as the ``name`` argument specified in the ``setup`` interface.
......@@ -329,7 +329,7 @@ def CUDAExtension(sources, *args, **kwargs):
**kwargs(dict[option], optional): Specify other arguments same as ``setuptools.Extension`` .
Returns:
setuptools.Extension: An instance of setuptools.Extension
setuptools.Extension: An instance of setuptools.Extension.
"""
kwargs = normalize_extension_kwargs(kwargs, use_cuda=True)
# Note(Aurelius84): While using `setup` and `jit`, the Extension `name` will
......@@ -840,7 +840,7 @@ def load(
``python setup.py install`` command. The interface contains all compiling and installing
process underground.
.. note::
Note:
1. Currently we support Linux, MacOS and Windows platforms.
2. On Linux platform, we recommend using GCC 8.2 as the soft linking candidate of ``/usr/bin/cc``.
......
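A hedged usage sketch of the ``load`` interface referenced above; the operator name and the source file names are placeholders, not files that exist in this patch.

.. code-block:: python

    from paddle.utils.cpp_extension import load

    # JIT-compiles the C++/CUDA sources and imports the resulting module.
    custom_ops = load(
        name='custom_jit_ops',
        sources=['relu_op.cc', 'relu_op.cu'],
        verbose=True)
    # The compiled operators become attributes of the returned module, e.g.
    # out = custom_ops.custom_relu(x)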
......@@ -116,11 +116,7 @@ class Cifar10(Dataset):
assert mode.lower() in [
'train',
'test',
'train',
'test',
], "mode should be 'train10', 'test10', 'train100' or 'test100', but got {}".format(
mode
)
], "mode.lower() should be 'train' or 'test', but got {}".format(mode)
self.mode = mode.lower()
if backend is None:
......
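A minimal sketch of the casing behaviour enforced by the rewritten assertion above; the sample count in the comment assumes the standard CIFAR-10 training split.

.. code-block:: python

    from paddle.vision.datasets import Cifar10

    # 'Train' is lower-cased internally, so it passes the assertion above.
    # The dataset archive is downloaded on first use.
    cifar10_train = Cifar10(mode='Train')
    print(len(cifar10_train))  # expected 50000 samples in the training split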
......@@ -175,16 +175,6 @@ def yolo_loss(
Returns:
Tensor: A 1-D tensor with shape [N], the value of yolov3 loss
Raises:
TypeError: Input x of yolov3_loss must be Tensor
TypeError: Input gtbox of yolov3_loss must be Tensor
TypeError: Input gtlabel of yolov3_loss must be Tensor
TypeError: Input gtscore of yolov3_loss must be None or Tensor
TypeError: Attr anchors of yolov3_loss must be list or tuple
TypeError: Attr class_num of yolov3_loss must be an integer
TypeError: Attr ignore_thresh of yolov3_loss must be a float number
TypeError: Attr use_label_smooth of yolov3_loss must be a bool value
Examples:
.. code-block:: python
......@@ -397,12 +387,6 @@ def yolo_box(
and a 3-D tensor with shape [N, M, :attr:`class_num`], the classification
scores of boxes.
Raises:
TypeError: Input x of yolov_box must be Tensor
TypeError: Attr anchors of yolo box must be list or tuple
TypeError: Attr class_num of yolo box must be an integer
TypeError: Attr conf_thresh of yolo box must be a float number
Examples:
.. code-block:: python
......@@ -957,9 +941,7 @@ def deform_conv2d(
Returns:
Tensor: The tensor variable storing the deformable convolution \
result. A Tensor with type float32, float64.
Raises:
ValueError: If the shapes of input, filter_size, stride, padding and
groups mismatch.
Examples:
.. code-block:: python
......
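A hedged sketch of calling the functional ``deform_conv2d`` whose docstring is trimmed above; shapes follow the NCHW convention and the expected output shape is an assumption for stride 1 and no padding.

.. code-block:: python

    import paddle
    from paddle.vision.ops import deform_conv2d

    x = paddle.rand((8, 1, 28, 28))
    kh, kw = 3, 3
    weight = paddle.rand((16, 1, kh, kw))
    # offset holds 2 values (dy, dx) per kernel position for every output location.
    offset = paddle.rand((8, 2 * kh * kw, 26, 26))
    out = deform_conv2d(x, offset, weight)
    print(out.shape)  # expected [8, 16, 26, 26]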