Unverified commit d4622844 authored by DanielYang, committed by GitHub

test=updata_docs, test=develop (#5691)

* test=updata_docs, test=develop

* test=updata_docs, test=develop
Parent 089dea74
@@ -5,18 +5,21 @@
PaddlePaddle's industrial-grade model zoo contains a large number of mainstream models refined through long-term industrial practice, as well as models that won international competitions. It provides multiple end-to-end development kits for scenarios such as semantic understanding, image classification, object detection, image segmentation, text recognition, and speech synthesis, meeting enterprises' needs for low-cost development and rapid integration. The model zoo is tailored to the actual R&D workflows of domestic enterprises and serves companies across energy, finance, industry, agriculture, and many other sectors.
## Recent Updates
**`2022-11-29`**: Updated the `release/2.4` branch; there are now 600+ official PaddlePaddle models and 260+ ecosystem models (counts are continuously updated).
**`2022-5-17`**: Updated the `release/2.3` branch; 500+ official PaddlePaddle models and 170+ ecosystem models.
**`2021-11-30`**: Updated the `release/2.2` branch with a systematic inventory of PaddlePaddle's official, research, and community models: 400+ official models and 100+ ecosystem models.
**`Note`**: Models in branches from `release/2.2` onward are all implemented with the dynamic graph. The `dev-static` branch still contains some static-graph model code; developers who need it can continue to switch to the `dev-static` branch.
## Main Contents
| Directory | Description |
| --- | --- |
| [Official models (official)](docs/official/README.md) | • Geared toward industrial practice, with 600+ models<br />• [PaddlePaddle PP-series models](docs/official/PP-Models.md), offering the best balance between performance and accuracy<br />• Supports dynamic-graph development of models for vision, NLP, speech, recommendation, and other domains<br />• Officially implemented by PaddlePaddle, with ongoing technical support and Q&A<br />• Aligned with PaddlePaddle core framework releases and fully tested |
| [Research models (research)](docs/research/README.md) | • Geared toward the academic frontier, with a focus on continuous updates for open problems<br />• Contributed mainly by PaddlePaddle's academic ecosystem partners |
| [Community models (community)](docs/community/README.md) | • Geared toward a wider variety of scenarios, with a focus on coverage of academic papers<br />• Contributed mainly by PaddlePaddle ecosystem developers, continuously updated |
## Join the PaddlePaddle Model Zoo Technical Discussion Group
- If you would like to keep up with the latest progress of the PaddlePaddle model zoo, or to discuss the key models of industrial practice with experienced developers, scan the code to join the PaddlePaddle model zoo discussion group:
...
# Community Model Zoo
PaddlePaddle currently includes 260+ community models covering CV, NLP, recommendation, and other domains, detailed in the tables below:
### Image Classification
@@ -9,7 +9,7 @@
<th>No.</th>
<th>Paper (link)</th>
<th>Abstract</th>
<th>Dataset</th>
<th width='10%'>Quick Start</th>
</tr>
<tr>
@@ -147,25 +147,116 @@
</tr>
<tr>
<td>20</td>
<td><a href="https://openaccess.thecvf.com/content/ICCV2021/html/Li_MicroNet_Improving_Image_Recognition_With_Extremely_Low_FLOPs_ICCV_2021_paper.html">MicroNet: Improving Image Recognition with Extremely Low FLOPs</a></td>
<td><details><summary>Abstract</summary><div>This paper aims at addressing the problem of substantial performance degradation at extremely low computational cost (e.g. 5M FLOPs on ImageNet classification). We found that two factors, sparse connectivity and dynamic activation function, are effective to improve the accuracy. The former avoids the significant reduction of network width, while the latter mitigates the detriment of reduction in network depth. Technically, we propose micro-factorized convolution, which factorizes a convolution matrix into low rank matrices, to integrate sparse connectivity into convolution. We also present a new dynamic activation function, named Dynamic Shift Max, to improve the non-linearity via maxing out multiple dynamic fusions between an input feature map and its circular channel shift. Building upon these two new operators, we arrive at a family of networks, named MicroNet, that achieves significant performance gains over the state of the art in the low FLOP regime. For instance, under the constraint of 12M FLOPs, MicroNet achieves 59.4% top-1 accuracy on ImageNet classification, outperforming MobileNetV3 by 9.6%. Source code is at https://github.com/liyunsheng13/micronet.</div></details></td>
<td>ImageNet1k dataset, MicroNet-M3 top1-acc 62.5%</td>
<td><a href="https://github.com/flytocc/PaddleClas_private/tree/micronet">Quick Start</a></td>
<tr>
<td>21</td>
<td><a href="https://arxiv.org/abs/2007.02269">Rethinking Bottleneck Structure for Efficient Mobile Network Design</a></td>
<td><details><summary>Abstract</summary><div>The inverted residual block is dominating architecture design for mobile networks recently. It changes the classic residual bottleneck by introducing two design rules: learning inverted residuals and using linear bottlenecks. In this paper, we rethink the necessity of such design changes and find it may bring risks of information loss and gradient confusion. We thus propose to flip the structure and present a novel bottleneck design, called the sandglass block, that performs identity mapping and spatial transformation at higher dimensions and thus alleviates information loss and gradient confusion effectively. Extensive experiments demonstrate that, different from the common belief, such bottleneck structure is more beneficial than the inverted ones for mobile networks. In ImageNet classification, by simply replacing the inverted residual block with our sandglass block without increasing parameters and computation, the classification accuracy can be improved by more than 1.7% over MobileNetV2. On Pascal VOC 2007 test set, we observe that there is also 0.9% mAP improvement in object detection. We further verify the effectiveness of the sandglass block by adding it into the search space of neural architecture search method DARTS. With 25% parameter reduction, the classification accuracy is improved by 0.13% over previous DARTS models. Code can be found at: this https URL.</div></details></td>
<td>ImageNet1k dataset, MobileNeXt-1.00 top1-acc 74.02%</td>
<td><a href="https://github.com/flytocc/PaddleClas/tree/mobilenext">Quick Start</a></td>
</tr>
<tr>
<td>22</td>
<td><a href="https://arxiv.org/abs/2103.15808">CvT: Introducing Convolutions to Vision Transformers</a></td>
<td><details><summary>Abstract</summary><div>We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (\ie shift, scale, and distortion invariance) while maintaining the merits of Transformers (\ie dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (\eg ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7\% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at \url{https://github.com/leoxiaobin/CvT}.</div></details></td>
<td>ImageNet1k dataset, CvT-13 top1-acc 81.6%</td>
<td><a href="https://github.com/flytocc/PaddleClas/tree/CvT">Quick Start</a></td>
</tr>
<tr>
<td>23</td>
<td><a href="https://arxiv.org/abs/1904.09730">An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection</a></td>
<td><details><summary>Abstract</summary><div>As DenseNet conserves intermediate features with diverse receptive fields by aggregating them with dense connection, it shows good performance on the object detection task. Although feature reuse enables DenseNet to produce strong features with a small number of model parameters and FLOPs, the detector with DenseNet backbone shows rather slow speed and low energy efficiency. We find the linearly increasing input channel by dense connection leads to heavy memory access cost, which causes computation overhead and more energy consumption. To solve the inefficiency of DenseNet, we propose an energy and computation efficient architecture called VoVNet comprised of One-Shot Aggregation (OSA). The OSA not only adopts the strength of DenseNet that represents diversified features with multi receptive fields but also overcomes the inefficiency of dense connection by aggregating all features only once in the last feature maps. To validate the effectiveness of VoVNet as a backbone network, we design both lightweight and large-scale VoVNet and apply them to one-stage and two-stage object detectors. Our VoVNet based detectors outperform DenseNet based ones with 2x faster speed and the energy consumptions are reduced by 1.6x - 4.1x. In addition to DenseNet, VoVNet also outperforms widely used ResNet backbone with faster speed and better energy efficiency. In particular, the small object detection performance has been significantly improved over DenseNet and ResNet.</div></details></td>
<td>ImageNet, VoVNet-39 top1 acc 0.7677</td>
<td><a href="https://github.com/renmada/PaddleClas/tree/vovnet">Quick Start</a></td>
</tr>
<tr>
<td>24</td>
<td><a href="https://arxiv.org/abs/2108.02456">Residual Attention: A Simple but Effective Method for Multi-Label Recognition</a></td>
<td><details><summary>Abstract</summary><div>Multi-label image recognition is a challenging computer vision task of practical use. Progresses in this area, however, are often characterized by complicated methods, heavy computations, and lack of intuitive explanations. To effectively capture different spatial regions occupied by objects from different categories, we propose an embarrassingly simple module, named class-specific residual attention (CSRA). CSRA generates class-specific features for every category by proposing a simple spatial attention score, and then combines it with the class-agnostic average pooling feature. CSRA achieves state-of-the-art results on multilabel recognition, and at the same time is much simpler than them. Furthermore, with only 4 lines of code, CSRA also leads to consistent improvement across many diverse pretrained models and datasets without any extra training. CSRA is both easy to implement and light in computations, which also enjoys intuitive explanations and visualizations.</div></details></td>
<td>VOC2007 dataset, resnet101, head num=1, mAP 94.7%</td>
<td><a href="https://github.com/CuberrChen/CSRA-Paddle">快速开始</a></td>
</tr>
<tr>
<td>25</td>
<td><a href="https://arxiv.org/abs/1702.00758">HashNet: Deep Learning to Hash by Continuation</a></td>
<td><details><summary>Abstract</summary><div>Learning to hash has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval, due to its computation efficiency and retrieval quality. Deep learning to hash, which improves retrieval quality by end-to-end representation learning and hash encoding, has received increasing attention recently. Subject to the ill-posed gradient difficulty in the optimization with sign activations, existing deep learning to hash methods need to first learn continuous representations and then generate binary hash codes in a separated binarization step, which suffer from substantial loss of retrieval quality. This work presents HashNet, a novel deep architecture for deep learning to hash by continuation method with convergence guarantees, which learns exactly binary hash codes from imbalanced similarity data. The key idea is to attack the ill-posed gradient problem in optimizing deep networks with non-smooth binary activations by continuation method, in which we begin from learning an easier network with smoothed activation function and let it evolve during the training, until it eventually goes back to being the original, difficult to optimize, deep network with the sign activation function. Comprehensive empirical evidence shows that HashNet can generate exactly binary hash codes and yield state-of-the-art multimedia retrieval performance on standard benchmarks.</div></details></td>
<td>MS COCO 16bits 0.6873, 32bits 0.7184, 48bits 0.7301, 64bits 0.7362</td>
<td><a href="https://github.com/hatimwen/paddle_hashnet">快速开始</a></td>
</tr>
<tr>
<td>26</td>
<td><a href="https://proceedings.neurips.cc/paper/2018/file/13f3cf8c531952d72e5847c4183e6910-Paper.pdf">Greedy Hash: Towards Fast Optimization for Accurate Hash Coding in CNN</a></td> <td><a href="https://proceedings.neurips.cc/paper/2018/file/13f3cf8c531952d72e5847c4183e6910-Paper.pdf">Greedy Hash: Towards Fast Optimization for Accurate Hash Coding in CNN</a></td>
<td><details><summary>Abstract</summary><div>To convert the input into binary code, hashing algorithm has been widely used for approximate nearest neighbor search on large-scale image sets due to its computation and storage efficiency. Deep hashing further improves the retrieval quality by combining the hash coding with deep neural network. However, a major difficulty in deep hashing lies in the discrete constraints imposed on the network output, which generally makes the optimization NP hard. In this work, we adopt the greedy principle to tackle this NP hard problem by iteratively updating the network toward the probable optimal discrete solution in each iteration. A hash coding layer is designed to implement our approach which strictly uses the sign function in forward propagation to maintain the discrete constraints, while in back propagation the gradients are transmitted intactly to the front layer to avoid the vanishing gradients. In addition to the theoretical derivation, we provide a new perspective to visualize and understand the effectiveness and efficiency of our algorithm. Experiments on benchmark datasets show that our scheme outperforms state-of-the-art hashing methods in both supervised and unsupervised tasks.</div></details></td> <td><details><summary>Abstract</summary><div>To convert the input into binary code, hashing algorithm has been widely used for approximate nearest neighbor search on large-scale image sets due to its computation and storage efficiency. Deep hashing further improves the retrieval quality by combining the hash coding with deep neural network. However, a major difficulty in deep hashing lies in the discrete constraints imposed on the network output, which generally makes the optimization NP hard. In this work, we adopt the greedy principle to tackle this NP hard problem by iteratively updating the network toward the probable optimal discrete solution in each iteration. A hash coding layer is designed to implement our approach which strictly uses the sign function in forward propagation to maintain the discrete constraints, while in back propagation the gradients are transmitted intactly to the front layer to avoid the vanishing gradients. In addition to the theoretical derivation, we provide a new perspective to visualize and understand the effectiveness and efficiency of our algorithm. Experiments on benchmark datasets show that our scheme outperforms state-of-the-art hashing methods in both supervised and unsupervised tasks.</div></details></td>
<td>cifar10(1) 12bits 0.766, 24bits 0.794, 32bit 0.803, 48bits 0.817</td> <td>cifar10(1) 12bits 0.774, 24bits 0.795, 32bit 0.810, 48bits 0.822</td>
<td><a href="https://github.com/hatimwen/paddle_greedyhash">快速开始</a></td> <td><a href="https://github.com/hatimwen/paddle_greedyhash">快速开始</a></td>
</tr> </tr>
<tr>
<td>27</td>
<td><a href="https://paperswithcode.com/paper/trusted-multi-view-classification-1">Trusted Multi-View Classification</a></td>
<td><details><summary>Abstract</summary><div>Multi-view classification (MVC) generally focuses on improving classification accuracy by using information from different views, typically integrating them into a unified comprehensive representation for downstream tasks. However, it is also crucial to dynamically assess the quality of a view for different samples in order to provide reliable uncertainty estimations, which indicate whether predictions can be trusted. To this end, we propose a novel multi-view classification method, termed trusted multi-view classification, which provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level. The algorithm jointly utilizes multiple views to promote both classification reliability and robustness by integrating evidence from each view. To achieve this, the Dirichlet distribution is used to model the distribution of the class probabilities, parameterized with evidence from different views and integrated with the Dempster-Shafer theory. The unified learning framework induces accurate uncertainty and accordingly endows the model with both reliability and robustness for out-of-distribution samples. Extensive experimental results validate the effectiveness of the proposed model in accuracy, reliability and robustness.</div></details></td>
<td>—</td>
<td><a href="https://github.com/MiuGod0126/TMC_Paddle">Quick Start</a></td>
</tr>
<tr>
<td>28</td>
<td><a href="https://arxiv.org/abs/1901.09891">See Better Before Looking Closer: Weakly Supervised Data Augmentation Network for Fine-Grained Visual Classification</a></td>
<td><details><summary>Abstract</summary><div>Data augmentation is usually adopted to increase the amount of training data, prevent overfitting and improve the performance of deep models. However, in practice, random data augmentation, such as random image cropping, is low-efficiency and might introduce many uncontrolled background noises. In this paper, we propose Weakly Supervised Data Augmentation Network (WS-DAN) to explore the potential of data augmentation. Specifically, for each training image, we first generate attention maps to represent the object's discriminative parts by weakly supervised learning. Next, we augment the image guided by these attention maps, including attention cropping and attention dropping. The proposed WS-DAN improves the classification accuracy in two folds. In the first stage, images can be seen better since more discriminative parts' features will be extracted. In the second stage, attention regions provide accurate location of object, which ensures our model to look at the object closer and further improve the performance. Comprehensive experiments in common fine-grained visual classification datasets show that our WS-DAN surpasses the state-of-the-art methods, which demonstrates its effectiveness.</div></details></td>
<td>WS-DAN CUB-200-2011 acc 89.4%, FGVC-Aircraft 93.0%, Stanford Cars 94.5%</td>
<td><a href="https://github.com/Victory8858/WS-DAN-Paddle">Quick Start</a></td>
</tr>
<tr>
<td>29</td>
<td><a href="https://arxiv.org/pdf/2107.10224v1.pdf">CycleMLP: A MLP-like Architecture for Dense Prediction</a></td>
<td><details><summary>Abstract</summary><div>This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions. As compared to modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are correlated to image size and thus are infeasible in object detection and segmentation, CycleMLP has two advantages compared to modern approaches. (1) It can cope with various image sizes. (2) It achieves linear computational complexity to image size by using local windows. In contrast, previous MLPs have O(N2) computations due to fully spatial connections. We build a family of models which surpass existing MLPs and even state-of-the-art Transformer-based models, e.g., Swin Transformer, while using fewer parameters and FLOPs. We expand the MLP-like models' applicability, making them a versatile backbone for dense prediction tasks. CycleMLP achieves competitive results on object detection, instance segmentation, and semantic segmentation. In particular, CycleMLP-Tiny outperforms Swin-Tiny by 1.3% mIoU on ADE20K dataset with fewer FLOPs. Moreover, CycleMLP also shows excellent zero-shot robustness on ImageNet-C dataset. Code is available at https://github.com/ShoufaChen/CycleMLP.</div></details></td>
<td>ImageNet, CycleMLP-B1 top1 acc 78.9%</td>
<td><a href="https://github.com/flytocc/CycleMLP-paddle">Quick Start</a></td>
</tr>
<tr>
<td>30</td>
<td><a href="https://arxiv.org/abs/2201.03545">A ConvNet for the 2020s</a></td>
<td><details><summary>Abstract</summary><div>The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.</div></details></td>
<td>ConvNeXt-T top1 acc 0.821</td>
<td><a href="https://github.com/flytocc/ConvNeXt-paddle">快速开始</a></td>
</tr>
<tr>
<td>31</td>
<td><a href="https://arxiv.org/pdf/2202.09741.pdf">Visual Attention Network</a></td>
<td><details><summary>Abstract</summary><div>While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple, VAN surpasses similar size vision transformers(ViTs) and convolutional neural networks(CNNs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, pose estimation, etc. For example, VAN-B6 achieves 87.8% accuracy on ImageNet benchmark and set new state-of-the-art performance (58.2 PQ) for panoptic segmentation. Besides, VAN-B2 surpasses Swin-T 4% mIoU (50.1 vs. 46.1) for semantic segmentation on ADE20K benchmark, 2.6% AP (48.8 vs. 46.2) for object detection on COCO dataset. It provides a novel method and a simple yet strong baseline for the community. Code is available at https://github.com/Visual-Attention-Network.</div></details></td>
<td>VAN-Tiny top1 acc 0.754</td>
<td><a href="https://github.com/flytocc/PaddleClas_VAN-Classification">快速开始</a></td>
</tr>
<tr>
<td>32</td>
<td><a href="https://arxiv.org/pdf/1804.06882.pdf">Pelee: A Real-Time Object Detection System on Mobile Devices</a></td>
<td><details><summary>Abstract</summary><div>An increasing need of running Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory resource encourages studies on efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and MobileNetV2. However, all these models are heavily dependent on depthwise separable convolution which lacks efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves a higher accuracy and over 1.8 times faster speed than MobileNet and MobileNetV2 on NVIDIA TX2. Meanwhile, PeleeNet is only 66% of the model size of MobileNet. We then propose a real-time object detection system by combining PeleeNet with Single Shot MultiBox Detector (SSD) method and optimizing the architecture for fast speed. Our proposed detection system2, named Pelee, achieves 76.4% mAP (mean average precision) on PASCAL VOC2007 and 22.4 mAP on MS COCO dataset at the speed of 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2. The result on COCO outperforms YOLOv2 in consideration of a higher precision, 13.6 times lower computational cost and 11.3 times smaller model size.</div></details></td>
<td>PeleeNet top1 acc 0.726</td>
<td><a href="https://github.com/flytocc/PeleeNet-paddle">Quick Start</a></td>
</tr>
<tr>
<td>33</td>
<td><a href="https://arxiv.org/abs/2104.10858">All Tokens Matter: Token Labeling for Training Better Vision Transformers</a></td>
<td><details><summary>Abstract</summary><div>In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, delivering the minimal-sized model among previous models (250M+) reaching 86%. We also show that token labeling can clearly improve the generalization capability of the pre-trained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.</div></details></td>
<td>LV-ViT-T @ImageNet val top1 acc=79.1%</td>
<td><a href="https://github.com/flytocc/TokenLabeling-paddle">快速开始</a></td>
</tr>
<tr>
<td>34</td>
<td><a href="https://openaccess.thecvf.com/content_CVPR_2019/html/Chen_Destruction_and_Construction_Learning_for_Fine-Grained_Image_Recognition_CVPR_2019_paper.html">Destruction and Construction Learning for Fine-grained Image Recognition</a></td>
<td><details><summary>Abstract</summary><div>Delicate feature representation about object parts plays a critical role in fine-grained recognition. For example, experts can even distinguish fine-grained objects relying only on object parts according to professional knowledge. In this paper, we propose a novel "Destruction and Construction Learning" (DCL) method to enhance the difficulty of fine-grained recognition and exercise the classification model to acquire expert knowledge. Besides the standard classification backbone network, another "destruction and construction" stream is introduced to carefully "destruct" and then "reconstruct" the input image, for learning discriminative regions and features. More specifically, for "destruction", we first partition the input image into local regions and then shuffle them by a Region Confusion Mechanism (RCM). To correctly recognize these destructed images, the classification network has to pay more attention to discriminative regions for spotting the differences. To compensate the noises introduced by RCM, an adversarial loss, which distinguishes original images from destructed ones, is applied to reject noisy patterns introduced by RCM. For "construction", a region alignment network, which tries to restore the original spatial layout of local regions, is followed to model the semantic correlation among local regions. By jointly training with parameter sharing, our proposed DCL injects more discriminative local details to the classification network. Experimental results show that our proposed framework achieves state-of-the-art performance on three standard benchmarks. Moreover, our proposed method does not need any external knowledge during training, and there is no computation overhead at inference time except the standard classification network feed-forwarding. Source code: https://github.com/JDAI-CV/DCL.</div></details></td>
<td>ResNet50, CUB-200-2011 acc 87.8%, Stanford Cars 94.5%, FGVC-Aircraft 93.0%</td>
<td><a href="https://github.com/zzc98/PaddlePaddle_DCL">快速开始</a></td>
</tr>
<tr>
<td>35</td>
<td><a href="https://arxiv.org/pdf/2007.08461.pdf">How to Trust Unlabeled Data? Instance Credibility Inference for Few-Shot Learning</a></td>
<td><details><summary>Abstract</summary><div>Deep learning based models have excelled in many computer vision tasks and appear to surpass humans' performance. However, these models require an avalanche of expensive human labeled training data and many iterations to train their large number of parameters. This severely limits their scalability to the real-world long-tail distributed categories, some of which are with a large number of instances, but with only a few manually annotated. Learning from such extremely limited labeled examples is known as Few-shot learning (FSL). Different to prior arts that leverage meta-learning or data augmentation strategies to alleviate this extremely data-scarce problem, this paper presents a statistical approach, dubbed Instance Credibility Inference (ICI) to exploit the support of unlabeled instances for few-shot visual recognition. Typically, we repurpose the self-taught learning paradigm to predict pseudo-labels of unlabeled instances with an initial classifier trained from the few shot and then select the most confident ones to augment the training set to re-train the classifier. This is achieved by constructing a (Generalized) Linear Model (LM/GLM) with incidental parameters to model the mapping from (un-)labeled features to their (pseudo-)labels, in which the sparsity of the incidental parameters indicates the credibility of the corresponding pseudo-labeled instance. We rank the credibility of pseudo-labeled instances along the regularization path of their corresponding incidental parameters, and the most trustworthy pseudo-labeled examples are preserved as the augmented labeled instances. Theoretically, under mild conditions of restricted eigenvalue, irrepresentability, and large error, our approach is guaranteed to collect all the correctly-predicted instances from the noisy pseudo-labeled set.</div></details></td>
<td>mini-ImageNet, semi ICIR 1shot 73.12%, 5shot 83.28%</td>
<td><a href="https://github.com/renmada/ICI-paddle">快速开始</a></td>
</tr>
</table>
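
The GreedyHash entry above (No. 26) describes its core trick in a single sentence: the hash-coding layer applies the sign function in the forward pass and transmits gradients to the preceding layer unchanged in the backward pass. The snippet below is a minimal, hypothetical PaddlePaddle sketch of just that straight-through behavior; it is not taken from the linked repository, and names such as `HashHead` and `code_len` are invented for illustration.

```python
# Hypothetical sketch of the sign-forward / identity-backward trick described in
# the GreedyHash abstract above. Not code from the linked repository.
import paddle
import paddle.nn as nn


class HashHead(nn.Layer):
    """Maps a feature vector to a strictly binary hash code."""

    def __init__(self, feat_dim=2048, code_len=48):
        super().__init__()
        self.fc = nn.Linear(feat_dim, code_len)  # continuous hash logits

    def forward(self, feat):
        h = self.fc(feat)
        # Forward value equals sign(h) (the detached terms cancel numerically),
        # while the gradient w.r.t. h is the identity: a straight-through estimator.
        return h + (paddle.sign(h) - h).detach()


if __name__ == "__main__":
    head = HashHead(feat_dim=8, code_len=4)
    x = paddle.randn([2, 8])
    codes = head(x)            # entries are -1, 0, or +1 (0 only when h is exactly 0)
    codes.sum().backward()     # gradients still reach head.fc despite the sign()
    print(codes)
```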
### Object Detection
@@ -230,7 +321,7 @@
<td>8</td>
<td><a href="https://arxiv.org/pdf/1502.03044.pdf">Show, Attend and Tell: Neural Image Caption Generation with Visual Attention</a></td>
<td><details><summary>Abstract</summary><div>We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.</div></details></td>
<td>bleu-1: 67%, bleu-2: 45.7%, bleu-3: 31.4%, bleu-4: 21.3%</td>
<td><a href="https://github.com/Lieberk/Paddle-VA-Captioning">Quick Start</a></td>
</tr>
<tr>
@@ -517,6 +608,178 @@
<td>Prostate dataset, Dice coefficient: 0.869 (metric reported in the paper)</td>
<td><a href="https://github.com/YellowLight021/Vnet">Quick Start</a></td>
</tr>
<tr>
<td>29</td>
<td><a href="https://arxiv.org/abs/2107.00782">Polarized Self-Attention: Towards High-quality Pixel-wise Regression</a></td>
<td><details><summary>Abstract</summary><div>Pixel-wise regression is probably the most common problem in fine-grained computer vision tasks, such as estimating keypoint heatmaps and segmentation masks. These regression problems are very challenging particularly because they require, at low computation overheads, modeling long-range dependencies on high-resolution inputs/outputs to estimate the highly nonlinear pixel-wise semantics. While attention mechanisms in Deep Convolutional Neural Networks(DCNNs) has become popular for boosting long-range dependencies, element-specific attention, such as Nonlocal blocks, is highly complex and noise-sensitive to learn, and most of simplified attention hybrids try to reach the best compromise among multiple types of tasks. In this paper, we present the Polarized Self-Attention(PSA) block that incorporates two critical designs towards high-quality pixel-wise regression: (1) Polarized filtering: keeping high internal resolution in both channel and spatial attention computation while completely collapsing input tensors along their counterpart dimensions. (2) Enhancement: composing non-linearity that directly fits the output distribution of typical fine-grained regression, such as the 2D Gaussian distribution (keypoint heatmaps), or the 2D Binormial distribution (binary segmentation masks). PSA appears to have exhausted the representation capacity within its channel-only and spatial-only branches, such that there is only marginal metric differences between its sequential and parallel layouts. Experimental results show that PSA boosts standard baselines by 2−4 points, and boosts state-of-the-arts by 1−2 points on 2D pose estimation and semantic segmentation benchmarks.</div></details></td>
<td>Dataset: Cityscapes val set. Acceptance criteria: 1. HRNetV2-OCR+PSA(s) mIoU = 86.7%, see Table 5 of the paper; 2. logs include periodic validation and loss results; 3. merged into PaddleSeg after reproduction</td>
<td><a href="https://github.com/marshall-dteach/psanet-main">Quick Start</a></td>
</tr>
<tr>
<td>30</td>
<td><a href="https://arxiv.org/pdf/2109.03201">nnFormer: Interleaved Transformer for Volumetric Segmentation</a></td>
<td><details><summary>Abstract</summary><div>Transformer, the model of choice for naturallanguage processing, has drawn scant attention from themedical imaging community. Given the ability to exploitlong-term dependencies, transformers are promising tohelp atypical convolutional neural networks to overcometheir inherent shortcomings of spatial inductive bias. However, most of recently proposed transformer-based segmentation approaches simply treated transformers as assisted modules to help encode global context into convolutional representations. To address this issue, we introducennFormer (i.e., not-another transFormer), a 3D transformerfor volumetric medical image segmentation. nnFormer notonly exploits the combination of interleaved convolutionand self-attention operations, but also introduces localand global volume-based self-attention mechanism to learnvolume representations. Moreover, nnFormer proposes touse skip attention to replace the traditional concatenation/summation operations in skip connections in U-Netlike architecture. Experiments show that nnFormer significantly outperforms previous transformer-based counterparts by large margins on three public datasets. Comparedto nnUNet, nnFormer produces significantly lower HD95and comparable DSC results. Furthermore, we show thatnnFormer and nnUNet are highly complementary to eachother in model ensembling. Codes and models of nnFormerare available at https://git.io/JSf3i.</div></details></td>
<td>Dataset: ACDC (register and download at https://acdc.creatis.insa-lyon.fr/#phase/5846c3ab6a3c7735e84b67f2). Acceptance criteria: 1. Dice = 91.78%, matching the implementation in Table 3 of the paper; 2. training includes periodic val-set evaluation results and losses; 3. merged into MedicalSeg in PaddleSeg after reproduction</td>
<td><a href="https://github.com/YellowLight021/paddle_nnformer">Quick Start</a></td>
</tr>
<tr>
<td>31</td>
<td><a href="https://arxiv.org/abs/2105.05537">Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation</a></td>
<td><details><summary>Abstract</summary><div>In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. Especially, the deep neural networks based on U-shaped architecture and skip-connections have been widely applied in a variety of medical image tasks. However, although CNN has achieved excellent performance, it cannot learn global and long-range semantic information interaction well due to the locality of the convolution operation. In this paper, we propose Swin-Unet, which is an Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning. Specifically, we use hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with patch expanding layer is designed to perform the up-sampling operation to restore the spatial resolution of the feature maps. Under the direct down-sampling and up-sampling of the inputs and outputs by 4x, experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms those methods with full-convolution or the combination of transformer and convolution. The codes and trained models will be publicly available at this https URL.</div></details></td>
<td>Dataset: Synapse (contact jienengchen01@gmail.com for the preprocessed data link, or download from the Synapse website). Acceptance criteria: 1. Avg Dice = 79.13%, matching the implementation in Table 1 of the paper; 2. training includes periodic val-set evaluation results and losses; 3. merged into MedicalSeg in PaddleSeg after reproduction</td>
<td><a href="https://github.com/marshall-dteach/SwinUNet">Quick Start</a></td>
</tr>
<tr>
<td>32</td>
<td><a href="https://arxiv.org/pdf/2105.10860.pdf">FCCDN: Feature Constraint Network for VHR Image Change Detection</a></td>
<td><details><summary>Abstract</summary><div>Change detection is the process of identifying pixelwise differences in bitemporal co-registered images. It is of great significance to Earth observations. Recently, with the emergence of deep learning (DL), the power and feasibility of deep convolutional neural network (CNN)-based methods have been shown in the field of change detection. However, there is still a lack of effective supervision for change feature learning. In this work, a feature constraint change detection network (FCCDN) is proposed. We constrain features both in bitemporal feature extraction and feature fusion. More specifically, we propose a dual encoder-decoder network backbone for the change detection task. At the center of the backbone, we design a nonlocal feature pyramid network to extract and fuse multiscale features. To fuse bitemporal features in a robust way, we build a dense connection-based feature fusion module. Moreover, a self-supervised learning-based strategy is proposed to constrain feature learning. Based on FCCDN, we achieve state-of-the-art performance on two building change detection datasets (LEVIR-CD and WHU). On the LEVIR-CD dataset, we achieve an IoU of 0.8569 and an F1 score of 0.9229. On the WHU dataset, we achieve an IoU of 0.8820 and an F1 score of 0.9373. Moreover, for the first time, the acquisition of accurate bitemporal semantic segmentation results is achieved without using semantic segmentation labels. This is vital for the application of change detection because it saves the cost of labeling.</div></details></td>
<td>Dataset: LEVIR-CD. Acceptance criteria: 1. FCCDN (512) F1 = 92.29%, see Table 3 of the paper; 2. logs include periodic validation and loss results; 3. merged into PaddleRS after reproduction</td>
<td><a href="https://github.com/liuxtakeoff/FCCDN_paddle">Quick Start</a></td>
</tr>
<tr>
<td>33</td>
<td><a href="https://arxiv.org/pdf/2201.01293.pdf">A Transformer-Based Siamese Network for Change Detection</a></td>
<td><details><summary>Abstract</summary><div>This paper presents a transformer-based Siamese network architecture (abbreviated by ChangeFormer) for Change Detection (CD) from a pair of co-registered remote sensing images. Different from recent CD frameworks, which are based on fully convolutional networks (ConvNets), the proposed method unifies hierarchically structured transformer encoder with Multi-Layer Perception (MLP) decoder in a Siamese network architecture to efficiently render multi-scale long-range details required for accurate CD. Experiments on two CD datasets show that the proposed end-to-end trainable ChangeFormer architecture achieves better CD performance than previous counterparts. Our code is available at https://github.com/wgcban/ChangeFormer.</div></details></td>
<td>Dataset: LEVIR-CD. Acceptance criteria: 1. ChangeFormer F1 = 90.4%, see Table 1 of the paper; 2. logs include periodic validation and loss results; 3. merged into PaddleRS after reproduction</td>
<td><a href="https://github.com/HULEIYI/ChangeFormer-pd">Quick Start</a></td>
</tr>
<tr>
<td>34</td>
<td><a href="https://arxiv.org/pdf/2102.04306v1.pdf">TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation</a></td>
<td><details><summary>Abstract</summary><div>Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation. Code and models are available at https://github.com/Beckschen/TransUNet.</div></details></td>
<td>Dataset: Synapse (contact jienengchen01@gmail.com for the preprocessed data link, or download from the Synapse website). Acceptance criteria: 1. Avg Dice = 77.48%, matching the implementation in Table 1 of the paper; 2. training includes periodic val-set evaluation results and losses; 3. merged into MedicalSeg in PaddleSeg after reproduction</td>
<td><a href="https://github.com/YellowLight021/TransUnetPaddle">Quick Start</a></td>
</tr>
<tr>
<td>35</td>
<td><a href="https://ieeexplore.ieee.org/abstract/document/9497514">FactSeg: Foreground Activation Driven Small Object Semantic Segmentation in Large-Scale Remote Sensing Imagery</a></td>
<td><details><summary>Abstract</summary><div>The small object semantic segmentation task is aimed at automatically extracting key objects from high-resolution remote sensing (HRS) imagery. Compared with the large-scale coverage areas for remote sensing imagery, the key objects, such as cars and ships, in HRS imagery often contain only a few pixels. In this article, to tackle this problem, the foreground activation (FA)-driven small object semantic segmentation (FactSeg) framework is proposed from perspectives of structure and optimization. In the structure design, FA object representation is proposed to enhance the awareness of the weak features in small objects. The FA object representation framework is made up of a dual-branch decoder and collaborative probability (CP) loss. In the dual-branch decoder, the FA branch is designed to activate the small object features (activation) and suppress the large-scale background, and the semantic refinement (SR) branch is designed to further distinguish small objects (refinement). The CP loss is proposed to effectively combine the activation and refinement outputs of the decoder under the CP hypothesis. During the collaboration, the weak features of the small objects are enhanced with the activation output, and the refined output can be viewed as the refinement of the binary outputs. In the optimization stage, small object mining (SOM)-based network optimization is applied to automatically select effective samples and refine the direction of the optimization while addressing the imbalanced sample problem between the small objects and the large-scale background. The experimental results obtained with two benchmark HRS imagery segmentation datasets demonstrate that the proposed framework outperforms the state-of-the-art semantic segmentation methods and achieves a good tradeoff between accuracy and efficiency. Code will be available at: http://rsidea.whu.edu.cn/FactSeg.htm</div></details></td>
<td>Dataset: iSAID (https://captain-whu.github.io/iSAID/dataset.html). Acceptance criteria: 1. FactSeg ResNet-50 mIoU = 64.79%, following Table 4 of the paper; 2. logs include periodic validation and loss results; 3. merged into PaddleRS after reproduction</td>
<td><a href="https://github.com/LHE-IT/FactSeg_paddle/">Quick Start</a></td>
</tr>
<tr>
<td>36</td>
<td><a href="https://hszhao.github.io/papers/eccv18_psanet.pdf">PSANet: Point-wise Spatial Attention Network for Scene Parsing</a></td>
<td><details><summary>Abstract</summary><div>We notice information flow in convolutional neural networks is restricted inside local neighborhood regions due to the physical design of convolutional filters, which limits the overall understanding of complex scenes. In this paper, we propose the point-wise spatial attention network (PSANet) to relax the local neighborhood constraint. Each position on the feature map is connected to all the other ones through a self-adaptively learned attention mask. Moreover, information propagation in bi-direction for scene parsing is enabled. Information at other positions can be collected to help the prediction of the current position and vice versa, information at the current position can be distributed to assist the prediction of other ones. Our proposed approach achieves top performance on various competitive scene parsing datasets, including ADE20K, PASCAL VOC 2012 and Cityscapes, demonstrating its effectiveness and generality.</div></details></td>
<td>Dataset: Cityscapes val set. Acceptance criteria: 1. PSANet-ResNet50, input resolution 512x1024, mIoU = 77.24%, see https://github.com/open-mmlab/mmsegmentation/tree/master/configs/psanet; 2. logs include periodic validation and loss results; 3. merged into PaddleSeg after reproduction</td>
<td><a href="https://github.com/justld/PSANet_paddle">Quick Start</a></td>
</tr>
<tr>
<td>37</td>
<td><a href="https://arxiv.org/abs/1811.11721">CCNet: Criss-Cross Attention for Semantic Segmentation</a></td>
<td><details><summary>Abstract</summary><div>Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet is with the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11x less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85% of the non-local block. 3) The state-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes, ADE20K, human parsing benchmark LIP, instance segmentation benchmark COCO, video segmentation benchmark CamVid. In particular, our CCNet achieves the mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are the new state-of-the-art results. The source codes are available at \url{https://github.com/speedinghzl/CCNet}.</div></details></td>
<td>Dataset: Cityscapes val set. Acceptance criteria: 1. CCNet-ResNet101, R=2+OHEM, mIoU = 80.0%, see https://github.com/speedinghzl/CCNet/tree/pure-python; 2. logs include periodic validation and loss results; 3. merged into PaddleSeg after reproduction</td>
<td><a href="https://github.com/justld/CCNet_paddle">Quick Start</a></td>
</tr>
<tr>
<td>38</td>
<td><a href="https://arxiv.org/abs/2101.06085">Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes</a></td>
<td><details><summary>Abstract</summary><div>Semantic segmentation is a key technology for autonomous vehicles to understand the surrounding scenes. The appealing performances of contemporary models usually come at the expense of heavy computations and lengthy inference time, which is intolerable for self-driving. Using light-weight architectures (encoder-decoder or two-pathway) or reasoning on low-resolution images, recent methods realize very fast scene parsing, even running at more than 100 FPS on a single 1080Ti GPU. However, there is still a significant gap in performance between these real-time methods and the models based on dilation backbones. To tackle this problem, we proposed a family of efficient backbones specially designed for real-time semantic segmentation. The proposed deep dual-resolution networks (DDRNets) are composed of two deep branches between which multiple bilateral fusions are performed. Additionally, we design a new contextual information extractor named Deep Aggregation Pyramid Pooling Module (DAPPM) to enlarge effective receptive fields and fuse multi-scale context based on low-resolution feature maps. Our method achieves a new state-of-the-art trade-off between accuracy and speed on both Cityscapes and CamVid dataset. In particular, on a single 2080Ti GPU, DDRNet-23-slim yields 77.4% mIoU at 102 FPS on Cityscapes test set and 74.7% mIoU at 230 FPS on CamVid test set. With widely used test augmentation, our method is superior to most state-of-the-art models and requires much less computation. Codes and trained models are available online.</div></details></td>
<td>Dataset: Cityscapes val set. Acceptance criteria: 1. DDRNet-23 mIoU = 79.5%, see Table 4 of the paper; 2. logs include periodic validation and loss results; 3. merged into PaddleSeg after reproduction</td>
<td><a href="https://github.com/justld/DDRNet_paddle">Quick Start</a></td>
</tr>
<tr>
<td>39</td>
<td><a href="https://arxiv.org/abs/1809.10486">nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation</a></td>
<td><details><summary>Abstract</summary><div>The U-Net was presented in 2015. With its straight-forward and successful architecture it quickly evolved to a commonly used benchmark in medical image segmentation. The adaptation of the U-Net to novel problems, however, comprises several degrees of freedom regarding the exact architecture, preprocessing, training and inference. These choices are not independent of each other and substantially impact the overall performance. The present paper introduces the nnU-Net ('no-new-Net'), which refers to a robust and self-adapting framework on the basis of 2D and 3D vanilla U-Nets. We argue the strong case for taking away superfluous bells and whistles of many proposed network designs and instead focus on the remaining aspects that make out the performance and generalizability of a method. We evaluate the nnU-Net in the context of the Medical Segmentation Decathlon challenge, which measures segmentation performance in ten disciplines comprising distinct entities, image modalities, image geometries and dataset sizes, with no manual adjustments between datasets allowed. At the time of manuscript submission, nnU-Net achieves the highest mean dice scores across all classes and seven phase 1 tasks (except class 1 in BrainTumour) in the online leaderboard of the challenge.</div></details></td>
<td>数据集 MSD-Lung:https://drive.google.com/drive/folders/1HqEgzS8BV2c7xYNrZdEAnrHk7osJJ--2 验收指标:1. 3DUnet-cascade Avg Dice = 66.85%,ensemble 2DUnet + 3DUnet Avg Dice = 61.18%,对应论文 Table.2 中实现;2. 训练中包含周期性 valset 的评估结果和损失;3. 复现后合入 PaddleSeg 中 MedicalSeg</td>
<td><a href="https://github.com/justld/nnunet_paddle">快速开始</a></td>
</tr>
<tr>
<td>40</td>
<td><a href="https://paperswithcode.com/paper/unetr-transformers-for-3d-medical-image">UNETR: Transformers for 3D Medical Image</a></td>
<td><details><summary>Abstract</summary><div>Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have shown prominence for the majority of medical image segmentation applications since the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations which can be utilized for semantic output prediction by the decoder. Despite their success, the locality of convolutional layers in FCNNs, limits the capability of learning long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed as UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard. Code: https://monai.io/research/unetr</div></details></td>
<td>-</td>
<td><a href="https://github.com/sun222/PaddleSeg/tree/release/2.5/contrib/MedicalSeg">快速开始</a></td>
</tr>
</table>
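上表中图像分割 / 医学图像分割模型多以 mIoU 或 Dice 作为验收指标。下面给出一个由混淆矩阵计算各类 IoU、mIoU 与 Dice 的最小示意(仅依赖 NumPy;函数名 `compute_miou_dice` 为本文示例自拟,并非 PaddleSeg 的实际接口,复现时请以各套件自带的评估脚本为准):

```python
import numpy as np

def compute_miou_dice(pred, label, num_classes):
    """由展平后的类别预测与标签计算各类 IoU、mIoU 与各类 Dice(示意实现)。"""
    pred = np.asarray(pred).reshape(-1)
    label = np.asarray(label).reshape(-1)
    # 混淆矩阵:行为真实类别,列为预测类别
    cm = np.bincount(label * num_classes + pred,
                     minlength=num_classes * num_classes).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)          # 各类 IoU
    dice = 2 * tp / np.maximum(2 * tp + fp + fn, 1)  # 各类 Dice
    return iou, iou.mean(), dice

if __name__ == "__main__":
    pred = np.array([0, 1, 1, 2, 2, 2])
    label = np.array([0, 1, 2, 2, 2, 1])
    iou, miou, dice = compute_miou_dice(pred, label, num_classes=3)
    print("IoU:", iou, "mIoU:", miou, "Dice:", dice)
```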
### OCR
<table>
<tr>
<th>序号</th>
<th>论文名称(链接)</th>
<th>摘要</th>
<th>数据集</th>
<th width='10%'>快速开始</th>
</tr>
<tr>
<td>1</td>
<td><a href="https://arxiv.org/pdf/1609.03605v1.pdf">Detecting Text in Natural Image with Connectionist Text Proposal Network</a></td>
<td><details><summary>Abstract</summary><div>We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural image. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of image, making it powerful to detect extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods requiring multi-step post-processing. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8, 35] by a large margin. The CTPN is computationally efficient with 0.14s/image, by using the very deep VGG16 model [27]. Online demo is available at: http://textdet.com/.</div></details></td>
<td>icdar2015: 0.61</td>
<td><a href="https://github.com/BADBADBADBOY/paddle.ctpn">快速开始</a></td>
</tr>
<tr>
<td>2</td>
<td><a href="Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network (zhihu.com)">Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network</a></td>
<td><details><summary>Abstract</summary><div>Attention-based scene text recognizers have gained huge success, which leverages a more compact intermediate representation to learn 1d- or 2d-attention by a RNN-based encoder-decoder architecture. However, such methods suffer from attention-drift problem because high similarity among encoded features leads to attention confusion under the RNN-based local attention mechanism. Moreover, RNN-based methods have low efficiency due to poor parallelization. To overcome these problems, we propose the MASTER, a self-attention based scene text recognizer that (1) not only encodes the input-output attention but also learns self-attention which encodes feature-feature and target-target relationships inside the encoder and decoder and (2) learns a more powerful and robust intermediate representation to spatial distortion, and (3) owns a great training efficiency because of high training parallelization and a high-speed inference because of an efficient memory-cache mechanism. Extensive experiments on various benchmarks demonstrate the superior performance of our MASTER on both regular and irregular scene text. Pytorch code can be found at this https URL, and Tensorflow code can be found at this https URL.</div></details></td>
<td>ResNet18  ctw1500 0.806</td>
<td><a href="https://github.com/JennyVanessa/PANet-Paddle">快速开始</a></td>
</tr>
<tr>
<td>3</td>
<td><a href="MASTER: Multi-Aspect Non-local Network for Scene Text Recognition (arxiv.org)">MASTER: Multi-Aspect Non-local Network for Scene Text Recognition</a></td>
<td><details><summary>Abstract</summary><div>Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest. The existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary matching strategies. This paper presents a simple and efficient framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-alike architecture. To tackle the essential visual difference between time and space, we make three important improvements over the original transformer detection framework (DETR). First, to deal with slowness prior in videos, we replace the original Transformer encoder with a boundary attentive module to better capture long-range temporal information. Second, due to the ambiguous temporal boundary and relatively sparse annotations, we present a relaxed matching scheme to relieve the strict criteria of single assignment to each groundtruth. Finally, we devise a three-branch head to further improve the proposal confidence estimation by explicitly predicting its completeness. Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net, on both tasks of temporal action proposal generation and temporal action detection. Moreover, due to its simplicity in design, our framework is more efficient than previous proposal generation methods, without non-maximum suppression post-processing. The code and models are made available at this https URL.</div></details></td>
<td>IIIT5K: 95 SVT: 90.6 IC03: 96.4 IC13: 95.3 IC15: 79.4 SVTP: 84.5 CT80: 87.5 avg: 89.81(avg 为简单平均,计算示例见表后)</td>
<td><a href="https://github.com/S-HuaBomb/MASTER-paddle">快速开始</a></td>
</tr>
<tr>
<td>4</td>
<td><a href="Fourier Contour Embedding for Arbitrary-Shaped Text Detection (thecvf.com)">Fourier Contour Embedding for Arbitrary-Shaped Text Detection</a></td>
<td><details><summary>Abstract</summary><div>One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse text geometry variances. Most of existing methods model text instances in image spatial domain via masks or contour point sequences in the Cartesian or the polar coordinate system. However, the mask representation might lead to expensive post-processing, while the point sequence one may have limited capability to model texts with highly-curved shapes. To tackle these problems, we model text instances in the Fourier domain and propose one novel Fourier Contour Embedding (FCE) method to represent arbitrary shaped text contours as compact signatures. We further construct FCENet with a backbone, feature pyramid networks (FPN) and a simple post-processing with the Inverse Fourier Transformation (IFT) and Non-Maximum Suppression (NMS). Different from previous methods, FCENet first predicts compact Fourier signatures of text instances, and then reconstructs text contours via IFT and NMS during test. Extensive experiments demonstrate that FCE is accurate and robust to fit contours of scene texts even with highly-curved shapes, and also validate the effectiveness and the good generalization of FCENet for arbitrary-shaped text detection. Furthermore, experimental results show that our FCENet is superior to the state-of-the-art (SOTA) methods on CTW1500 and Total-Text, especially on challenging highly-curved text subset.</div></details></td>
<td>ResNet50 + DCNv2 ctw1500 0.851</td>
<td><a href="https://github.com/zhiminzhang0830/FCENet_Paddle">快速开始</a></td>
</tr>
<tr>
<td>5</td>
<td><a href="https://arxiv.org/pdf/2105.04286">Primitive Representation Learning for Scene Text Recognition </a></td>
<td><details><summary>Abstract</summary><div>Scene text recognition is a challenging task due to diverse variations of text instances in natural scene images. Conventional methods based on CNN-RNN-CTC or encoder-decoder with attention mechanism may not fully investigate stable and efficient feature representations for multi-oriented scene texts. In this paper, we propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images. We model elements in feature maps as the nodes of an undirected graph. A pooling aggregator and a weighted aggregator are proposed to learn primitive representations, which are transformed into high-level visual text representations by graph convolutional networks. A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding. Furthermore, by integrating visual text representations into an encoderdecoder model with the 2D attention mechanism, we propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods. Experimental results on both English and Chinese scene text recognition tasks demonstrate that PREN keeps a balance between accuracy and efficiency, while PREN2D achieves state-of-theart performance.</div></details></td>
<td>SynthText+Mjsynth; IIIT5k: 86.03%, SVT: 87.17%, IC03: 95.16%, IC13: 93.93%, IC15: 78.52%, SVTP: 81.71%, CUTE80: 75.69%, avg: 85.5%</td>
<td><a href="https://github.com/developWmark/paddle_PROCR">快速开始</a></td>
</tr>
<tr>
<td>6</td>
<td><a href="https://arxiv.org/pdf/2105.06229">RF-Learning:Reciprocal Feature Learning via Explicit and Implicit Tasks in Scene Text Recognition</a></td>
<td><details><summary>Abstract</summary><div>Text recognition is a popular topic for its broad applications. In this work, we excavate the implicit task, character counting within the traditional text recognition, without additional labor annotation cost. The implicit task plays as an auxiliary branch for complementing the sequential recognition. We design a two-branch reciprocal feature learning framework in order to adequately utilize the features from both the tasks. Through exploiting the complementary effect between explicit and implicit tasks, the feature is reliably enhanced. Extensive experiments on 7 benchmarks show the advantages of the proposed methods in both text recognition and the new-built character counting tasks. In addition, it is convenient yet effective to equip with variable networks and tasks. We offer abundant ablation studies, generalizing experiments with deeper understanding on the tasks. Code is available.</div></details></td>
<td>RF-Learning visual IIIT5K: 96, SVT:94.7 IC03:96.2 IC13:95.9 IC15:88.7 SVTP:86.7 CUTE80:88.2 avg: 92.34</td>
<td><a href="https://github.com/zhiminzhang0830/RFLearning_Paddle">快速开始</a></td>
</tr>
<tr>
<td>7</td>
<td><a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Deep_Relational_Reasoning_Graph_Network_for_Arbitrary_Shape_Text_Detection_CVPR_2020_paper.pdf">Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection</a></td>
<td><details><summary>Abstract</summary><div>Arbitrary shape text detection is a challenging task due to the high variety and complexity of scenes texts. In this paper, we propose a novel unified relational reasoning graph network for arbitrary shape text detection. In our method, an innovative local graph bridges a text proposal model via Convolutional Neural Network (CNN) and a deep relational reasoning network via Graph Convolutional Network (GCN), making our network end-to-end trainable. To be concrete, every text instance will be divided into a series of small rectangular components, and the geometry attributes (e.g., height, width, and orientation) of the small components will be estimated by our text proposal model. Given the geometry attributes, the local graph construction model can roughly establish linkages between different text components. For further reasoning and deducing the likelihood of linkages between the component and its neighbors, we adopt a graph-based network to perform deep relational reasoning on local graphs. Experiments on public available datasets demonstrate the state-of-the-art performance of our method.</div></details></td>
<td>ResNet50  ctw1500 0.840</td>
<td><a href="https://github.com/zhiminzhang0830/DRRG_Paddle">快速开始</a></td>
</tr>
<tr>
<td>8</td>
<td><a href="https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Scene_Text_Telescope_Text-Focused_Scene_Image_Super-Resolution_CVPR_2021_paper.pdf">Scene Text Telescope: Text-Focused Scene Image Super-Resolution</a></td>
<td><details><summary>Abstract</summary><div>Image super-resolution, which is often regarded as a preprocessing procedure of scene text recognition, aims to recover the realistic features from a low-resolution text image. It has always been challenging due to large variations in text shapes, fonts, backgrounds, etc. However, most existing methods employ generic super-resolution frameworks to handle scene text images while ignoring text-specific properties such as text-level layouts and character-level details. In this paper, we establish a text-focused super-resolution framework, called Scene Text Telescope (STT). In terms of text-level layouts, we propose a Transformer-Based Super-Resolution Network (TBSRN) containing a Self-Attention Module to extract sequential information, which is robust to tackle the texts in arbitrary orientations. In terms of character-level details, we propose a Position-Aware Module and a Content-Aware Module to highlight the position and the content of each character. By observing that some characters look indistinguishable in low-resolution conditions, we use a weighted cross-entropy loss to tackle this problem. We conduct extensive experiments, including text recognition with pre-trained recognizers and image quality evaluation, on TextZoom and several scene text recognition benchmarks to assess the super-resolution images. The experimental results show that our STT can indeed generate text-focused super-resolution images and outperform the existing methods in terms of recognition accuracy.</div></details></td>
<td>CRNN+tbsrn,easy: 0.5979, medium: 0.4507, hard: 0.3418, avg: 0.4634</td>
<td><a href="https://github.com/Lieberk/Paddle-TextSR-STT ">快速开始</a></td>
</tr>
<tr>
<td>9</td>
<td><a href="https://arxiv.org/abs/2207.11463">When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition</a></td>
<td><details><summary>Abstract</summary><div>Recently, most handwritten mathematical expression recognition (HMER) methods adopt the encoder-decoder networks, which directly predict the markup sequences from formula images with the attention mechanism. However, such methods may fail to accurately read formulas with complicated structure or generate long markup sequences, as the attention results are often inaccurate due to the large variance of writing styles or spatial layouts. To alleviate this problem, we propose an unconventional network for HMER named Counting-Aware Network (CAN), which jointly optimizes two tasks: HMER and symbol counting. Specifically, we design a weakly-supervised counting module that can predict the number of each symbol class without the symbol-level position annotations, and then plug it into a typical attention-based encoder-decoder model for HMER. Experiments on the benchmark datasets for HMER validate that both joint optimization and counting results are beneficial for correcting the prediction errors of encoder-decoder models, and CAN consistently outperforms the state-of-the-art methods. In particular, compared with an encoder-decoder model for HMER, the extra time cost caused by the proposed counting module is marginal. The source code is available at https://github.com/LBH1024/CAN.</div></details></td>
<td>1. ExpRate=65.89;2. 复现后合入 PaddleOCR 套件,并添加 TIPC</td>
<td><a href="https://github.com/Lllllolita/CAN_Paddle">快速开始</a></td>
</tr>
<tr>
<td>10</td>
<td><a href="https://arxiv.org/pdf/2007.07542.pdf?ref=https://githubhelp.com">RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition</a></td>
<td><details><summary>Abstract</summary><div>The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improvements in recognition quality. However, it performs poorly on contextless texts (e.g., random character sequences) which is unacceptable in most of real application scenarios. In this paper, we first deeply investigate the decoding process of the decoder. We empirically find that a representative character-level sequence decoder utilizes not only context information but also positional information. Contextual information, which the existing approaches heavily rely on, causes the problem of attention drift. To suppress such side-effect, we propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. Specifically, it contains a position aware module to enable the encoder to output feature vectors encoding their own spatial positions, and an attention module to estimate glimpses using the positional clue (i.e., the current decoding time step) only. The dynamic fusion is conducted for more robust feature via an element-wise gate mechanism. Theoretically, our proposed method, dubbed \emph{RobustScanner}, decodes individual characters with dynamic ratio between context and positional clues, and utilizes more positional ones when the decoding sequences with scarce context, and thus is robust and practical. Empirically, it has achieved new state-of-the-art results on popular regular and irregular text recognition benchmarks while without much performance drop on contextless benchmarks, validating its robustness in both contextual and contextless application scenarios.</div></details></td>
<td>IIIT5K: 95.1 SVT:89.2 IC13:93.1 IC15:77.8  SVTP:80.3  CT80:90.3 avg 87.63</td>
<td><a href="https://github.com/smilelite/RobustScanner.paddle">快速开始</a></td>
</tr>
<tr>
<td>11</td>
<td><a href="https://arxiv.org/abs/2005.13117">SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition</a></td>
<td><details><summary>Abstract</summary><div>Arbitrary text appearance poses a great challenge in scene text recognition tasks. Existing works mostly handle with the problem in consideration of the shape distortion, including perspective distortions, line curvature or other style variations. Therefore, methods based on spatial transformers are extensively studied. However, chromatic difficulties in complex scenes have not been paid much attention on. In this work, we introduce a new learnable geometric-unrelated module, the Structure-Preserving Inner Offset Network (SPIN), which allows the color manipulation of source data within the network. This differentiable module can be inserted before any recognition architecture to ease the downstream tasks, giving neural networks the ability to actively transform input intensity rather than the existing spatial rectification. It can also serve as a complementary module to known spatial transformations and work in both independent and collaborative ways with them. Extensive experiments show that the use of SPIN results in a significant improvement on multiple text recognition benchmarks compared to the state-of-the-arts.</div></details></td>
<td>IIIT5K: 94.6, SVT:89, IC03: 93.3, IC13:94.2,IC15:80.7,SVTP:83,CUTE80:84.7,avg: 88.5</td>
<td><a href="https://github.com/smilelite/spin_paddle">快速开始</a></td>
</tr>
</table>
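上表中的文本识别模型通常在 IIIT5K、SVT、IC03、IC13、IC15、SVTP、CUTE80(CT80)等基准上分别汇报准确率,并给出一个 avg 汇总值。以表中 MASTER 一行的数值为例,avg 可按各基准准确率的简单(未加权)平均得到;以下仅作示意,实际复现时也可能按各数据集样本数加权:

```python
# 以表中 MASTER 一行汇报的各基准准确率为例,计算未加权平均(avg)
acc = {
    "IIIT5K": 95.0, "SVT": 90.6, "IC03": 96.4, "IC13": 95.3,
    "IC15": 79.4, "SVTP": 84.5, "CT80": 87.5,
}
avg = sum(acc.values()) / len(acc)
print(f"avg = {avg:.2f}")  # 输出 89.81,与表中 avg 一致
```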
### 图像生成
@@ -567,7 +830,7 @@
<td>6</td>
<td><a href="https://paperswithcode.com/paper/singan-learning-a-generative-model-from-a">SinGAN: Learning a Generative Model from a Single Natural Image</a></td>
<td><details><summary>Abstract</summary><div>We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers. We show that this is suboptimal as the normalization layers tend to ``wash away'' semantic information. To address the issue, we propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned transformation. Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows user control over both semantic and style. Code is available at this https URL.</div></details></td>
<td>任意一张图片;人眼评估生成的图像(可参考论文中展示的生成图片 Figure 6)</td>
<td><a href="https://github.com/icey-zhang/paddle_SinGAN">快速开始</a></td> <td><a href="https://github.com/icey-zhang/paddle_SinGAN">快速开始</a></td>
</tr> </tr>
<tr> <tr>
@@ -647,6 +910,182 @@
<td>DIV2K and Flickr2K and OST; 可视化效果与论文一致</td>
<td><a href="https://github.com/20151001860/Real_ESRGAN_paddle">快速开始</a></td>
</tr>
<tr>
<td>18</td>
<td><a href="https://paperswithcode.com/paper/towards-real-world-blind-face-restoration">GFP-GAN: Towards Real-World Blind Face Restoration with Generative Facial Prior</a></td>
<td><details><summary>Abstract</summary><div>Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer accurate geometric prior while high-quality references are inaccessible, limiting the applicability in real-world scenarios. In this work, we propose GFP-GAN that leverages rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via novel channel-split spatial feature transform layers, which allow our method to achieve a good balance of realness and fidelity. Thanks to the powerful generative facial prior and delicate designs, our GFP-GAN could jointly restore facial details and enhance colors with just a single forward pass, while GAN inversion methods require expensive image-specific optimization at inference. Extensive experiments show that our method achieves superior performance to prior art on both synthetic and real-world datasets.</div></details></td>
<td>CelebA-Test: LPIPS=0.3646, FID=42.62(FID 计算可参考表后示例)</td>
<td><a href="https://github.com/yangshurong/GFP-GAN_paddle/tree/main">快速开始</a></td>
</tr>
<tr>
<td>19</td>
<td><a href="https://paperswithcode.com/paper/aggregated-contextual-transformations-for">Aggregated Contextual Transformations for High-Resolution Image Inpainting</a></td>
<td><details><summary>Abstract</summary><div>State-of-the-art image inpainting approaches can suffer from generating distorted structures and blurry textures in high-resolution images (e.g., 512x512). The challenges mainly drive from (1) image content reasoning from distant contexts, and (2) fine-grained texture synthesis for a large missing region. To overcome these two challenges, we propose an enhanced GAN-based model, named Aggregated COntextual-Transformation GAN (AOT-GAN), for high-resolution image inpainting. Specifically, to enhance context reasoning, we construct the generator of AOT-GAN by stacking multiple layers of a proposed AOT block. The AOT blocks aggregate contextual transformations from various receptive fields, allowing to capture both informative distant image contexts and rich patterns of interest for context reasoning. For improving texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task. Such a training objective forces the discriminator to distinguish the detailed appearances of real and synthesized patches, and in turn, facilitates the generator to synthesize clear textures. Extensive comparisons on Places2, the most challenging benchmark with 1.8 million high-resolution images of 365 complex scenes, show that our model outperforms the state-of-the-art by a significant margin in terms of FID with 38.60% relative improvement. A user study including more than 30 subjects further validates the superiority of AOT-GAN. We further evaluate the proposed AOT-GAN in practical applications, e.g., logo removal, face editing, and object removal. Results show that our model achieves promising completions in the real world. We release code and models in https://github.com/researchmm/AOT-GAN-for-Inpainting.</div></details></td>
<td>Places365-val(20-30% ): PSNR=26.03, SSIM=0.890</td>
<td><a href="https://github.com/ctkindle/AOT_GAN_Paddle">快速开始</a></td>
</tr>
<tr>
<td>20</td>
<td><a href="https://arxiv.org/pdf/1901.09221v3.pdf">Progressive Image Deraining Networks: A Better and Simpler Baseline</a></td>
<td><details><summary>Abstract</summary><div>Along with the deraining performance improvement of deep networks, their structures and learning become more and more complicated and diverse, making it difficult to analyze the contribution of various network modules when developing new deraining networks. To handle this issue, this paper provides a better and simpler baseline deraining network by considering network architecture, input and output, and loss functions. Specifically, by repeatedly unfolding a shallow ResNet, progressive ResNet (PRN) is proposed to take advantage of recursive computation. A recurrent layer is further introduced to exploit the dependencies of deep features across stages, forming our progressive recurrent network (PReNet). Furthermore, intra-stage recursive computation of ResNet can be adopted in PRN and PReNet to notably reduce network parameters with graceful degradation in deraining performance. For network input and output, we take both stage-wise result and original rainy image as input to each ResNet and finally output the prediction of {residual image}. As for loss functions, single MSE or negative SSIM losses are sufficient to train PRN and PReNet. Experiments show that PRN and PReNet perform favorably on both synthetic and real rainy images. Considering its simplicity, efficiency and effectiveness, our models are expected to serve as a suitable baseline in future deraining research. The source codes are available at https://github.com/csdwren/PReNet.</div></details></td>
<td>Rain100H数据集,PReNet模型,psnr=29.46, ssim=0.899</td>
<td><a href="https://github.com/simonsLiang/PReNet_paddle">快速开始</a></td>
</tr>
<tr>
<td>21</td>
<td><a href="https://paperswithcode.com/paper/styleclip-text-driven-manipulation-of">StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery</a></td>
<td><details><summary>Abstract</summary><div>Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images. However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt. Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation. Finally, we present a method for mapping a text prompts to input-agnostic directions in StyleGAN's style space, enabling interactive text-driven image manipulation. Extensive results and comparisons demonstrate the effectiveness of our approaches.</div></details></td>
<td>可视化</td>
<td><a href="https://github.com/ultranity/Paddle-StyleCLIP/">快速开始</a></td>
</tr>
<tr>
<td>22</td>
<td><a href="https://paperswithcode.com/paper/gan-prior-embedded-network-for-blind-face">GAN Prior Embedded Network for Blind Face Restoration in the Wild</a></td>
<td><details><summary>Abstract</summary><div>Blind face restoration (BFR) from severely degraded face images in the wild is a very challenging problem. Due to the high illness of the problem and the complex unknown degradation, directly training a deep neural network (DNN) usually cannot lead to acceptable results. Existing generative adversarial network (GAN) based methods can produce better results but tend to generate over-smoothed restorations. In this work, we propose a new method by first learning a GAN for high-quality face image generation and embedding it into a U-shaped DNN as a prior decoder, then fine-tuning the GAN prior embedded DNN with a set of synthesized low-quality face images. The GAN blocks are designed to ensure that the latent code and noise input to the GAN can be respectively generated from the deep and shallow features of the DNN, controlling the global face structure, local face details and background of the reconstructed image. The proposed GAN prior embedded network (GPEN) is easy-to-implement, and it can generate visually photo-realistic results. Our experiments demonstrated that the proposed GPEN achieves significantly superior results to state-of-the-art BFR methods both quantitatively and qualitatively, especially for the restoration of severely degraded face images in the wild. The source code and models can be found at https://github.com/yangxy/GPEN.</div></details></td>
<td>FID=31.72(CelebA-HQ-val )</td>
<td><a href="https://github.com/bitcjm/GPEN_REPO">快速开始</a></td>
</tr>
</table>
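上表中 GFP-GAN、GPEN 等模型以 FID 作为验收指标之一。FID 先用 Inception 网络分别提取真实图像与生成图像的特征,再对两组特征的均值与协方差求 Fréchet 距离。下面是仅包含后一步的计算示意(基于 NumPy/SciPy,不含特征提取;`frechet_distance` 与示例中的随机特征均为本文假设,并非上述任一仓库的实际实现):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    """由两组特征的均值/协方差计算 FID(示意实现)。"""
    diff = mu1 - mu2
    # 协方差乘积的矩阵平方根;数值不稳定时加上微小扰动再求
    covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
    if not np.isfinite(covmean).all():
        offset = np.eye(sigma1.shape[0]) * eps
        covmean, _ = linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(covmean))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_real = rng.normal(size=(256, 64))             # 假设的真实图像特征
    feats_fake = rng.normal(loc=0.1, size=(256, 64))    # 假设的生成图像特征
    mu1, sigma1 = feats_real.mean(0), np.cov(feats_real, rowvar=False)
    mu2, sigma2 = feats_fake.mean(0), np.cov(feats_fake, rowvar=False)
    print("FID:", frechet_distance(mu1, sigma1, mu2, sigma2))
```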
### 图像修复
<table>
<tr>
<th>序号</th>
<th>论文名称(链接)</th>
<th>摘要</th>
<th>数据集</th>
<th width='10%'>快速开始</th>
</tr>
<tr>
<td>1</td>
<td><a href="https://arxiv.org/pdf/2204.04676">Simple Baselines for Image Restoration</a></td>
<td><details><summary>Abstract</summary><div>Although there have been significant advances in the field of image restoration recently, the system complexity of the state-of-the-art (SOTA) methods is increasing as well, which may hinder the convenient analysis and comparison of methods. In this paper, we propose a simple baseline that exceeds the SOTA methods and is computationally efficient. To further simplify the baseline, we reveal that the nonlinear activation functions, e.g. Sigmoid, ReLU, GELU, Softmax, etc. are not necessary: they could be replaced by multiplication or removed. Thus, we derive a Nonlinear Activation Free Network, namely NAFNet, from the baseline. SOTA results are achieved on various challenging benchmarks, e.g. 33.69 dB PSNR on GoPro (for image deblurring), exceeding the previous SOTA 0.38 dB with only 8.4% of its computational costs; 40.30 dB PSNR on SIDD (for image denoising), exceeding the previous SOTA 0.28 dB with less than half of its computational costs. The code and the pre-trained models are released at https://github.com/megvii-research/NAFNet.</div></details></td>
<td>SIDD PSNR: 40.3045, SSIM:0.9614</td>
<td><a href="https://github.com/Lllllolita/CAN_Paddle">快速开始</a></td>
</tr>
<tr>
<td>2</td>
<td><a href="https://arxiv.org/pdf/2105.06086v1.pdf">HINet: Half Instance Normalization Network for Image Restoration</a></td>
<td><details><summary>Abstract</summary><div>In this paper, we explore the role of Instance Normalization in low-level vision tasks. Specifically, we present a novel block: Half Instance Normalization Block (HIN Block), to boost the performance of image restoration networks. Based on HIN Block, we design a simple and powerful multi-stage network named HINet, which consists of two subnetworks. With the help of HIN Block, HINet surpasses the state-of-the-art (SOTA) on various image restoration tasks. For image denoising, we exceed it 0.11dB and 0.28 dB in PSNR on SIDD dataset, with only 7.5% and 30% of its multiplier-accumulator operations (MACs), 6.8 times and 2.9 times speedup respectively. For image deblurring, we get comparable performance with 22.5% of its MACs and 3.3 times speedup on REDS and GoPro datasets. For image deraining, we exceed it by 0.3 dB in PSNR on the average result of multiple datasets with 1.4 times speedup. With HINet, we won 1st place on the NTIRE 2021 Image Deblurring Challenge - Track2. JPEG Artifacts, with a PSNR of 29.70. The code is available at https://github.com/megvii-model/HINet.</div></details></td>
<td>SIDD PSNR: 39.99, SSIM:0.958</td>
<td><a href="https://github.com/youngAt19/hinet_paddle">快速开始</a></td>
</tr>
<tr>
<td>3</td>
<td><a href="https://arxiv.org/abs/2104.10546">Invertible Denoising Network: A Light Solution for Real Noise Removal</a></td>
<td><details><summary>Abstract</summary><div>Invertible networks have various benefits for image denoising since they are lightweight, information-lossless, and memory-saving during back-propagation. However, applying invertible models to remove noise is challenging because the input is noisy, and the reversed output is clean, following two different distributions. We propose an invertible denoising network, InvDN, to address this challenge. InvDN transforms the noisy input into a low-resolution clean image and a latent representation containing noise. To discard noise and restore the clean image, InvDN replaces the noisy latent representation with another one sampled from a prior distribution during reversion. The denoising performance of InvDN is better than all the existing competitive models, achieving a new state-of-the-art result for the SIDD dataset while enjoying less run time. Moreover, the size of InvDN is far smaller, only having 4.2% of the number of parameters compared to the most recently proposed DANet. Further, via manipulating the noisy latent representation, InvDN is also able to generate noise more similar to the original one. Our code is available at: https://github.com/Yang-Liu1082/InvDN.git.</div></details></td>
<td>SIDD PSNR: 39.28, SSIM:0.955</td>
<td><a href="https://github.com/hnmizuho/InvDN_paddlepaddle">快速开始</a></td>
</tr>
<tr>
<td>4</td>
<td><a href="https://openaccess.thecvf.com/content/ICCV2021W/AIM/papers/Liang_SwinIR_Image_Restoration_Using_Swin_Transformer_ICCVW_2021_paper.pdf">SwinIR: Image Restoration Using Swin Transformer</a></td>
<td><details><summary>Abstract</summary><div>Image restoration is a long-standing low-level vision problem that aims to restore high-quality images from low-quality images (e.g., downscaled, noisy and compressed images). While state-of-the-art image restoration methods are based on convolutional neural networks, few attempts have been made with Transformers which show impressive performance on high-level vision tasks. In this paper, we propose a strong baseline model SwinIR for image restoration based on the Swin Transformer. SwinIR consists of three parts: shallow feature extraction, deep feature extraction and high-quality image reconstruction. In particular, the deep feature extraction module is composed of several residual Swin Transformer blocks (RSTB), each of which has several Swin Transformer layers together with a residual connection. We conduct experiments on three representative tasks: image super-resolution (including classical, lightweight and real-world image super-resolution), image denoising (including grayscale and color image denoising) and JPEG compression artifact reduction. Experimental results demonstrate that SwinIR outperforms state-of-the-art methods on different tasks by up to 0.14∼0.45dB, while the total number of parameters can be reduced by up to 67%.</div></details></td>
<td>CBSD68, average PSNR, noise 15: 34.42</td>
<td><a href="https://github.com/sldyns/SwinIR_paddle">快速开始</a></td>
</tr>
<tr>
<td>5</td>
<td><a href="https://arxiv.org/abs/2101.02824">Neighbor2Neighbor: Self-Supervised Denoising from Single Noisy Images</a></td>
<td><details><summary>Abstract</summary><div>In the last few years, image denoising has benefited a lot from the fast development of neural networks. However, the requirement of large amounts of noisy-clean image pairs for supervision limits the wide use of these models. Although there have been a few attempts in training an image denoising model with only single noisy images, existing self-supervised denoising approaches suffer from inefficient network training, loss of useful information, or dependence on noise modeling. In this paper, we present a very simple yet effective method named Neighbor2Neighbor to train an effective image denoising model with only noisy images. Firstly, a random neighbor sub-sampler is proposed for the generation of training image pairs. In detail, input and target used to train a network are images sub-sampled from the same noisy image, satisfying the requirement that paired pixels of paired images are neighbors and have very similar appearance with each other. Secondly, a denoising network is trained on sub-sampled training pairs generated in the first stage, with a proposed regularizer as additional loss for better performance. The proposed Neighbor2Neighbor framework is able to enjoy the progress of state-of-the-art supervised denoising networks in network architecture design. Moreover, it avoids heavy dependence on the assumption of the noise distribution. We explain our approach from a theoretical perspective and further validate it through extensive experiments, including synthetic experiments with different noise distributions in sRGB space and real-world experiments on a denoising benchmark dataset in raw-RGB space.</div></details></td>
<td>Gaussian 25, BSD300: PSNR: 30.79, SSIM: 0.873(PSNR 计算可参考表后示例)</td>
<td><a href="https://github.com/txyugood/Neighbor2Neighbor_Paddle">快速开始</a></td>
</tr>
<tr>
<td>6</td>
<td><a href="https://arxiv.org/pdf/2111.09881.pdf?ref=https://githubhelp.com">Restormer: Efficient Transformer for High-Resolution Image Restoration</a></td>
<td><details><summary>Abstract</summary><div>Since convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data, these models have been extensively applied to image restoration and related tasks. Recently, another class of neural architectures, Transformers, have shown significant performance gains on natural language and high-level vision tasks. While the Transformer model mitigates the shortcomings of CNNs (i.e., limited receptive field and inadaptability to input content), its computational complexity grows quadratically with the spatial resolution, therefore making it infeasible to apply to most image restoration tasks involving high-resolution images. In this work, we propose an efficient Transformer model by making several key designs in the building blocks (multi-head attention and feed-forward network) such that it can capture long-range pixel interactions, while still remaining applicable to large images. Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks, including image deraining, single-image motion deblurring, defocus deblurring (single-image and dual-pixel data), and image denoising (Gaussian grayscale/color denoising, and real image denoising). The source code and pre-trained models are available at https://github.com/swz30/Restormer.</div></details></td>
<td>CBSD68, average PSNR, noise 15: 34.39</td>
<td><a href="https://github.com/txyugood/Restormer_Paddle">快速开始</a></td>
</tr>
<tr>
<td>7</td>
<td><a href="https://arxiv.org/pdf/1608.03981.pdf">Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising</a></td>
<td><details><summary>Abstract</summary><div>Discriminative model learning for image denoising has been recently attracting considerable attentions due to its favorable denoising performance. In this paper, we take one step forward by investigating the construction of feed-forward denoising convolutional neural networks (DnCNNs) to embrace the progress in very deep architecture, learning algorithm, and regularization method into image denoising. Specifically, residual learning and batch normalization are utilized to speed up the training process as well as boost the denoising performance. Different from the existing discriminative denoising models which usually train a specific model for additive white Gaussian noise (AWGN) at a certain noise level, our DnCNN model is able to handle Gaussian denoising with unknown noise level (i.e., blind Gaussian denoising). With the residual learning strategy, DnCNN implicitly removes the latent clean image in the hidden layers. This property motivates us to train a single DnCNN model to tackle with several general image denoising tasks such as Gaussian denoising, single image super-resolution and JPEG image deblocking. Our extensive experiments demonstrate that our DnCNN model can not only exhibit high effectiveness in several general image denoising tasks, but also be efficiently implemented by benefiting from GPU computing.</div></details></td>
<td>BSD68: average PSNR, noise 15: 31.73</td>
<td><a href="https://github.com/sldyns/DnCNN_paddle">快速开始</a></td>
</tr>
<tr>
<td>8</td>
<td><a href="https://arxiv.org/pdf/2003.06792v2.pdf">Learning Enriched Features for Real Image Restoration and Enhancement</a></td>
<td><details><summary>Abstract</summary><div>With the goal of recovering high-quality image content from its degraded version, image restoration enjoys numerous applications, such as in surveillance, computational photography, medical imaging, and remote sensing. Recently, convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration task. Existing CNN-based methods typically operate either on full-resolution or on progressively low-resolution representations. In the former case, spatially precise but contextually less robust results are achieved, while in the latter case, semantically reliable but spatially less accurate outputs are generated. In this paper, we present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network and receiving strong contextual information from the low-resolution representations. The core of our approach is a multi-scale residual block containing several key elements: (a) parallel multi-resolution convolution streams for extracting multi-scale features, (b) information exchange across the multi-resolution streams, (c) spatial and channel attention mechanisms for capturing contextual information, and (d) attention based multi-scale feature aggregation. In a nutshell, our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details. Extensive experiments on five real image benchmark datasets demonstrate that our method, named as MIRNet, achieves state-of-the-art results for a variety of image processing tasks, including image denoising, super-resolution, and image enhancement. The source code and pre-trained models are available at https://github.com/swz30/MIRNet.</div></details></td>
<td>SIDD PSNR: 39.72, SSIM:0.959</td>
<td><a href="https://github.com/sldyns/MIRNet_paddle">快速开始</a></td>
</tr>
</table>
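上表图像修复 / 去噪模型普遍以 PSNR、SSIM 作为验收指标。下面给出 PSNR 的最小计算示意(基于 NumPy,假设输入为取值范围 [0, 255] 的图像;函数名 `psnr` 为示例自拟,SSIM 实现较复杂,复现时可直接使用 scikit-image 或各套件内置的实现):

```python
import numpy as np

def psnr(img1, img2, data_range=255.0):
    """峰值信噪比:10 * log10(MAX^2 / MSE),MSE 为两幅图像的均方误差。"""
    img1 = img1.astype(np.float64)
    img2 = img2.astype(np.float64)
    mse = np.mean((img1 - img2) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((data_range ** 2) / mse)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    noisy = np.clip(clean + rng.normal(0, 15, clean.shape), 0, 255).astype(np.uint8)
    print(f"PSNR: {psnr(clean, noisy):.2f} dB")
```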
### 异常检测
<table>
<tr>
<th>序号</th>
<th>论文名称(链接)</th>
<th>摘要</th>
<th>数据集</th>
<th width='10%'>快速开始</th>
</tr>
<tr>
<td>1</td>
<td><a href="https://arxiv.org/pdf/2111.09099.pdf">Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection</a></td>
<td><details><summary>Abstract</summary><div>Anomaly detection is commonly pursued as a one-class classification problem, where models can only learn from normal training samples, while being evaluated on both normal and abnormal test samples. Among the successful approaches for anomaly detection, a distinguished category of methods relies on predicting masked information (e.g. patches, future frames, etc.) and leveraging the reconstruction error with respect to the masked information as an abnormality score. Different from related methods, we propose to integrate the reconstruction-based functionality into a novel self-supervised predictive architectural building block. The proposed self-supervised block is generic and can easily be incorporated into various state-of-the-art anomaly detection methods. Our block starts with a convolutional layer with dilated filters, where the center area of the receptive field is masked. The resulting activation maps are passed through a channel attention module. Our block is equipped with a loss that minimizes the reconstruction error with respect to the masked area in the receptive field. We demonstrate the generality of our block by integrating it into several state-of-the-art frameworks for anomaly detection on image and video, providing empirical evidence that shows considerable performance improvements on MVTec AD, Avenue, and ShanghaiTech. We release our code as open source at https://github.com/ristea/sspcab.</div></details></td>
<td>MVTec AD数据集,结合CutPaste方法,3-way detection AUROC 96.1%(AUROC 计算可参考表后示例)</td>
<td><a href="https://github.com/Lieberk/Paddle-SSPCAB">快速开始</a></td>
</tr>
<tr>
<td>2</td>
<td><a href="https://arxiv.org/pdf/2104.04015v1.pdf">CutPaste: Self-Supervised Learning for Anomaly Detection and Localization</a></td>
<td><details><summary>Abstract</summary><div>We aim at constructing a high performance model for defect detection that detects unknown anomalous patterns of an image without anomalous data. To this end, we propose a two-stage framework for building anomaly detectors using normal training data only. We first learn self-supervised deep representations and then build a generative one-class classifier on learned representations. We learn representations by classifying normal data from the CutPaste, a simple data augmentation strategy that cuts an image patch and pastes at a random location of a large image. Our empirical study on MVTec anomaly detection dataset demonstrates the proposed algorithm is general to be able to detect various types of real-world defects. We bring the improvement upon previous arts by 3.1 AUCs when learning representations from scratch. By transfer learning on pretrained representations on ImageNet, we achieve a new state-of-theart 96.6 AUC. Lastly, we extend the framework to learn and extract representations from patches to allow localizing defective areas without annotations during training.</div></details></td>
<td>MVTec AD数据集,3-way detection AUROC 95.2%</td>
<td><a href="https://github.com/liuxtakeoff/cutpaste_paddle">快速开始</a></td>
</tr>
<tr>
<td>3</td>
<td><a href="https://arxiv.org/pdf/2201.10703v2.pdf">Anomaly Detection via Reverse Distillation from One-Class Embedding</a></td>
<td><details><summary>Abstract</summary><div>Knowledge distillation (KD) achieves promising results on the challenging problem of unsupervised anomaly detection (AD).The representation discrepancy of anomalies in the teacher-student (T-S) model provides essential evidence for AD. However, using similar or identical architectures to build the teacher and student models in previous studies hinders the diversity of anomalous representations. To tackle this problem, we propose a novel T-S model consisting of a teacher encoder and a student decoder and introduce a simple yet effective "reverse distillation" paradigm accordingly. Instead of receiving raw images directly, the student network takes teacher model's one-class embedding as input and targets to restore the teacher's multiscale representations. Inherently, knowledge distillation in this study starts from abstract, high-level presentations to low-level features. In addition, we introduce a trainable one-class bottleneck embedding (OCBE) module in our T-S model. The obtained compact embedding effectively preserves essential information on normal patterns, but abandons anomaly perturbations. Extensive experimentation on AD and one-class novelty detection benchmarks shows that our method surpasses SOTA performance, demonstrating our proposed approach's effectiveness and generalizability.</div></details></td>
<td>MVTec AD数据集, 256尺度,wide-resnet50,detection AUROC 98.5%, loc-AUROC 97.8%, loc-PRO 93.9%</td>
<td><a href="https://github.com/renmada/RD4AD-paddle">快速开始</a></td>
</tr>
<tr>
<td>4</td>
<td><a href="https://arxiv.org/pdf/2111.07677.pdf">FastFlow: Unsupervised Anomaly Detection and Localization via 2D Normalizing Flows</a></td>
<td><details><summary>Abstract</summary><div>Unsupervised anomaly detection and localization is crucial to the practical application when collecting and labeling sufficient anomaly data is infeasible. Most existing representation-based approaches extract normal image features with a deep convolutional neural network and characterize the corresponding distribution through non-parametric distribution estimation methods. The anomaly score is calculated by measuring the distance between the feature of the test image and the estimated distribution. However, current methods can not effectively map image features to a tractable base distribution and ignore the relationship between local and global features which are important to identify anomalies. To this end, we propose FastFlow implemented with 2D normalizing flows and use it as the probability distribution estimator. Our FastFlow can be used as a plug-in module with arbitrary deep feature extractors such as ResNet and vision transformer for unsupervised anomaly detection and localization. In training phase, FastFlow learns to transform the input visual feature into a tractable distribution and obtains the likelihood to recognize anomalies in inference phase. Extensive experimental results on the MVTec AD dataset show that FastFlow surpasses previous state-of-the-art methods in terms of accuracy and inference efficiency with various backbone networks. Our approach achieves 99.4% AUC in anomaly detection with high inference efficiency.</div></details></td>
<td>MVTec AD数据集,ResNet18 , image-level AUC  97.9%,pixel-leval AUC 97.2%</td>
<td><a href="https://github.com/lelexx/fastflow_paddle">快速开始</a></td>
</tr>
<tr>
<td>5</td>
<td><a href="https://arxiv.org/pdf/2011.08785v1.pdf">PaDiM: A Patch Distribution Modeling Framework for Anomaly Detection and Localization</a></td>
<td><details><summary>Abstract</summary><div>We present a new framework for Patch Distribution Modeling, PaDiM, to concurrently detect and localize anomalies in images in a one-class learning setting. PaDiM makes use of a pretrained convolutional neural network (CNN) for patch embedding, and of multivariate Gaussian distributions to get a probabilistic representation of the normal class. It also exploits correlations between the different semantic levels of CNN to better localize anomalies. PaDiM outperforms current state-of-the-art approaches for both anomaly detection and localization on the MVTec AD and STC datasets. To match real-world visual industrial inspection, we extend the evaluation protocol to assess performance of anomaly localization algorithms on non-aligned dataset. The state-of-the-art performance and low complexity of PaDiM make it a good candidate for many industrial applications.</div></details></td>
<td>resnet18 mvtec image-level auc 0.891, pixel-level auc: 0.968</td>
<td><a href="https://github.com/CuberrChen/PaDiM-Paddle">快速开始</a></td>
</tr>
<tr>
<td>6</td>
<td><a href="https://arxiv.org/pdf/2103.04257.pdf">Student-Teacher Feature Pyramid Matching for Anomaly Detection</a></td>
<td><details><summary>Abstract</summary><div>Anomaly detection is a challenging task and usually formulated as an one-class learning problem for the unexpectedness of anomalies. This paper proposes a simple yet powerful approach to this issue, which is implemented in the student-teacher framework for its advantages but substantially extends it in terms of both accuracy and efficiency. Given a strong model pre-trained on image classification as the teacher, we distill the knowledge into a single student network with the identical architecture to learn the distribution of anomaly-free images and this one-step transfer preserves the crucial clues as much as possible. Moreover, we integrate the multi-scale feature matching strategy into the framework, and this hierarchical feature matching enables the student network to receive a mixture of multi-level knowledge from the feature pyramid under better supervision, thus allowing to detect anomalies of various sizes. The difference between feature pyramids generated by the two networks serves as a scoring function indicating the probability of anomaly occurring. Due to such operations, our approach achieves accurate and fast pixel-level anomaly detection. Very competitive results are delivered on the MVTec anomaly detection dataset, superior to the state of the art ones.</div></details></td>
<td>resnet18 mvtec image-level auc 0.893, pixel-level auc: 0.951</td>
<td><a href="https://github.com/CuberrChen/STFPM-Paddle">快速开始</a></td>
</tr>
<tr>
<td>7</td>
<td><a href="https://arxiv.org/pdf/2011.11108.pdf">Multiresolution Knowledge Distillation for Anomaly Detection</a></td>
<td><details><summary>Abstract</summary><div>Unsupervised representation learning has proved to be a critical component of anomaly detection/localization in images. The challenges to learn such a representation are two-fold. Firstly, the sample size is not often large enough to learn a rich generalizable representation through conventional techniques. Secondly, while only normal samples are available at training, the learned features should be discriminative of normal and anomalous samples. Here, we propose to use the "distillation" of features at various layers of an expert network, pre-trained on ImageNet, into a simpler cloner network to tackle both issues. We detect and localize anomalies using the discrepancy between the expert and cloner networks' intermediate activation values given the input data. We show that considering multiple intermediate hints in distillation leads to better exploiting the expert's knowledge and more distinctive discrepancy compared to solely utilizing the last layer activation values. Notably, previous methods either fail in precise anomaly localization or need expensive region-based training. In contrast, with no need for any special or intensive training procedure, we incorporate interpretability algorithms in our novel framework for the localization of anomalous regions. Despite the striking contrast between some test datasets and ImageNet, we achieve competitive or significantly superior results compared to the SOTA methods on MNIST, F-MNIST, CIFAR-10, MVTecAD, Retinal-OCT, and two Medical datasets on both anomaly detection and localization.</div></details></td>
<td>mvtec detection AUROC 87.74%, loc AUROC 90.71%</td>
<td><a href="https://github.com/txyugood/Knowledge_Distillation_AD_Paddle">快速开始</a></td>
</tr>
<tr>
<td>8</td>
<td><a href="https://arxiv.org/pdf/2106.08265.pdf">Towards Total Recall in Industrial Anomaly Detection</a></td>
<td><details><summary>Abstract</summary><div>Being able to spot defective parts is a critical component in large-scale industrial manufacturing. A particular challenge that we address in this work is the cold-start problem: fit a model using nominal (non-defective) example images only. While handcrafted solutions per class are possible, the goal is to build systems that work well simultaneously on many different tasks automatically. The best performing approaches combine embeddings from ImageNet models with an outlier detection model. In this paper, we extend on this line of work and propose PatchCore, which uses a maximally representative memory bank of nominal patch-features. PatchCore offers competitive inference times while achieving state-of-the-art performance for both detection and localization. On the challenging, widely used MVTec AD benchmark PatchCore achieves an image-level anomaly detection AUROC score of up to 99.6%, more than halving the error compared to the next best competitor. We further report competitive results on two additional datasets and also find competitive results in the few samples regime. Code: github.com/amazon-research/patchcore-inspection.</div></details></td>
<td>resnet18 mvtec image-level auc 0.973, pixel-level auc: 0.976</td>
<td><a href="https://github.com/ultranity/Anomaly.Paddle">快速开始</a></td>
</tr>
<tr>
<td>9</td>
<td><a href="https://arxiv.org/pdf/2105.14737v1.pdf">Semi-orthogonal Embedding for Efficient Unsupervised Anomaly Segmentation</a></td>
<td><details><summary>Abstract</summary><div>We present the efficiency of semi-orthogonal embedding for unsupervised anomaly segmentation. The multi-scale features from pre-trained CNNs are recently used for the localized Mahalanobis distances with significant performance. However, the increased feature size is problematic to scale up to the bigger CNNs, since it requires the batch-inverse of multi-dimensional covariance tensor. Here, we generalize an ad-hoc method, random feature selection, into semi-orthogonal embedding for robust approximation, cubically reducing the computational cost for the inverse of multi-dimensional covariance tensor. With the scrutiny of ablation studies, the proposed method achieves a new state-of-the-art with significant margins for the MVTec AD, KolektorSDD, KolektorSDD2, and mSTC datasets. The theoretical and empirical analyses offer insights and verification of our straightforward yet cost-effective approach.</div></details></td>
<td>mvtec pro 0.942, roc 0.982</td>
<td><a href="https://github.com/ultranity/Anomaly.Paddle">快速开始</a></td>
</tr>
</table> </table>
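上表第5~9行的异常检测方法(PaDiM、PatchCore、Semi-orthogonal等)大体遵循同一思路:用预训练CNN提取正常样本的patch特征,对每个patch位置的特征分布建模,再用马氏距离等度量给测试图像打异常分数。下面给出一个与任何官方实现无关的numpy最小示意,函数名、patch数量与特征维度均为随意假设,仅用于说明打分流程:

```python
import numpy as np

def fit_patch_gaussians(train_feats, eps=1e-2):
    """train_feats: [N, P, C],N张正常图像、P个patch位置、C维特征。
    对每个patch位置拟合一个多元高斯(均值 + 协方差)。"""
    N, P, C = train_feats.shape
    means = train_feats.mean(axis=0)                         # [P, C]
    covs = np.empty((P, C, C))
    for p in range(P):
        x = train_feats[:, p, :] - means[p]                  # [N, C]
        covs[p] = x.T @ x / max(N - 1, 1) + eps * np.eye(C)  # 加正则项保证可逆
    return means, np.linalg.inv(covs)                        # 预先求逆,便于批量打分

def mahalanobis_scores(test_feats, means, inv_covs):
    """test_feats: [M, P, C],返回每个patch的异常分数 [M, P]。"""
    d = test_feats - means[None]                             # [M, P, C]
    # score = sqrt(d^T Σ^{-1} d)
    return np.sqrt(np.einsum('mpc,pcd,mpd->mp', d, inv_covs, d))

# 用随机数据演示形状:50张正常图、28x28个patch位置、64维特征
train = np.random.randn(50, 28 * 28, 64)
test = np.random.randn(4, 28 * 28, 64)
mu, inv_cov = fit_patch_gaussians(train)
print(mahalanobis_scores(test, mu, inv_cov).shape)           # (4, 784)
```

实际方法还会对特征做降维或半正交投影、并把patch分数插值回原图分辨率得到像素级定位,细节请以上表链接的各复现仓库为准。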
### 人脸识别 ### 人脸识别
...@@ -739,6 +1178,20 @@ ...@@ -739,6 +1178,20 @@
<td>UCF101: 4x16, Top1=96.6%</td> <td>UCF101: 4x16, Top1=96.6%</td>
<td><a href="https://github.com/txyugood/PaddleMVF">快速开始</a></td> <td><a href="https://github.com/txyugood/PaddleMVF">快速开始</a></td>
</tr> </tr>
<tr>
<td>6</td>
<td><a href="https://paperswithcode.com/paper/channel-wise-topology-refinement-graph">Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition</a></td>
<td><details><summary>Abstract</summary><div>Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition. In GCNs, graph topology dominates feature aggregation and therefore is the key to extracting representative features. In this work, we propose a novel Channel-wise Topology Refinement Graph Convolution (CTR-GC) to dynamically learn different topologies and effectively aggregate joint features in different channels for skeleton-based action recognition. The proposed CTR-GC models channel-wise topologies through learning a shared topology as a generic prior for all channels and refining it with channel-specific correlations for each channel. Our refinement method introduces few extra parameters and significantly reduces the difficulty of modeling channel-wise topologies. Furthermore, via reformulating graph convolutions into a unified form, we find that CTR-GC relaxes strict constraints of graph convolutions, leading to stronger representation capability. Combining CTR-GC with temporal modeling modules, we develop a powerful graph convolutional network named CTR-GCN which notably outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.</div></details></td>
<td>-</td>
<td><a href="https://github.com/zpc-666/CTRGCN_Light">快速开始</a></td>
</tr>
<tr>
<td>7</td>
<td><a href="https://arxiv.org/pdf/2104.13586v1.pdf">Revisiting Skeleton-based Action Recognition</a></td>
<td><details><summary>Abstract</summary><div>Human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt graph convolutional networks (GCN) to extract features on top of human skeletons. Despite the positive results shown in previous works, GCN-based methods are subject to limitations in robustness, interoperability, and scalability. In this work, we propose PoseC3D, a new approach to skeleton-based action recognition, which relies on a 3D heatmap stack instead of a graph sequence as the base representation of human skeletons. Compared to GCN-based methods, PoseC3D is more effective in learning spatiotemporal features, more robust against pose estimation noises, and generalizes better in cross-dataset settings. Also, PoseC3D can handle multiple-person scenarios without additional computation cost, and its features can be easily integrated with other modalities at early fusion stages, which provides a great design space to further boost the performance. On four challenging datasets, PoseC3D consistently obtains superior performance, when used alone on skeletons and in combination with the RGB modality.</div></details></td>
<td>UCF101 split1, top1=87.0</td>
<td><a href="https://github.com/txyugood/PaddlePoseC3D">快速开始</a></td>
</tr>
</table> </table>
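上表中CTR-GCN、PoseC3D以及下文的2s-AGCN等骨架动作识别方法,空间建模的基本单元都是在人体关节图上做图卷积:用归一化的邻接矩阵聚合相邻关节的特征,再接线性变换。下面是一个脱离具体论文的numpy示意,骨架连接方式、帧数与特征维度都是玩具假设:

```python
import numpy as np

def normalize_adj(A):
    """对称归一化:D^{-1/2} (A + I) D^{-1/2}。"""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def graph_conv(X, A_norm, W):
    """X: [T, V, C_in] 每帧每个关节的特征;W: [C_in, C_out]。
    先沿骨架邻接聚合邻居特征,再做线性变换,即最朴素的空间图卷积。"""
    return np.einsum('vu,tuc,cd->tvd', A_norm, X, W)

# 玩具骨架:5个关节的链式连接
V = 5
A = np.zeros((V, V))
for i in range(V - 1):
    A[i, i + 1] = A[i + 1, i] = 1
X = np.random.randn(16, V, 3)          # 16帧,每个关节3维坐标
W = np.random.randn(3, 8)
out = graph_conv(X, normalize_adj(A), W)
print(out.shape)                        # (16, 5, 8)
```

CTR-GCN、2s-AGCN等工作在此基础上进一步让邻接(拓扑)按通道或按样本自适应地学习,并叠加时序卷积,具体实现见上表链接。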
### 自然语言处理 ### 自然语言处理
...@@ -939,13 +1392,6 @@ ...@@ -939,13 +1392,6 @@
<td>在GEM-Xsum验证集上, small model BLEU score=9.1; 在TweetQA验证集上, small model BLEU-1/ROUGE-L=65.7/69.7 (见论文table3)</td> <td>在GEM-Xsum验证集上, small model BLEU score=9.1; 在TweetQA验证集上, small model BLEU-1/ROUGE-L=65.7/69.7 (见论文table3)</td>
<td><a href="https://github.com/yoreG123/Paddle-ByT5">快速开始</a></td> <td><a href="https://github.com/yoreG123/Paddle-ByT5">快速开始</a></td>
</tr> </tr>
<tr>
<td>28</td>
<td><a href="https://paperswithcode.com/paper/few-shot-question-answering-by-pretraining">Few-Shot Question Answering by Pretraining Span Selection</a></td>
<td><details><summary>Abstract</summary><div>In several question answering benchmarks, pretrained models have reached human parity through fine-tuning on an order of 100,000 annotated questions and answers. We explore the more realistic few-shot setting, where only a few hundred training examples are available, and observe that standard models perform poorly, highlighting the discrepancy between current pretraining objectives and question answering. We propose a new pretraining scheme tailored for question answering: recurring span selection. Given a passage with multiple sets of recurring spans, we mask in each set all recurring spans but one, and ask the model to select the correct span in the passage for each masked span. Masked spans are replaced with a special token, viewed as a question representation, that is later used during fine-tuning to select the answer span. The resulting model obtains surprisingly good results on multiple benchmarks (e.g., 72.7 F1 on SQuAD with only 128 training examples), while maintaining competitive performance in the high-resource setting.</div></details></td>
<td>SQuAD 1.1验证集, 16 examples F1=54.6, 128 examples F1=72.7, 1024 Examples F1=82.8(见论文Table1)</td>
<td><a href="https://github.com/zhoucz97/Splinter-paddle">快速开始</a></td>
</tr>
</table> </table>
### 多模态 ### 多模态
...@@ -992,6 +1438,87 @@ ...@@ -992,6 +1438,87 @@
<td>VQA val, Q->A 63.8%, QA->R: 67.2%, Q-AR: 43.1%</td> <td>VQA val, Q->A 63.8%, QA->R: 67.2%, Q-AR: 43.1%</td>
<td><a href="https://github.com/KiritoSSR/paddle_r2c">快速开始</a></td> <td><a href="https://github.com/KiritoSSR/paddle_r2c">快速开始</a></td>
</tr> </tr>
<tr>
<td>6</td>
<td><a href="https://openreview.net/forum?id=S1eL4kBYwr">Uniter: Learning universal image-text representations</a></td>
<td><details><summary>Abstract</summary><div>Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2. Code is available at https://github.com/ChenRocks/UNITER.</div></details></td>
<td>IR-flickr30K-R1=73.66</td>
<td><a href="https://github.com/Mixture-of-Rookie/UNITER-Paddle">快速开始</a></td>
</tr>
<tr>
<td>7</td>
<td><a href="https://arxiv.org/pdf/1806.00064.pdf">Efficient Low-rank Multimodal Fusion with Modality-Specific Factors</a></td>
<td><details><summary>Abstract</summary><div>Multimodal research is an emerging field of artificial intelligence, and one of the main research problems in this field is multimodal fusion. The fusion of multimodal data is the process of integrating multiple unimodal representations into one compact multimodal representation. Previous research in this field has exploited the expressiveness of tensors for multimodal representation. However, these methods often suffer from exponential increase in dimensions and in computational complexity introduced by transformation of input into tensor. In this paper, we propose the Low-rank Multimodal Fusion method, which performs multimodal fusion using low-rank tensors to improve efficiency. We evaluate our model on three different tasks: multimodal sentiment analysis, speaker trait analysis, and emotion recognition. Our model achieves competitive results on all these tasks while drastically reducing computational complexity. Additional experiments also show that our model can perform robustly for a wide range of low-rank settings, and is indeed much more efficient in both training and inference compared to other methods that utilize tensor representations.</div></details></td>
<td>F1-Happy 85.8%, F1-Sad 85.9%, F1-Angry 89.0%, F1-Neutral 71.7%</td>
<td><a href="https://github.com/18XiWenjuan/LMF_Paddle">快速开始</a></td>
</tr>
</table>
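上表第7行的LMF用模态专属的低秩因子来近似高阶张量融合:各模态表示先补1、分别经低秩因子投影,再按秩做模态间逐元素相乘并求和。下面是一个与官方代码无关的numpy示意,秩R、隐层维度和各模态维度均为假设值:

```python
import numpy as np

def low_rank_fusion(feats, factors):
    """feats: 各模态特征列表,feats[m] 形状为 [B, d_m];
    factors: 对应的低秩因子列表,factors[m] 形状为 [R, d_m + 1, H]。
    返回融合后的表示 [B, H]:h = Σ_r Π_m ((z_m, 1) · W_m^(r))。"""
    B = feats[0].shape[0]
    fused = None
    for z, W in zip(feats, factors):
        z1 = np.concatenate([z, np.ones((B, 1))], axis=1)   # 补1,保留单模态项
        proj = np.einsum('bd,rdh->rbh', z1, W)              # [R, B, H]
        fused = proj if fused is None else fused * proj     # 模态间逐元素相乘
    return fused.sum(axis=0)                                # 按秩求和 -> [B, H]

# 三个模态(文本/音频/视觉)的玩具数据
B, H, R = 4, 16, 4
feats = [np.random.randn(B, d) for d in (32, 20, 35)]
factors = [np.random.randn(R, d + 1, H) for d in (32, 20, 35)]
print(low_rank_fusion(feats, factors).shape)                # (4, 16)
```

其中补1是张量融合的常用技巧,使融合结果同时包含单模态项与跨模态交互项。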
### 科学计算
<table>
<tr>
<th>序号</th>
<th>论文名称(链接)</th>
<th>摘要</th>
<th>数据集</th>
<th width='10%'>快速开始</th>
</tr>
<tr>
<td>1</td>
<td><a href="https://arxiv.org/pdf/2007.15849.pdf">Solving inverse problems using conditional invertible neural networks</a></td>
<td><details><summary>Abstract</summary><div>Inverse modeling for computing a high-dimensional spatially-varying property field from indirect sparse and noisy observations is a challenging problem. This is due to the complex physical system of interest often expressed in the form of multiscale PDEs, the high-dimensionality of the spatial property of interest, and the incomplete and noisy nature of observations. To address these challenges, we develop a model that maps the given observations to the unknown input field in the form of a surrogate model. This inverse surrogate model will then allow us to estimate the unknown input field for any given sparse and noisy output observations. Here, the inverse mapping is limited to a broad prior distribution of the input field with which the surrogate model is trained. In this work, we construct a two- and three-dimensional inverse surrogate models consisting of an invertible and a conditional neural network trained in an end-to-end fashion with limited training data. The invertible network is developed using a flow-based generative model. The developed inverse surrogate model is then applied for an inversion task of a multiphase flow problem where given the pressure and saturation observations the aim is to recover a high-dimensional non-Gaussian permeability field where the two facies consist of heterogeneous permeability and varying length-scales. For both the two- and three-dimensional surrogate models, the predicted sample realizations of the non-Gaussian permeability field are diverse with the predictive mean being close to the ground truth even when the model is trained with limited data.</div></details></td>
<td>2D/3D模型下得到与Fig16和Fig17相吻合的结果</td>
<td><a href="https://github.com/DrownFish19/inn-surrogate-paddle">快速开始</a></td>
</tr>
<tr>
<td>2</td>
<td><a href="https://arxiv.org/pdf/2111.02801v1.pdf">Gradient-enhanced physics-informed neural networks for forward and inverse PDE problems</a></td>
<td><details><summary>Abstract</summary><div>Deep learning has been shown to be an effective tool in solving partial differential equations (PDEs) through physics-informed neural networks (PINNs). PINNs embed the PDE residual into the loss function of the neural network, and have been successfully employed to solve diverse forward and inverse PDE problems. However, one disadvantage of the first generation of PINNs is that they usually have limited accuracy even with many training points. Here, we propose a new method, gradient-enhanced physics-informed neural networks (gPINNs), for improving the accuracy and training efficiency of PINNs. gPINNs leverage gradient information of the PDE residual and embed the gradient into the loss function. We tested gPINNs extensively and demonstrated the effectiveness of gPINNs in both forward and inverse PDE problems. Our numerical results show that gPINN performs better than PINN with fewer training points. Furthermore, we combined gPINN with the method of residual-based adaptive refinement (RAR), a method for improving the distribution of training points adaptively during training, to further improve the performance of gPINN, especially in PDEs with solutions that have steep gradients.</div></details></td>
<td>完成论文3.2,得到与Fig2、Fig3相吻合的结果;论文3.3,得到与Fig6、Fig7相吻合的结果;论文3.4.1 gPINN,得到与Fig10、Fig11相吻合的结果</td>
<td><a href="https://github.com/tianshao1992/gPINN_paddle">快速开始</a></td>
</tr>
<tr>
<td>3</td>
<td><a href="https://arxiv.org/abs/2004.08826">DeepCFD: Efficient Steady-State Laminar Flow Approximation with Deep Convolutional Neural Networks</a></td>
<td><details><summary>Abstract</summary><div>Computational Fluid Dynamics (CFD) simulation by the numerical solution of the Navier-Stokes equations is an essential tool in a wide range of applications from engineering design to climate modeling. However, the computational cost and memory demand required by CFD codes may become very high for flows of practical interest, such as in aerodynamic shape optimization. This expense is associated with the complexity of the fluid flow governing equations, which include non-linear partial derivative terms that are of difficult solution, leading to long computational times and limiting the number of hypotheses that can be tested during the process of iterative design. Therefore, we propose DeepCFD: a convolutional neural network (CNN) based model that efficiently approximates solutions for the problem of non-uniform steady laminar flows. The proposed model is able to learn complete solutions of the Navier-Stokes equations, for both velocity and pressure fields, directly from ground-truth data generated using a state-of-the-art CFD code. Using DeepCFD, we found a speedup of up to 3 orders of magnitude compared to the standard CFD approach at a cost of low error rates.</div></details></td>
<td>DeepCFD MSE (Ux= 0.773±0.0897,Uy=0.2153±0.0186,P=1.042±0.0431,Total 2.03±0.136)</td>
<td><a href="https://github.com/zbyandmoon/DeepCFD_with_PaddlePaddle">快速开始</a></td>
</tr>
<tr>
<td>4</td>
<td><a href="https://arxiv.org/pdf/2007.15324.pdf">Unsupervised deep learning for super-resolution reconstruction of turbulence</a></td>
<td><details><summary>Abstract</summary><div>Recent attempts to use deep learning for super-resolution reconstruction of turbulent flows have used supervised learning, which requires paired data for training. This limitation hinders more practical applications of super-resolution reconstruction. Therefore, we present an unsupervised learning model that adopts a cycle-consistent generative adversarial network (CycleGAN) that can be trained with unpaired turbulence data for super-resolution reconstruction. Our model is validated using three examples: (i) recovering the original flow field from filtered data using direct numerical simulation (DNS) of homogeneous isotropic turbulence; (ii) reconstructing full-resolution fields using partially measured data from the DNS of turbulent channel flows; and (iii) generating a DNS-resolution flow field from large-eddy simulation (LES) data for turbulent channel flows. In examples (i) and (ii), for which paired data are available for supervised learning, our unsupervised model demonstrates qualitatively and quantitatively similar performance as that of the best supervised learning model. More importantly, in example (iii), where supervised learning is impossible, our model successfully reconstructs the high-resolution flow field of statistical DNS quality from the LES data. Furthermore, we find that the present model has almost universal applicability to all values of Reynolds numbers within the tested range. This demonstrates that unsupervised learning of turbulence data is indeed possible, opening a new door for the wide application of super-resolution reconstruction of turbulent fields.</div></details></td>
<td>可复现三个example中的任意一个;对于example 3,可得到论文中CycleGAN对应的结果(图11、12、13、14、15)</td>
<td><a href="https://github.com/jiamingkong/INFINITY/tree/main/examples/SR_turb_paddle">快速开始</a></td>
</tr>
<tr>
<td>5</td>
<td><a href="https://arxiv.org/pdf/2106.12929.pdf">Lettuce: PyTorch-based Lattice Boltzmann Framework</a></td>
<td><details><summary>Abstract</summary><div>The lattice Boltzmann method (LBM) is an efficient simulation technique for computational fluid mechanics and beyond. It is based on a simple stream-and-collide algorithm on Cartesian grids, which is easily compatible with modern machine learning architectures. While it is becoming increasingly clear that deep learning can provide a decisive stimulus for classical simulation techniques, recent studies have not addressed possible connections between machine learning and LBM. Here, we introduce Lettuce, a PyTorch-based LBM code with a threefold aim. Lettuce enables GPU accelerated calculations with minimal source code, facilitates rapid prototyping of LBM models, and enables integrating LBM simulations with PyTorch's deep learning and automatic differentiation facility. As a proof of concept for combining machine learning with the LBM, a neural collision model is developed, trained on a doubly periodic shear layer and then transferred to a different flow, a decaying turbulence. We also exemplify the added benefit of PyTorch's automatic differentiation framework in flow control and optimization. To this end, the spectrum of a forced isotropic turbulence is maintained without further constraining the velocity field. The source code is freely available from https://github.com/lettucecfd/lettuce.</div></details></td>
<td>得到图1配置的展示结果与图2的曲线吻合 </td>
<td><a href="https://github.com/X4Science/INFINITY/tree/main/examples/lettuce_paddle">快速开始</a></td>
</tr>
<tr>
<td>6</td>
<td><a href="https://arxiv.org/pdf/2202.01723.pdf">Systems Biology: Identifiability analysis and parameter identification via systems-biology informed neural networks</a></td>
<td><details><summary>Abstract</summary><div>The dynamics of systems biological processes are usually modeled by a system of ordinary differential equations (ODEs) with many unknown parameters that need to be inferred from noisy and sparse measurements. Here, we introduce systems-biology informed neural networks for parameter estimation by incorporating the system of ODEs into the neural networks. To complete the workflow of system identification, we also describe structural and practical identifiability analysis to analyze the identifiability of parameters. We use the ultridian endocrine model for glucose-insulin interaction as the example to demonstrate all these methods and their implementation.</div></details></td>
<td>结果吻合Fig13</td>
<td><a href="https://github.com/X4Science/INFINITY/tree/main/examples/sbinn_paddle">快速开始</a></td>
</tr>
<tr>
<td>7</td>
<td><a href="https://arxiv.org/pdf/2012.12106.pdf">TorchMD: A deep learning framework for molecular simulations</a></td>
<td><details><summary>Abstract</summary><div>Molecular dynamics simulations provide a mechanistic description of molecules by relying on empirical potentials. The quality and transferability of such potentials can be improved leveraging data-driven models derived with machine learning approaches. Here, we present TorchMD, a framework for molecular simulations with mixed classical and machine learning potentials. All of force computations including bond, angle, dihedral, Lennard-Jones and Coulomb interactions are expressed as PyTorch arrays and operations. Moreover, TorchMD enables learning and simulating neural network potentials. We validate it using standard Amber all-atom simulations, learning an ab-initio potential, performing an end-to-end training and finally learning and simulating a coarse-grained model for protein folding. We believe that TorchMD provides a useful tool-set to support molecular simulations of machine learning potentials. Code and data are freely available at \url{github.com/torchmd}.</div></details></td>
<td>Di-alanine(688):8 min 44 s;Trypsin(3,248):13 min 2 s</td>
<td><a href="https://github.com/skywalk163/INFINITY/tree/main/examples/PaddleMD">快速开始</a></td>
</tr>
<tr>
<td>8</td>
<td><a href="https://arxiv.org/pdf/2102.04626.pdf">PHYSICS-INFORMED NEURAL NETWORKS WITH HARD CONSTRAINTS FOR INVERSE DESIGN∗</a></td>
<td><details><summary>Abstract</summary><div>We achieve the same objective as conventional PDE-constrained optimization methods based on adjoint methods and numerical PDE solvers, but find that the design obtained from hPINN is often simpler and smoother for problems whose solution is not unique.</div></details></td>
<td>得到图6的结果(the PDE loss is below 10^-4 (Fig. 6A), and the L2 relative error of |E|^2 between hPINN and FDFD for the final ε is 1.2%),并可展示图7</td>
<td><a href="https://github.com/X4Science/INFINITY/tree/main/examples/hPINN4paddle">快速开始</a></td>
</tr>
</table> </table>
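上表中的PINN类工作(如第2行的gPINN)核心都是把PDE/ODE残差写进损失函数,用自动微分求物理量的导数。下面用飞桨动态图给出一个求解 u'(x)=cos(x)、u(0)=0 的普通PINN最小示意(网络结构、采样方式、迭代次数均为随手设定,仅用于说明损失构成);gPINN在此基础上再把残差对输入的梯度 ∂r/∂x 作为额外惩罚项,具体请以上表链接的复现仓库为准:

```python
import paddle

# 一个小的全连接网络近似解 u(x)
net = paddle.nn.Sequential(
    paddle.nn.Linear(1, 32), paddle.nn.Tanh(),
    paddle.nn.Linear(32, 32), paddle.nn.Tanh(),
    paddle.nn.Linear(32, 1))
opt = paddle.optimizer.Adam(learning_rate=1e-3, parameters=net.parameters())

for it in range(2000):
    x = paddle.rand([64, 1]) * 6.283185307179586     # 在[0, 2π]内随机采样配点
    x.stop_gradient = False
    u = net(x)
    du = paddle.grad(u, x, create_graph=True)[0]     # 自动微分求 u'(x)
    residual = du - paddle.cos(x)                    # ODE残差 r = u'(x) - cos(x)
    bc = net(paddle.zeros([1, 1]))                   # 边界条件 u(0) = 0
    loss = (residual ** 2).mean() + (bc ** 2).mean()
    # gPINN会再加一项形如 w * (∂r/∂x)^2 的损失,对高阶自动微分的支持依赖框架版本
    loss.backward()
    opt.step()
    opt.clear_grad()

print(net(paddle.to_tensor([[1.5707963]])).numpy())  # 训练后应接近 sin(π/2)=1
```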
### 推荐系统 ### 推荐系统
...@@ -1069,7 +1596,7 @@ ...@@ -1069,7 +1596,7 @@
<tr> <tr>
<td>10</td> <td>10</td>
<td><a href="http://www.thuir.cn/group/~mzhang/publications/TheWebConf2020-Chenchong.pdf">Efficient Non-Sampling Factorization Machines for Optimal Context-Aware Recommendation </a></td> <td><a href="http://www.thuir.cn/group/~mzhang/publications/TheWebConf2020-Chenchong.pdf">Efficient Non-Sampling Factorization Machines for Optimal Context-Aware Recommendation </a></td>
<td><details><summary>Abstract</summary><div>To provide more accurate recommendation, it is a trending topic to go beyond modeling user-item interactions and take context features into account. Factorization Machines (FM) with negative sampling is a popular solution for context-aware recommendation. However, it is not robust as sampling may lost important information and usually leads to non-optimal performances in practical. Several recent efforts have enhanced FM with deep learning architectures for modelling high-order feature interactions. While they either focus on rating prediction task only, or typically adopt the negative sampling strategy for optimizing the ranking performance. Due to the dramatic fluctuation of sampling, it is reasonable to argue that these sampling-based FM methods are still suboptimal for context-aware recommendation. In this paper, we propose to learn FM without sampling for ranking tasks that helps context-aware recommendation particularly. Despite effectiveness, such a non-sampling strategy presents strong challenge in learning efficiency of the model. Accordingly, we further design a new ideal framework named Efficient Non-Sampling Factorization Machines (ENSFM). ENSFM not only seamlessly connects the relationship between FM and Matrix Factorization (MF), but also resolves the challenging efficiency issue via novel memorization strategies. Through extensive experiments on three real-world public datasets, we show that 1) the proposed ENSFM consistently and significantly outperforms the state-of-the-art methods on context-aware Top-K recommendation, and 2) ENSFM achieves significant advantages in training efficiency, which makes it more applicable to real-world large-scale systems. Moreover, the empirical results indicate that a proper learning method is even more important than advanced neural network structures for Top-K recommendation task. Our implementation has been released to facilitate further developments on efficient non-sampling methods.</div></details></td> <td><details><summary>Abstract</summary><div>To provide more accurate recommendation, it is a trending topic to go beyond modeling user-item interactions and take context features into account. Factorization Machines (FM) with negative sampling is a popular solution for context-aware recommendation. However, it is not robust as sampling may lost important information and usually leads to non-optimal performances in practical. Several recent efforts have enhanced FM with deep learning architectures for modelling high-order feature interactions. While they either focus on rating prediction task only, or typically adopt the negative sampling strategy for optimizing the ranking performance. Due to the dramatic fluctuation of sampling, it is reasonable to argue that these sampling-based FM methods are still suboptimal for context-aware recommendation. In this paper, we propose to learn FM without sampling for ranking tasks that helps context-aware recommendation particularly. Despite effectiveness, such a non-sampling strategy presents strong challenge in learning efficiency of the model. Accordingly, we further design a new ideal framework named Efficient Non-Sampling Factorization Machines (ENSFM). ENSFM not only seamlessly connects the relationship between FM and Matrix Factorization (MF), but also resolves the challenging efficiency issue via novel memorization strategies. Through extensive experiments on three real-world public datasets, we show that 1) the proposed ENSFM consistently and significantly outperforms the state-of-the-art methods on context-aware Top-K recommendation, and 2) ENSFM achieves significant advantages in training efficiency, which makes it more applicable to real-world large-scale systems. Moreover, the empirical results indicate that a proper learning method is even more important than advanced neural network structures for Top-K recommendation task. Our implementation has been released to facilitate further developments on efficient non-sampling methods.</div></details></td>
<td>Movielens: HR@5: 0.0601, HR@10: 0.1024, HR@20: 0.1690 (论文table3)</td> <td>Movielens: HR@5: 0.0601, HR@10: 0.1024, HR@20: 0.1690 (论文table3)</td>
<td><a href="https://github.com/renmada/ENSFM-paddle">快速开始</a></td> <td><a href="https://github.com/renmada/ENSFM-paddle">快速开始</a></td>
</tr> </tr>
...@@ -1115,6 +1642,69 @@ ...@@ -1115,6 +1642,69 @@
<td>AUC: 0.7519, Logloss: 0.3944; </td> <td>AUC: 0.7519, Logloss: 0.3944; </td>
<td><a href="https://github.com/LinJayan/FLEN-Paddle">快速开始</a></td> <td><a href="https://github.com/LinJayan/FLEN-Paddle">快速开始</a></td>
</tr> </tr>
<tr>
<td>17</td>
<td><a href="https://arxiv.org/pdf/2106.05482v2.pdf">Deep Position-wise Interaction Network For CTR Prediction</a></td>
<td><details><summary>Abstract</summary><div>Click-through rate (CTR) prediction plays an important role in online advertising and recommender systems. In practice, the training of CTR models depends on click data which is intrinsically biased towards higher positions since higher position has higher CTR by nature. Existing methods such as actual position training with fixed position inference and inverse propensity weighted training with no position inference alleviate the bias problem to some extend. However, the different treatment of position information between training and inference will inevitably lead to inconsistency and sub-optimal online performance. Meanwhile, the basic assumption of these methods, i.e., the click probability is the product of examination probability and relevance probability, is oversimplified and insufficient to model the rich interaction between position and other information. In this paper, we propose a Deep Position-wise Interaction Network (DPIN) to efficiently combine all candidate items and positions for estimating CTR at each position, achieving consistency between offline and online as well as modeling the deep non-linear interaction among position, user, context and item under the limit of serving performance. Following our new treatment to the position bias in CTR prediction, we propose a new evaluation metrics named PAUC (position-wise AUC) that is suitable for measuring the ranking quality at a given position. Through extensive experiments on a real world dataset, we show empirically that our method is both effective and efficient in solving position bias problem. We have also deployed our method in production and observed statistically significant improvement over a highly optimized baseline in a rigorous A/B test.</div></details></td>
<td>按照论文数据,以DIN模型作为对比基线,AUC应获得相应提升</td>
<td><a href="https://github.com/BamLubi/Paddle-DPIN">快速开始</a></td>
</tr>
<tr>
<td>18</td>
<td><a href="https://arxiv.org/pdf/1810.11921.pdf">AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks</a></td>
<td><details><summary>Abstract</summary><div>Click-through rate (CTR) prediction, which aims to predict the probability of a user clicking on an ad or an item, is critical to many online applications such as online advertising and recommender systems. The problem is very challenging since (1) the input features (e.g., the user id, user age, item id, item category) are usually sparse and high-dimensional, and (2) an effective prediction relies on high-order combinatorial features (\textit{a.k.a.} cross features), which are very time-consuming to hand-craft by domain experts and are impossible to be enumerated. Therefore, there have been efforts in finding low-dimensional representations of the sparse and high-dimensional raw features and their meaningful combinations. In this paper, we propose an effective and efficient method called the \emph{AutoInt} to automatically learn the high-order feature interactions of input features. Our proposed algorithm is very general, which can be applied to both numerical and categorical input features. Specifically, we map both the numerical and categorical features into the same low-dimensional space. Afterwards, a multi-head self-attentive neural network with residual connections is proposed to explicitly model the feature interactions in the low-dimensional space. With different layers of the multi-head self-attentive neural networks, different orders of feature combinations of input features can be modeled. The whole model can be efficiently fit on large-scale raw data in an end-to-end fashion. Experimental results on four real-world datasets show that our proposed approach not only outperforms existing state-of-the-art approaches for prediction but also offers good explainability. Code is available at: \url{https://github.com/DeepGraphLearning/RecommenderSystems}.</div></details></td>
<td>验收标准:1. 按照论文数据,Criteo数据集AUC 80.61%;2. 复现后合入PaddleRec套件,并添加TIPC</td>
<td><a href="https://github.com/kafaichan/paddle-autoint">快速开始</a></td>
</tr>
<tr>
<td>19</td>
<td><a href="https://arxiv.org/pdf/2104.10083.pdf">Personalized News Recommendation with Knowledge-aware Interactive Matching</a></td>
<td><details><summary>Abstract</summary><div>The most important task in personalized news recommendation is accurate matching between candidate news and user interest. Most of existing news recommendation methods model candidate news from its textual content and user interest from their clicked news in an independent way. However, a news article may cover multiple aspects and entities, and a user usually has different kinds of interest. Independent modeling of candidate news and user interest may lead to inferior matching between news and users. In this paper, we propose a knowledge-aware interactive matching method for news recommendation. Our method interactively models candidate news and user interest to facilitate their accurate matching. We design a knowledge-aware news co-encoder to interactively learn representations for both clicked news and candidate news by capturing their relatedness in both semantic and entities with the help of knowledge graphs. We also design a user-news co-encoder to learn candidate news-aware user interest representation and user-aware candidate news representation for better interest matching. Experiments on two real-world datasets validate that our method can effectively improve the performance of news recommendation.</div></details></td>
<td>AUC 67.13,MRR 32.08,NDCG@5 35.49,NDCG@10 41.79</td>
<td><a href="https://github.com/renmada/PaddleRec/tree/kim">快速开始</a></td>
</tr>
<tr>
<td>20</td>
<td><a href="https://dl.acm.org/doi/10.1145/3404835.3462841">Package Recommendation with Intra- and Inter-Package Attention Networks</a></td>
<td><details><summary>Abstract</summary><div>With the booming of online social networks in the mobile internet, an emerging recommendation scenario has played a vital role in information acquisition for user, where users are no longer recommended with a single item or item list, but a combination of heterogeneous and diverse objects (called a package, e.g., a package including news, publisher, and friends viewing the news). Different from the conventional recommendation where users are recommended with the item itself, in package recommendation, users would show great interests on the explicitly displayed objects that could have a significant influence on the user behaviors. However, to the best of our knowledge, few effort has been made for package recommendation and existing approaches can hardly model the complex interactions of diverse objects in a package. Thus, in this paper, we make a first study on package recommendation and propose an Intra- and inter-package attention network for Package Recommendation (IPRec). Specifically, for package modeling, an intra-package attention network is put forward to capture the object-level intention of user interacting with the package, while an inter-package attention network acts as a package-level information encoder that captures collaborative features of neighboring packages. In addition, to capture users preference representation, we present a user preference learner equipped with a fine-grained feature aggregation network and coarse-grained package aggregation network. Extensive experiments on three real-world datasets demonstrate that IPRec significantly outperforms the state of the arts. Moreover, the model analysis demonstrates the interpretability of our IPRec and the characteristics of user behaviors. Codes and datasets can be obtained at https://github.com/LeeChenChen/IPRec.</div></details></td>
<td>论文搜集的真实数据集,3-day AUC:0.6691,5-day AUC:0.6754,10-day AUC:0.6853</td>
<td><a href="https://github.com/renmada/PaddleRec/tree/iprec/models/rank/iprec">快速开始</a></td>
</tr>
<tr>
<td>21</td>
<td><a href="https://arxiv.org/pdf/2105.08489v2.pdf">Modeling the Sequential Dependence among Audience Multi-step Conversions with Multi-task Learning in Targeted Display Advertising</a></td>
<td><details><summary>Abstract</summary><div>In most real-world large-scale online applications (e.g., e-commerce or finance), customer acquisition is usually a multi-step conversion process of audiences. For example, an impression->click->purchase process is usually performed of audiences for e-commerce platforms. However, it is more difficult to acquire customers in financial advertising (e.g., credit card advertising) than in traditional advertising. On the one hand, the audience multi-step conversion path is longer. On the other hand, the positive feedback is sparser (class imbalance) step by step, and it is difficult to obtain the final positive feedback due to the delayed feedback of activation. Multi-task learning is a typical solution in this direction. While considerable multi-task efforts have been made in this direction, a long-standing challenge is how to explicitly model the long-path sequential dependence among audience multi-step conversions for improving the end-to-end conversion. In this paper, we propose an Adaptive Information Transfer Multi-task (AITM) framework, which models the sequential dependence among audience multi-step conversions via the Adaptive Information Transfer (AIT) module. The AIT module can adaptively learn what and how much information to transfer for different conversion stages. Besides, by combining the Behavioral Expectation Calibrator in the loss function, the AITM framework can yield more accurate end-to-end conversion identification. The proposed framework is deployed in Meituan app, which utilizes it to real-timely show a banner to the audience with a high end-to-end conversion rate for Meituan Co-Branded Credit Cards. Offline experimental results on both industrial and public real-world datasets clearly demonstrate that the proposed framework achieves significantly better performance compared with state-of-the-art baselines.</div></details></td>
<td>AUC:0.6043 purchase AUC:0.6525</td>
<td><a href="https://github.com/renmada/PaddleRec/tree/aitm/models/rank/aitm">快速开始</a></td>
</tr>
<tr>
<td>22</td>
<td><a href="https://arxiv.org/pdf/2008.00404v6.pdf">Detecting Beneficial Feature Interactions for Recommender Systems</a></td>
<td><details><summary>Abstract</summary><div>Feature interactions are essential for achieving high accuracy in recommender systems. Many studies take into account the interaction between every pair of features. However, this is suboptimal because some feature interactions may not be that relevant to the recommendation result, and taking them into account may introduce noise and decrease recommendation accuracy. To make the best out of feature interactions, we propose a graph neural network approach to effectively model them, together with a novel technique to automatically detect those feature interactions that are beneficial in terms of recommendation accuracy. The automatic feature interaction detection is achieved via edge prediction with an L0 activation regularization. Our proposed model is proved to be effective through the information bottleneck principle and statistical interaction theory. Experimental results show that our model (i) outperforms existing baselines in terms of accuracy, and (ii) automatically identifies beneficial feature interactions.</div></details></td>
<td>Movielens;AUC:0.9407,ACC:0.8970</td>
<td><a href="https://github.com/BamLubi/Paddle-SIGN">快速开始</a></td>
</tr>
<tr>
<td>23</td>
<td><a href="https://arxiv.org/pdf/1905.06482v1.pdf">Deep Session Interest Network for Click-Through Rate Prediction</a></td>
<td><details><summary>Abstract</summary><div>Easy-to-use, Modular and Extendible package of deep-learning based CTR models. DeepFM, DeepInterestNetwork(DIN), DeepInterestEvolutionNetwork(DIEN), DeepCrossNetwork(DCN), AttentionalFactorizationMachine(AFM), Neural Factorization Machine(NFM), AutoInt, Deep Session Interest Network(DSIN)</div></details></td>
<td>advertising-challenge-dataset logloss; AUC > 0.63</td>
<td><a href="https://github.com/Li-fAngyU/DSIN_paddle">快速开始</a></td>
</tr>
<tr>
<td>24</td>
<td><a href="https://arxiv.org/pdf/2105.14688.pdf">Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising</a></td>
<td><details><summary>Abstract</summary><div>In recommender systems and advertising platforms, marketers always want to deliver products, contents, or advertisements to potential audiences over media channels such as display, video, or social. Given a set of audiences or customers (seed users), the audience expansion technique (look-alike modeling) is a promising solution to identify more potential audiences, who are similar to the seed users and likely to finish the business goal of the target campaign. However, look-alike modeling faces two challenges: (1) In practice, a company could run hundreds of marketing campaigns to promote various contents within completely different categories every day, e.g., sports, politics, society. Thus, it is difficult to utilize a common method to expand audiences for all campaigns. (2) The seed set of a certain campaign could only cover limited users. Therefore, a customized approach based on such a seed set is likely to be overfitting. In this paper, to address these challenges, we propose a novel two-stage framework named Meta Hybrid Experts and Critics (MetaHeac) which has been deployed in WeChat Look-alike System. In the offline stage, a general model which can capture the relationships among various tasks is trained from a meta-learning perspective on all existing campaign tasks. In the online stage, for a new campaign, a customized model is learned with the given seed set based on the general model. According to both offline and online experiments, the proposed MetaHeac shows superior effectiveness for both content marketing campaigns in recommender systems and advertising campaigns in advertising platforms. Besides, MetaHeac has been successfully deployed in WeChat for the promotion of both contents and advertisements, leading to great improvement in the quality of marketing. The code has been available at \url{https://github.com/easezyc/MetaHeac}.</div></details></td>
<td>AUC>=0.7239</td>
<td><a href="https://github.com/simuler/MetaHeac">快速开始</a></td>
</tr>
<tr>
<td>25</td>
<td><a href="https://arxiv.org/pdf/1904.04447v1.pdf">Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction</a></td>
<td><details><summary>Abstract</summary><div>Easy-to-use, Modular and Extendible package of deep-learning based CTR models. DeepFM, DeepInterestNetwork(DIN), DeepInterestEvolutionNetwork(DIEN), DeepCrossNetwork(DCN), AttentionalFactorizationMachine(AFM), Neural Factorization Machine(NFM), AutoInt, Deep Session Interest Network(DSIN)</div></details></td>
<td>AUC:80.22%,Log Loss:0.5388</td>
<td><a href="https://github.com/chenjiyan2001/Paddle-fgcnn">快速开始</a></td>
</tr>
</table> </table>
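上表中ENSFM、AutoInt、FGCNN、FLEN等CTR模型都建立在「特征域embedding + 二阶及更高阶特征交互」之上,其中最经典的FM二阶交互可以借助恒等式 Σ_{i<j}&lt;v_i, v_j&gt; = 0.5 * ((Σ_i v_i)^2 − Σ_i v_i^2) 在线性时间内算出。下面给出一个numpy示意,并与朴素的两两内积实现对照验证,特征域数量与embedding维度均为假设值:

```python
import numpy as np

def fm_second_order(emb):
    """emb: [B, F, K],B个样本、F个特征域、每域一个K维embedding(已乘上特征值)。
    返回FM二阶交互项 [B],即 Σ_{i<j} <v_i, v_j>。"""
    sum_sq = emb.sum(axis=1) ** 2          # (Σ_i v_i)^2,逐维
    sq_sum = (emb ** 2).sum(axis=1)        # Σ_i v_i^2,逐维
    return 0.5 * (sum_sq - sq_sum).sum(axis=1)

# 与 O(F^2) 的朴素实现对比验证
B, F, K = 2, 6, 4
emb = np.random.randn(B, F, K)
naive = sum((emb[:, i] * emb[:, j]).sum(axis=1)
            for i in range(F) for j in range(i + 1, F))
print(np.allclose(fm_second_order(emb), naive))   # True
```

AutoInt、FGCNN等方法把这一交互部分替换为自注意力或卷积生成的新特征,整体「embedding + 交互 + 预测」的结构不变。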
### 其他 ### 其他
...@@ -1259,4 +1849,151 @@ ...@@ -1259,4 +1849,151 @@
<td>IMDb测试集error rates=4.6%, TREC-6测试集error rates=3.6% , AG’s News测试集 error rates=5.01%(见论文Table 2 & Table 3)</td> <td>IMDb测试集error rates=4.6%, TREC-6测试集error rates=3.6% , AG’s News测试集 error rates=5.01%(见论文Table 2 & Table 3)</td>
<td><a href="https://github.com/PaddlePaddle/PaddleVideo">快速开始</a></td> <td><a href="https://github.com/PaddlePaddle/PaddleVideo">快速开始</a></td>
</tr> </tr>
<tr>
<td>20</td>
<td><a href="https://paperswithcode.com/paper/paconv-position-adaptive-convolution-with">PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds</a></td>
<td><details><summary>Abstract</summary><div>We introduce Position Adaptive Convolution (PAConv), a generic convolution operation for 3D point cloud processing. The key of PAConv is to construct the convolution kernel by dynamically assembling basic weight matrices stored in Weight Bank, where the coefficients of these weight matrices are self-adaptively learned from point positions through ScoreNet. In this way, the kernel is built in a data-driven manner, endowing PAConv with more flexibility than 2D convolutions to better handle the irregular and unordered point cloud data. Besides, the complexity of the learning process is reduced by combining weight matrices instead of brutally predicting kernels from point positions. Furthermore, different from the existing point convolution operators whose network architectures are often heavily engineered, we integrate our PAConv into classical MLP-based point cloud pipelines without changing network configurations. Even built on simple networks, our method still approaches or even surpasses the state-of-the-art models, and significantly improves baseline performance on both classification and segmentation tasks, yet with decent efficiency. Thorough ablation studies and visualizations are provided to understand PAConv. Code is released on https://github.com/CVMI-Lab/PAConv.</div></details></td>
<td>Classification accuracy (%) on ModelNet40: 93.9</td>
<td><a href="https://github.com/txyugood/PaddlePAConv">快速开始</a></td>
</tr>
<tr>
<td>21</td>
<td><a href="https://arxiv.org/abs/2011.09766">Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery</a></td>
<td><details><summary>Abstract</summary><div>Geospatial object segmentation, as a particular semantic segmentation task, always faces with larger-scale variation, larger intra-class variance of background, and foreground-background imbalance in the high spatial resolution (HSR) remote sensing imagery. However, general semantic segmentation methods mainly focus on scale variation in the natural scene, with inadequate consideration of the other two problems that usually happen in the large area earth observation scene. In this paper, we argue that the problems lie on the lack of foreground modeling and propose a foreground-aware relation network (FarSeg) from the perspectives of relation-based and optimization-based foreground modeling, to alleviate the above two problems. From perspective of relation, FarSeg enhances the discrimination of foreground features via foreground-correlated contexts associated by learning foreground-scene relation. Meanwhile, from perspective of optimization, a foreground-aware optimization is proposed to focus on foreground examples and hard examples of background during training for a balanced optimization. The experimental results obtained using a large scale dataset suggest that the proposed method is superior to the state-of-the-art general semantic segmentation methods and achieves a better trade-off between speed and accuracy. Code has been made available at: \url{this https URL}.</div></details></td>
<td>数据集:iSAID。验收指标:1. FarSeg ResNet-50 mIoU=63.71%(参考论文Table 6);2. 日志中包含周期validation和损失结果;3. 复现后合入PaddleRS</td>
<td><a href="https://github.com/ucsk/FarSeg">快速开始</a></td>
</tr>
<tr>
<td>22</td>
<td><a href="https://arxiv.org/abs/2205.09443">PYSKL: Towards Good Practices for Skeleton Action Recognition</a></td>
<td><details><summary>Abstract</summary><div>We present PYSKL: an open-source toolbox for skeleton-based action recognition based on PyTorch. The toolbox supports a wide variety of skeleton action recognition algorithms, including approaches based on GCN and CNN. In contrast to existing open-source skeleton action recognition projects that include only one or two algorithms, PYSKL implements six different algorithms under a unified framework with both the latest and original good practices to ease the comparison of efficacy and efficiency. We also provide an original GCN-based skeleton action recognition model named ST-GCN++, which achieves competitive recognition performance without any complicated attention schemes, serving as a strong baseline. Meanwhile, PYSKL supports the training and testing of nine skeleton-based action recognition benchmarks and achieves state-of-the-art recognition performance on eight of them. To facilitate future research on skeleton action recognition, we also provide a large number of trained models and detailed benchmark results to give some insights. PYSKL is released at https://github.com/kennymckormick/pyskl and is actively maintained. We will update this report when we add new features or benchmarks. The current version corresponds to PYSKL v0.2.</div></details></td>
<td>1. joint top1=97.4;2. 复现后合入PaddleVideo套件,并添加TIPC</td>
<td><a href="https://github.com/txyugood/PaddleVideo/tree/stgcn++">快速开始</a></td>
</tr>
<tr>
<td>23</td>
<td><a href="https://www.mdpi.com/2072-4292/12/10/1662">STANet for remote sensing image change detection</a></td>
<td><details><summary>Abstract</summary><div>Remote sensing image change detection (CD) is done to identify desired significant changes between bitemporal images. Given two co-registered images taken at different times, the illumination variations and misregistration errors overwhelm the real object changes. Exploring the relationships among different spatial–temporal pixels may improve the performances of CD methods. In our work, we propose a novel Siamese-based spatial–temporal attention neural network. In contrast to previous methods that separately encode the bitemporal images without referring to any useful spatial–temporal dependency, we design a CD self-attention mechanism to model the spatial–temporal relationships. We integrate a new CD self-attention module in the procedure of feature extraction. Our self-attention module calculates the attention weights between any two pixels at different times and positions and uses them to generate more discriminative features. Considering that the object may have different scales, we partition the image into multi-scale subregions and introduce the self-attention in each subregion. In this way, we could capture spatial–temporal dependencies at various scales, thereby generating better representations to accommodate objects of various sizes. We also introduce a CD dataset LEVIR-CD, which is two orders of magnitude larger than other public datasets of this field. LEVIR-CD consists of a large set of bitemporal Google Earth images, with 637 image pairs (1024 × 1024) and over 31 k independently labeled change instances. Our proposed attention module improves the F1-score of our baseline model from 83.9 to 87.3 with acceptable computational overhead. Experimental results on a public remote sensing image CD dataset show our method outperforms several other state-of-the-art methods.</div></details></td>
<td>数据集:LEVIR-CD。验收指标:1. STANet-PAM F1=87.3%(参考论文Table 4);2. 日志中包含周期validation和损失结果;3. 复现后合入PaddleRS</td>
<td><a href="https://github.com/sun222/STANET_Paddle">快速开始</a></td>
</tr>
<tr>
<td>24</td>
<td><a href="https://ieeexplore.ieee.org/document/9355573">SNUNet-CD: A Densely Connected Siamese Network for Change Detection of VHR Images</a></td>
<td><details><summary>Abstract</summary><div>Change detection is an important task in remote sensing (RS) image analysis. It is widely used in natural disaster monitoring and assessment, land resource planning, and other fields. As a pixel-to-pixel prediction task, change detection is sensitive about the utilization of the original position information. Recent change detection methods always focus on the extraction of deep change semantic feature, but ignore the importance of shallow-layer information containing high-resolution and fine-grained features, this often leads to the uncertainty of the pixels at the edge of the changed target and the determination miss of small targets. In this letter, we propose a densely connected siamese network for change detection, namely SNUNet-CD (the combination of Siamese network and NestedUNet). SNUNet-CD alleviates the loss of localization information in the deep layers of neural network through compact information transmission between encoder and decoder, and between decoder and decoder. In addition, Ensemble Channel Attention Module (ECAM) is proposed for deep supervision. Through ECAM, the most representative features of different semantic levels can be refined and used for the final classification. Experimental results show that our method improves greatly on many evaluation criteria and has a better tradeoff between accuracy and calculation amount than other state-of-the-art (SOTA) change detection methods.</div></details></td>
<td>数据集:CDD(https://drive.google.com/file/d/1GX656JqqOyBi_Ef0w65kDGVto-nHrNs9/edit)。验收指标:1. SNUNet-c32 F1-Score=95.3%(参考 https://paperswithcode.com/sota/change-detection-for-remote-sensing-images-on);2. 日志中包含周期validation和损失结果;3. 复现后合入PaddleRS</td>
<td><a href="https://github.com/kongdebug/SNUNet-Paddle">快速开始</a></td>
</tr>
<tr>
<td>25</td>
<td><a href="https://arxiv.org/pdf/2103.00208.pdf">Remote Sensing Image Change Detection with Transformers</a></td>
<td><details><summary>Abstract</summary><div>Modern change detection (CD) has achieved remarkable success by the powerful discriminative ability of deep convolutions. However, high-resolution remote sensing CD remains challenging due to the complexity of objects in the scene. Objects with the same semantic concept may show distinct spectral characteristics at different times and spatial locations. Most recent CD pipelines using pure convolutions are still struggling to relate long-range concepts in space-time. Non-local self-attention approaches show promising performance via modeling dense relations among pixels, yet are computationally inefficient. Here, we propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain. Our intuition is that the high-level concepts of the change of interest can be represented by a few visual words, i.e., semantic tokens. To achieve this, we express the bitemporal image into a few tokens, and use a transformer encoder to model contexts in the compact token-based space-time. The learned context-rich tokens are then feedback to the pixel-space for refining the original features via a transformer decoder. We incorporate BIT in a deep feature differencing-based CD framework. Extensive experiments on three CD datasets demonstrate the effectiveness and efficiency of the proposed method. Notably, our BIT-based model significantly outperforms the purely convolutional baseline using only 3 times lower computational costs and model parameters. Based on a naive backbone (ResNet18) without sophisticated structures (e.g., FPN, UNet), our model surpasses several state-of-the-art CD methods, including better than four recent attention-based methods in terms of efficiency and accuracy. Our code is available at https://github.com/justchenhao/BIT\_CD.</div></details></td>
<td>数据集:LEVIR-CD。验收指标:1. BIT F1=89.31%(参考论文Table 1);2. 日志中包含周期validation和损失结果;3. 复现后合入PaddleRS</td>
<td><a href="https://github.com/kongdebug/BIT-CD-Paddle">快速开始</a></td>
</tr>
<tr>
<td>26</td>
<td><a href="https://arxiv.org/pdf/1805.07694v3.pdf">Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition</a></td>
<td><details><summary>Abstract</summary><div>In skeleton-based action recognition, graph convolutional networks (GCNs), which model the human body skeletons as spatiotemporal graphs, have achieved remarkable performance. However, in existing GCN-based methods, the topology of the graph is set manually, and it is fixed over all layers and input samples. This may not be optimal for the hierarchical GCN and diverse samples in action recognition tasks. In addition, the second-order information (the lengths and directions of bones) of the skeleton data, which is naturally more informative and discriminative for action recognition, is rarely investigated in existing methods. In this work, we propose a novel two-stream adaptive graph convolutional network (2s-AGCN) for skeleton-based action recognition. The topology of the graph in our model can be either uniformly or individually learned by the BP algorithm in an end-to-end manner. This data-driven method increases the flexibility of the model for graph construction and brings more generality to adapt to various data samples. Moreover, a two-stream framework is proposed to model both the first-order and the second-order information simultaneously, which shows notable improvement for the recognition accuracy. Extensive experiments on the two large-scale datasets, NTU-RGBD and Kinetics-Skeleton, demonstrate that the performance of our model exceeds the state-of-the-art with a significant margin.</div></details></td>
<td>NTU-RGBD数据集, X-Sub=88.5%, X-view=95.1%</td>
<td><a href="https://github.com/ELKYang/2s-AGCN-paddle">快速开始</a></td>
</tr>
<tr>
<td>27</td>
<td><a href="https://ieeexplore.ieee.org/abstract/document/9416235">MLDA-Net: Multi-Level Dual Attention-BasedNetwork for Self-Supervised MonocularDepth Estimation</a></td>
<td><details><summary>Abstract</summary><div>The success of supervised learning-based single image depth estimation methods critically depends on the availability of large-scale dense per-pixel depth annotations, which requires both laborious and expensive annotation process. Therefore, the self-supervised methods are much desirable, which attract significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with shaper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy which can learn rich hierarchical representation. Then, a dual-attention strategy, combining global attention and structure attention, is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to conduct effective supervision for self-supervised depth estimation. Experimental results demonstrate that our MLDA-Net framework achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation with different input modes and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of our proposed approach.</div></details></td>
<td>KITTI数据集,ResNet18,RMSE=4.690</td>
<td><a href="https://github.com/bitcjm/MLDA-Net-repo">快速开始</a></td>
</tr>
<tr>
<td>28</td>
<td><a href="https://paperswithcode.com/paper/ffa-net-feature-fusion-attention-network-for">FFA-Net: Feature Fusion Attention Network for Single Image Dehazing</a></td>
<td><details><summary>Abstract</summary><div>In this paper, we propose an end-to-end feature fusion attention network (FFA-Net) to directly restore the haze-free image. The FFA-Net architecture consists of three key components: 1) A novel Feature Attention (FA) module combines Channel Attention with Pixel Attention mechanism, considering that different channel-wise features contain totally different weighted information and haze distribution is uneven on the different image pixels. FA treats different features and pixels unequally, which provides additional flexibility in dealing with different types of information, expanding the representational ability of CNNs. 2) A basic block structure consists of Local Residual Learning and Feature Attention, Local Residual Learning allowing the less important information such as thin haze region or low-frequency to be bypassed through multiple local residual connections, let main network architecture focus on more effective information. 3) An Attention-based different levels Feature Fusion (FFA) structure, the feature weights are adaptively learned from the Feature Attention (FA) module, giving more weight to important features. This structure can also retain the information of shallow layers and pass it into deep layers. The experimental results demonstrate that our proposed FFA-Net surpasses previous state-of-the-art single image dehazing methods by a very large margin both quantitatively and qualitatively, boosting the best published PSNR metric from 30.23db to 36.39db on the SOTS indoor test dataset. Code has been made available at GitHub.</div></details></td>
<td>-</td>
<td><a href="https://github.com/bitcjm/PaddleVideo/tree/new_branch">快速开始</a></td>
</tr>
<tr>
<td>29</td>
<td><a href="https://paperswithcode.com/paper/you-only-watch-once-a-unified-cnn">YOWO: You only watch once: A unified cnn architecture for real-time spatiotemporal action localization</a></td>
<td><details><summary>Abstract</summary><div>Spatiotemporal action localization requires the incorporation of two sources of information into the designed architecture: (1) temporal information from the previous frames and (2) spatial information from the key frame. Current state-of-the-art approaches usually extract these information with separate networks and use an extra mechanism for fusion to get detections. In this work, we present YOWO, a unified CNN architecture for real-time spatiotemporal action localization in video streams. YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation. Since the whole architecture is unified, it can be optimized end-to-end. The YOWO architecture is fast providing 34 frames-per-second on 16-frames input clips and 62 frames-per-second on 8-frames input clips, which is currently the fastest state-of-the-art architecture on spatiotemporal action localization task. Remarkably, YOWO outperforms the previous state-of-the art results on J-HMDB-21 and UCF101-24 with an impressive improvement of ~3% and ~12%, respectively. Moreover, YOWO is the first and only single-stage architecture that provides competitive results on AVA dataset. We make our code and pretrained models publicly available.</div></details></td>
<td>UCF101-24数据集,YOWO (16-frame)模型,frame-mAP under IoU threshold of 0.5=80.4</td>
<td><a href="https://github.com/zwtu/YOWO-Paddle">快速开始</a></td>
</tr>
<tr>
<td>30</td>
<td><a href="https://paperswithcode.com/paper/token-shift-transformer-for-video">Token Shift Transformer for Video Classification</a></td>
<td><details><summary>Abstract</summary><div>Transformer achieves remarkable successes in understanding 1 and 2-dimensional signals (e.g., NLP and Image Content Understanding). As a potential alternative to convolutional neural networks, it shares merits of strong interpretability, high discriminative power on hyper-scale data, and flexibility in processing varying length inputs. However, its encoders naturally contain computational intensive operations such as pair-wise self-attention, incurring heavy computational burden when being applied on the complex 3-dimensional video signals. This paper presents Token Shift Module (i.e., TokShift), a novel, zero-parameter, zero-FLOPs operator, for modeling temporal relations within each transformer encoder. Specifically, the TokShift barely temporally shifts partial [Class] token features back-and-forth across adjacent frames. Then, we densely plug the module into each encoder of a plain 2D vision transformer for learning 3D video representation. It is worth noticing that our TokShift transformer is a pure convolutional-free video transformer pilot with computational efficiency for video understanding. Experiments on standard benchmarks verify its robustness, effectiveness, and efficiency. Particularly, with input clips of 8/12 frames, the TokShift transformer achieves SOTA precision: 79.83%/80.40% on the Kinetics-400, 66.56% on EGTEA-Gaze+, and 96.80% on UCF-101 datasets, comparable or better than existing SOTA convolutional counterparts. Our code is open-sourced in: https://github.com/VideoNetworks/TokShift-Transformer.</div></details></td>
<td>UCF101数据集,无预训练模型条件下,8x256x256输入尺寸,Top1=91.60(Token Shift 操作的简化示意见表格下方)</td>
<td><a href="https://github.com/zwtu/TokShift-Transformer-Paddle">快速开始</a></td>
</tr>
<tr>
<td>31</td>
<td><a href="https://paperswithcode.com/paper/cross-lingual-language-model-pretraining">XLM: Cross-lingual Language Model Pretraining</a></td>
<td><details><summary>Abstract</summary><div>This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.</div></details></td>
<td>XNLI测试集average accuracy=75.1%(见论文Table 1)</td>
<td><a href="https://github.com/JunnYu/xlm_paddle">快速开始</a></td>
</tr>
<tr>
<td>32</td>
<td><a href="https://paperswithcode.com/paper/a-closer-look-at-few-shot-classification-1">A Closer Look at Few-shot Classification</a></td>
<td><details><summary>Abstract</summary><div>Few-shot classification aims to learn a classifier to recognize unseen classes during training with limited labeled examples. While significant progress has been made, the growing complexity of network designs, meta-learning algorithms, and differences in implementation details make a fair comparison difficult. In this paper, we present 1) a consistent comparative analysis of several representative few-shot classification algorithms, with results showing that deeper backbones significantly reduce the performance differences among methods on datasets with limited domain differences, 2) a modified baseline method that surprisingly achieves competitive performance when compared with the state-of-the-art on both the \miniI and the CUB datasets, and 3) a new experimental setting for evaluating the cross-domain generalization ability for few-shot classification algorithms. Our results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones. In a realistic cross-domain evaluation setting, we show that a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.</div></details></td>
<td>-</td>
<td><a href="https://github.com/Lieberk/Paddle-FSL-Baseline">快速开始</a></td>
</tr>
<tr>
<td>33</td>
<td><a href="https://paperswithcode.com/paper/matching-networks-for-one-shot-learning">Matching Networks for One Shot Learning</a></td>
<td><details><summary>Abstract</summary><div>Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.</div></details></td>
<td>Omniglot数据集,k-way=5, n-shot=1, 精度98.1(注意力分类规则的简化示意见表格下方)</td>
<td><a href="https://github.com/Lieberk/Paddle-MatchingNet">快速开始</a></td>
</tr>
<tr>
<td>34</td>
<td><a href="https://paperswithcode.com/paper/few-shot-learning-with-graph-neural-networks-1">Few-Shot Learning with Graph Neural Networks</a></td>
<td><details><summary>Abstract</summary><div>We propose to study the problem of few-shot learning with the prism of inference on a partially observed graphical model, constructed from a collection of input images whose label can be either observed or not. By assimilating generic message-passing inference algorithms with their neural-network counterparts, we define a graph neural network architecture that generalizes several of the recently proposed few-shot learning models. Besides providing improved numerical performance, our framework is easily extended to variants of few-shot learning, such as semi-supervised or active learning, demonstrating the ability of graph-based models to operate well on ‘relational’ tasks.</div></details></td>
<td>-</td>
<td><a href="https://github.com/Lieberk/Paddle-FSL-GNN">快速开始</a></td>
</tr>
<tr>
<td>35</td>
<td><a href="https://arxiv.org/pdf/2011.10566v1.pdf">Exploring Simple Siamese Representation Learning</a></td>
<td><details><summary>Abstract</summary><div>Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our "SimSiam" method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code will be made available.</div></details></td>
<td>满足下面二者之一即可:1. bs 256的情况下,top1-acc 68.3%;2. bs 512的情况下,top1-acc 68.1%。(对称 stop-gradient 损失的简化示意见表格下方)</td>
<td><a href="https://github.com/Mixture-of-Rookie/PASSL/tree/simsiam">快速开始</a></td>
</tr>
<tr>
<td>36</td>
<td><a href="https://arxiv.org/pdf/2006.09882v5.pdf">Unsupervised Learning of Visual Features by Contrasting Cluster Assignments</a></td>
<td><details><summary>Abstract</summary><div>Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a swapped prediction mechanism where we predict the cluster assignment of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.</div></details></td>
<td>swav 2x224 + 6x96 ImageNet-1k 100epoch linear top1 acc 72.1%</td>
<td><a href="https://github.com/PaddlePaddle/PASSL/pull/120">快速开始</a></td>
</tr>
<tr>
<td>37</td>
<td><a href="https://arxiv.org/pdf/2011.09157v2.pdf">Dense Contrastive Learning for Self-Supervised Visual Pre-Training</a></td>
<td><details><summary>Abstract</summary><div>To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation. Code is available at: https://git.io/AdelaiDet</div></details></td>
<td>densecl_resnet50_8xb32-coslr-200epoch ImageNet-1k linear  top1 acc 63.62%</td>
<td><a href="https://github.com/PaddlePaddle/PASSL/pull/118">快速开始</a></td>
</tr>
<tr>
<td>38</td>
<td><a href="https://paperswithcode.com/paper/constructing-stronger-and-faster-baselines">EfficientGCN: Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition</a></td>
<td><details><summary>Abstract</summary><div>One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the complexity of the recent State-Of-The-Art (SOTA) models for this task tends to be exceedingly sophisticated and over-parameterized. The low efficiency in model training and inference has increased the validation costs of model architectures in large-scale datasets. To address the above issue, recent advanced separable convolutional layers are embedded into an early fused Multiple Input Branches (MIB) network, constructing an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition. In addition, based on such the baseline, we design a compound scaling strategy to expand the model's width and depth synchronously, and eventually obtain a family of efficient GCN baselines with high accuracies and small amounts of trainable parameters, termed EfficientGCN-Bx, where "x" denotes the scaling coefficient. On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other SOTA methods, e.g., achieving 91.7% accuracy on the cross-subject benchmark of NTU 60 dataset, while being 3.15x smaller and 3.21x faster than MS-G3D, which is one of the best SOTA methods. The source code in PyTorch version and the pretrained models are available at https://github.com/yfsong0709/EfficientGCNv1.</div></details></td>
<td>NTU RGB+D 60数据集,EfficientGCN-B0模型,X-sub=90.2% X-view=94.9%</td>
<td><a href="https://github.com/Wuxiao85/paddle_EfficientGCNv">快速开始</a></td>
</tr>
<tr>
<td>39</td>
<td><a href="https://paperswithcode.com/paper/canine-pre-training-an-efficient-tokenization">CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation</a></td>
<td><details><summary>Abstract</summary><div>Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.</div></details></td>
<td>CANINE-S模型,TyDi QA Passage Selection Task上F1=66.0,TyDi QA Minimal Answer Span Task上F1=52.5(见论文Table 2)</td>
<td><a href="https://github.com/kevinng77/canine_paddle">快速开始</a></td>
</tr>
<tr>
<td>40</td>
<td><a href="https://paperswithcode.com/paper/infoxlm-an-information-theoretic-framework">INFOXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training</a></td>
<td><details><summary>Abstract</summary><div>In this work, we present an information-theoretic framework that formulates cross-lingual language model pre-training as maximizing mutual information between multilingual-multi-granularity texts. The unified view helps us to better understand the existing methods for learning cross-lingual representations. More importantly, inspired by the framework, we propose a new pre-training task based on contrastive learning. Specifically, we regard a bilingual sentence pair as two views of the same meaning and encourage their encoded representations to be more similar than the negative examples. By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models. Experimental results on several benchmarks show that our approach achieves considerably better performance. The code and pre-trained models are available at https://aka.ms/infoxlm.</div></details></td>
<td>Tatoeba测试集 cross-lingual sentence retrieval avg(xx to en; en to xx)分别达到77.8、80.6(见论文Table 2);XNLI测试集 transfer gap score=10.3(见论文Table 5)。(跨语言对比损失的简化示意见表格下方)</td>
<td><a href="https://github.com/jiamingkong/infoxlm_paddle">快速开始</a></td>
</tr>
</table>
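
上表序号 30 的 TokShift 模块,其核心是只把 [Class] token 的部分通道在相邻帧之间前后平移,不引入任何参数与额外 FLOPs。下面用 NumPy 给出该操作的一个简化示意:张量布局、平移比例与边界补零方式均为本说明的假设写法,并非论文官方实现,细节请以原论文与原仓库为准。

```python
import numpy as np

def token_shift(x, shift_ratio=0.25):
    """x: [B, T, N, C],约定下标 0 的 token 为 [Class] token。
    将 [Class] token 的前 1/4 通道向后一帧平移、再 1/4 通道向前一帧平移,其余通道保持不变。"""
    out = x.copy()
    fold = int(x.shape[-1] * shift_ratio)
    cls = x[:, :, 0, :]                                         # [B, T, C]
    out[:, 1:, 0, :fold] = cls[:, :-1, :fold]                   # 沿时间正方向平移
    out[:, :-1, 0, fold:2 * fold] = cls[:, 1:, fold:2 * fold]   # 沿时间负方向平移
    out[:, 0, 0, :fold] = 0                                     # 边界帧补零(一种常见处理方式)
    out[:, -1, 0, fold:2 * fold] = 0
    return out

x = np.random.default_rng(0).normal(size=(2, 8, 197, 768))      # 8 帧、ViT-Base 规格的示例输入
print(token_shift(x).shape)                                     # (2, 8, 197, 768)
```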
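
上表序号 33 的 Matching Networks 在推理时,用查询样本与支撑集样本嵌入之间的相似度做 softmax 注意力,再对支撑集标签加权投票得到类别分布。以下是该分类规则的简化 NumPy 示意(省略了嵌入网络与 FCE,函数名与数据均为演示用假设):

```python
import numpy as np

def matching_net_predict(query, support, support_labels, num_classes):
    """query: [D];support: [K, D];support_labels: [K],取值为类别编号。"""
    q = query / np.linalg.norm(query)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    sim = s @ q                                    # 余弦相似度 [K]
    attn = np.exp(sim) / np.exp(sim).sum()         # softmax 注意力
    onehot = np.eye(num_classes)[support_labels]   # [K, C]
    probs = attn @ onehot                          # 注意力加权的类别分布
    return int(probs.argmax()), probs

rng = np.random.default_rng(0)
support = rng.normal(size=(5, 64))                 # 5-way 1-shot 支撑集
labels = np.arange(5)
query = support[2] + 0.1 * rng.normal(size=64)     # 与第 2 类相近的查询样本
print(matching_net_predict(query, support, labels, num_classes=5))
```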
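
上表序号 35 的 SimSiam 依靠"停止梯度 + 负余弦相似度"的对称损失避免表征坍塌。以下为该损失的简化 NumPy 示意,仅演示数值计算;真实训练需在深度学习框架中对 z 分支调用 detach()/stop_gradient,函数与变量名均为假设:

```python
import numpy as np

def negative_cosine(p, z):
    """p 为 predictor 输出,z 为另一分支 encoder 的输出(实际实现中需停止其梯度)。"""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -(p * z).sum(axis=1).mean()

def simsiam_loss(p1, z1, p2, z2):
    # 对称损失:L = D(p1, stopgrad(z2)) / 2 + D(p2, stopgrad(z1)) / 2
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)

rng = np.random.default_rng(0)
p1, p2 = rng.normal(size=(8, 128)), rng.normal(size=(8, 128))   # 两个增广视图经 predictor 后的输出
z1, z2 = rng.normal(size=(8, 128)), rng.normal(size=(8, 128))   # 两个增广视图经 encoder 后的输出
print(simsiam_loss(p1, z1, p2, z2))
```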
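
上表序号 40 的 INFOXLM 引入跨语言对比任务:把一对互译句子视为正样本、批内其余句子视为负样本,用 InfoNCE 形式的损失拉近互译句子的表示。以下为该损失的简化 NumPy 示意(温度系数等超参数为假设值,非官方实现):

```python
import numpy as np

def info_nce(src_emb, tgt_emb, temperature=0.1):
    """src_emb[i] 与 tgt_emb[i] 互为翻译(正样本),同批其余句子作为负样本。"""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    logits = src @ tgt.T / temperature               # [B, B] 相似度矩阵
    logits -= logits.max(axis=1, keepdims=True)      # 数值稳定
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # 对角线元素即正样本对

rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(16, 256)), rng.normal(size=(16, 256))
print(info_nce(src, tgt))
```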
## 飞桨PP系列模型
针对用户产业实践中的痛点问题,飞桨打造了PP系列模型,实现模型精度与预测效率的最佳平衡,满足企业落地实际需求。
|PaddleClas|PP-LCNetv2|基于PP-LCNet优化的轻量级SOTA骨干网络,在ImageNet 1k分类数据集上,精度可达77.04%,相较MobileNetV3-Large x1.25精度提高0.64个百分点,同时在 Intel CPU 硬件上,预测速度可达 230 FPS,相比 MobileNetV3-Large x1.25 预测速度提高 20%|[快速开始](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.5/docs/zh_CN/models/ImageNet1k/PP-LCNetV2.md)|
|PaddleClas|PP-HGNet|GPU高性能骨干网络,在ImageNet 1k分类数据集上,精度可达79.83%、81.51%,同等速度下,相较ResNet34-D提高3.8个百分点,相较ResNet50-D提高2.4个百分点,在使用百度自研 SSLD 蒸馏策略后,精度相较ResNet50-D提高4.7个百分点。|[快速开始](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.5/docs/zh_CN/models/ImageNet1k/PP-HGNet.md)|
|PaddleClas|PP-ShiTu|轻量图像识别系统,集成了目标检测、特征学习、图像检索等模块,广泛适用于各类图像识别任务,CPU上0.2s即可完成在10w+库的图像识别。|[快速开始](https://github.com/PaddlePaddle/PaddleClas/tree/release/2.3#pp-shitu%E5%9B%BE%E5%83%8F%E8%AF%86%E5%88%AB%E7%B3%BB%E7%BB%9F%E4%BB%8B%E7%BB%8D)|
|PaddleClas|PP-ShiTuV2|PP-ShiTuV2 是基于 PP-ShiTuV1 改进的一个实用轻量级通用图像识别系统,由主体检测、特征提取、向量检索三个模块构成,相比 PP-ShiTuV1 具有更高的识别精度、更强的泛化能力以及相近的推理速度。|[快速开始](https://github.com/PaddlePaddle/PaddleClas/blob/release/2.5/docs/zh_CN/models/PP-ShiTu/README.md)|
|PaddleDetection|PP-YOLO|基于YOLOv3优化的高精度目标检测模型,精度达到45.9%,在单卡V100上FP32推理速度为72.9 FPS, V100上开启TensorRT下FP16推理速度为155.6 FPS。|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/configs/ppyolo/README_cn.md)|
|PaddleDetection|PP-YOLOv2|高精度目标检测模型,对比PP-YOLO, 精度提升 3.6%,达到49.5%;在 640*640 的输入尺寸下,速度可实现68.9FPS,采用 TensorRT 加速,FPS 还可达到106.5FPS。|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/configs/ppyolo/README_cn.md)|
|PaddleDetection|PP-YOLOE|高精度云边一体SOTA目标检测模型,提供s/m/l/x版本,l版本COCO test2017数据集精度51.4%,V100预测速度78.1 FPS,支持混合精度训练,训练较PP-YOLOv2加速33%,全系列多尺度模型满足不同硬件算力需求,可适配服务器、边缘端GPU及其他服务器端AI加速卡。|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/configs/ppyoloe/README_cn.md)|
|PaddleDetection|PP-YOLOE+|PP-YOLOE升级版,最高精度提升2.4% mAP,达到54.9% mAP,模型训练收敛速度提升3.75倍,端到端预测速度最高提升2.3倍;多个下游任务泛化性提升。|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/ppyoloe)|
|PaddleDetection|PP-PicoDet|超轻量级目标检测模型,提供xs/s/m/l四种尺寸,其中s版本参数量仅1.18m,却可达到32.5%mAP,相较YOLOX-Nano精度高6.7%,速度快26%,同时优化量化部署方案,实现在移动端部署加速30%+。|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.4/configs/picodet)|
|PaddleDetection|PP-Tracking|实时多目标跟踪工具,融合目标检测、行人重识别、轨迹融合等核心能力,提供行人车辆跟踪、跨镜头跟踪、多类别跟踪、小目标跟踪及流量计数等能力与产业应用。|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.4/deploy/pptracking/README_cn.md)|
|PaddleDetection|PP-TinyPose|超轻量级人体关键点检测算法,单人场景FP16推理可达到122FPS、精度51.8%AP,具有精度高速度快、检测人数无限制、微小目标效果好的特点。|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.4/configs/keypoint/tiny_pose)|
|PaddleDetection|PP-TinyPose+|PP-TinyPose升级版,在健身、舞蹈等场景的业务数据集端到端AP提升9.1;新增体育场景真实数据,复杂动作识别效果显著提升;覆盖侧身、卧躺、跳跃、高抬腿等非常规动作;检测模型升级为[PP-PicoDet增强版](https://github.com/PaddlePaddle/PaddleDetection/blob/ede22043927a944bb4cbea0e9455dd9c91b295f0/configs/picodet/README.md),在COCO数据集上精度提升3.1%;关键点稳定性增强;新增滤波稳定方式,视频预测结果更加稳定平滑|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/blob/ede22043927a944bb4cbea0e9455dd9c91b295f0/configs/keypoint/tiny_pose/README.md)|
|PaddleDetection|PP-Human|产业级实时行人分析工具,支持属性分析、行为识别、流量计数/轨迹留存三大功能,覆盖目标检测、多目标跟踪、属性识别、关键点检测、行为识别和跨镜跟踪六大核心技术。|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.4/deploy/pphuman)|
|PaddleDetection|PP-HumanV2|新增打架、打电话、抽烟、闯入四大行为识别,底层算法性能升级,覆盖行人检测、跟踪、属性三类核心算法能力,提供保姆级全流程开发及模型优化策略。|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/deploy/pipeline)|
|PaddleDetection|PP-Vehicle|提供车牌识别、车辆属性分析(颜色、车型)、车流量统计以及违章检测四大功能,完善的文档教程支持高效完成二次开发与模型优化。|[快速开始](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/deploy/pipeline)|
|PaddleSeg|PP-HumanSeg|PP-HumanSeg是在大规模人像数据上训练的人像分割系列模型,提供了多种模型,满足在Web端、移动端、服务端多种使用场景的需求。其中PP-HumanSeg-Lite采用轻量级网络设计、连通性学习策略、非结构化稀疏技术,实现体积、速度和精度的SOTA平衡。(参数量137K,速度达95FPS,mIoU达93%)|[快速开始](https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.6/contrib/PP-HumanSeg/README_cn.md)|
|PaddleSeg|PP-HumanSegV2|PP-HumanSegV2是PP-HumanSeg的改进版本,肖像分割模型的推理耗时减小45.5%、mIoU提升3.03%、可视化效果更佳,通用人像分割模型的推理速度和精度也有明显提升。|[快速开始](https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.6/contrib/PP-HumanSeg/README_cn.md)|
|PaddleSeg|PP-HumanMatting|PP-HumanMatting通过低分辨粗预测和高分辨率Refine的两阶段设计,在增加小量计算量的情况下,有效保持了高分辨率(>2048)人像扣图中细节信息。|[快速开始](https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.6/Matting/README_CN.md)|
|PaddleSeg|PP-LiteSeg|PP-LiteSeg是通用轻量级语义分割模型,使用灵活高效的解码模块、统一注意力融合模块、轻量的上下文模块,针对Nvidia GPU上的产业级分割任务,实现精度和速度的SOTA平衡。在1080ti上精度为mIoU 72.0(Cityscapes数据集)时,速度高达273.6 FPS|[快速开始](https://github.com/PaddlePaddle/PaddleSeg/tree/release/2.6/configs/pp_liteseg)|
|PaddleSeg|PP-Matting|PP-Matting 通过引导流设计,实现语义引导下的高分辨率细节预测,进而实现trimap-free高精度图像抠图。在公开数据集Composition-1k和Distinctions-646测试集取得了SOTA的效果 。|[快速开始](https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.6/Matting/README_CN.md)|
|PaddleSeg|PP-MattingV2|PP-MattingV2是PaddleSeg自研的轻量级抠图SOTA模型,通过双层金字塔池化及空间注意力提取高级语义信息,并利用多级特征融合机制兼顾语义和细节的预测。 对比MODNet模型推理速度提升44.6%, 误差平均相对减小17.91%。追求更高速度,推荐使用该模型。|[快速开始](https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.7/Matting/README_CN.md)|
|PaddleOCR|PP-OCR|PP-OCR是一个两阶段超轻量OCR系统,包括文本检测、方向分类器、文本识别三个部分,支持竖排文本识别。PP-OCR mobile中英文模型3.5M,英文数字模型2.8M。在通用场景下达到产业级SOTA标准(Python 调用示意见本表下方)|[快速开始](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.5/doc/doc_ch/quickstart.md)|
|PaddleOCR|PP-OCRv2|PP-OCRv2在PP-OCR的基础上进行优化,平衡PP-OCR模型的精度和速度,效果相比PP-OCR mobile 提升7%;推理速度相比于PP-OCR server提升220%;支持80种多语言模型|[快速开始](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.5/doc/doc_ch/quickstart.md)|
|PaddleOCR|PP-OCRv3|PP-OCRv3进一步在原先系统上优化,在中文场景效果相比于PP-OCRv2再提升5%,英文场景提升11%,80语种多语言模型平均识别准确率提升5%以上|[快速开始](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.5/doc/doc_ch/quickstart.md)|
|PaddleOCR|PP-Structure|PP-Structure是一套智能文档分析系统,支持版面分析、表格识别(含Excel导出)、关键信息提取与DocVQA(含语义实体识别和关系抽取)|[快速开始](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.5/ppstructure/docs/quickstart.md)|
|PaddleOCR|PP-StructureV2|基于PP-Structure系统功能性能全面升级,适配中文场景,新增支持版面复原,支持一行命令完成PDF转Word|[快速开始](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/docs/quickstart.md)|
|PaddleGAN|PP-MSVSR|高精度视频超分算法,提供1.45M和7.4M两种参数量大小的模型,峰值信噪比与结构相似度均高于其他开源算法,以PSNR 32.53、SSIM 0.9083达到业界SOTA,同时对输入视频的分辨率不限制,支持分辨率一次提升400%。|[快速开始](https://github.com/PaddlePaddle/PaddleGAN/blob/develop/docs/zh_CN/tutorials/video_super_resolution.md)|
|PaddleVideo|PP-TSM|高精度2D实用视频分类模型PP-TSM。在不增加参数量和计算量的情况下,在UCF-101、Kinetics-400等数据集上精度显著超过TSM原始模型|[快速开始](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/zh-CN/model_zoo/recognition/pp-tsm.md)|
|PaddleVideo|PP-TSMv2|PP-TSMv2沿用了部分PP-TSM的优化策略,从骨干网络与预训练模型选择、数据增强、tsm模块调优、输入帧数优化、解码速度优化、dml蒸馏、新增时序attention模块等7个方面进行模型调优,在中心采样评估方式下,精度达到75.16%,输入10s视频在CPU端的推理速度仅需456ms。|[快速开始](https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/zh-CN/quick_start.md)|
|PaddleNLP|ERNIE-M|面向多语言建模的预训练模型,ERNIE-M 提出基于回译机制,从单语语料中学习语言间的语义对齐关系,在跨语言自然语言推断、语义检索、语义相似度、命名实体识别、阅读理解等各种跨语言下游任务中取得了 SOTA 效果。|[快速开始](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-m)|
|PaddleNLP|ERNIE-UIE|通用信息抽取模型,实现了实体抽取、关系抽取、事件抽取、情感分析等任务的统一建模,并使得不同任务间具备良好的迁移和泛化能力。支持文本、跨模态文档的信息抽取。支持中、英、中英混合文本抽取。零样本和小样本能力卓越。|[快速开始](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie)|
|PaddleNLP|ERNIE 3.0-Medium|文本领域预训练模型,在文心大模型 ERNIE 3.0 基础上通过在线蒸馏技术得到的轻量级模型,CLUE 评测验证其在同等规模模型(6-layer, 768-hidden, 12-heads)中效果SOTA。|[快速开始](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-3.0)|
|PaddleNLP|ERNIE 3.0-Mini|文本领域预训练模型,在文心大模型 ERNIE 3.0 基础上通过在线蒸馏技术得到的轻量级模型,CLUE 评测验证其在同等规模模型(6-layer, 384-hidden, 12-heads)中效果SOTA。|[快速开始](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-3.0)|
|PaddleNLP|ERNIE 3.0-Micro|文本领域预训练模型,在文心大模型 ERNIE 3.0 基础上通过在线蒸馏技术得到的轻量级模型,CLUE 评测验证其在同等规模模型(4-layer, 384-hidden, 12-heads)中效果SOTA。|[快速开始](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-3.0)|
|PaddleNLP|ERNIE 3.0-Nano|文本领域预训练模型,在文心大模型 ERNIE 3.0 基础上通过在线蒸馏技术得到的轻量级模型,CLUE 评测验证其在同等规模模型(4-layer, 312-hidden, 12-heads)中效果SOTA。|[快速开始](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-3.0)|
|PaddleNLP|ERNIE-Layout|多语言跨模态布局增强文档智能大模型,将布局知识增强技术融入跨模态文档预训练,在4项文档理解任务上刷新世界最好效果,登顶 DocVQA 榜首。|[快速开始](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout)|
|PaddleNLP|ERNIE-ViL|业界首个融合场景图知识的多模态预训练模型,在包括视觉常识推理、视觉问答、引用表达式理解、跨模态图像检索、跨模态文本检索等 5 项典型多模态任务中刷新了世界最好效果,并在多模态领域权威榜单视觉常识推理任务(VCR)上登顶榜首。|[快速开始](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/ernie_vil)|
|PaddleSpeech|PP-ASR|PP-ASR是一套基于端到端神经网络结构模型的流式语音识别系统,支持实时语音识别服务,支持Language Model解码与个性化识别|[快速开始](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/asr/PPASR_cn.md)|
|PaddleSpeech|PP-TTS|PP-TTS是一套基于端到端神经网络结构的流式语音合成系统,支持流式声学模型与流式声码器,开源快速部署流式合成服务方案|[快速开始](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/PPTTS_cn.md)|
|PaddleSpeech|PP-VPR|PP-VPR是一套声纹提取与检索系统,使用ECAPA-TDNN模型提取声纹特征,识别等错误率(EER,Equal error rate)低至0.95%,并且通过串联Mysql和Milvus,搭建完整的音频检索系统,实现毫秒级声音检索。|[快速开始](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/vpr/PPVPR_cn.md)|
|PaddleSpeech|ERNIE-SAT|语音-语言跨模态大模型文心 ERNIE-SAT 在语音编辑、个性化语音合成以及跨语言的语音合成等多个任务取得了领先效果,可以应用于语音编辑、个性化合成、语音克隆、同传翻译等一系列场景|[快速开始](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3_vctk/ernie_sat)|
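
以 PP-OCR 为例,安装 paddleocr 包后的一种典型 Python 调用方式大致如下。以下仅为示意:图片文件名为假设,参数与返回值结构在不同版本间略有差异,请以上表"快速开始"链接中的官方文档为准。

```python
# 安装(示意):pip install paddlepaddle paddleocr
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # use_angle_cls 开启方向分类器,lang="ch" 使用中英文模型
result = ocr.ocr("doc.jpg", cls=True)           # "doc.jpg" 为假设的待识别图片路径

# 打印每行文本的识别结果与置信度(较新版本的返回结果外层多一层按图片组织的列表)
for line in result[0]:
    box, (text, score) = line
    print(text, score)
```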