Unverified commit ffcde213 authored by chenjian, committed by GitHub

add disco_diffusion_ernievil_base

Parent fc35d0c8
# disco_diffusion_ernievil_base
|Model Name|disco_diffusion_ernievil_base|
| :--- | :---: |
|Category|image - text-to-image generation|
|Network|dd+ERNIE-ViL|
|Dataset|-|
|Fine-tuning supported|No|
|Model Size|2.9GB|
|Latest Update|2022-08-02|
|Metrics|-|
## I. Basic Information
### Application Effects
- Input text: "小桥流水人家" (roughly, "a little bridge, a flowing stream, a cottage")
- Output image
<p align="center">
  <img src="https://user-images.githubusercontent.com/22424850/183010362-ec76fa49-5170-462f-8fc8-4353fd648924.png" width = "80%" hspace='10'/>
</p>
- Generation process
<p align="center">
  <img src="https://user-images.githubusercontent.com/22424850/183010368-834e6388-411b-4e73-9bc4-a0ac97cb58d5.gif" width = "80%" hspace='10'/>
</p>
### Module Introduction
disco_diffusion_ernievil_base is a text-to-image generation module that produces images matching the semantics of an input sentence. It consists of two parts. The first is a diffusion model, a generative model that reconstructs an image from noise. The second is a multimodal pre-trained model (ERNIE-ViL) that embeds text and images into a shared feature space, where semantically similar texts and images lie close together. During generation, the diffusion model iteratively produces images starting from random noise (or a user-specified initial image), while ERNIE-ViL steers each iteration so that the semantics of the generated image move closer to those of the input text; after enough guided iterations, the final image depicts the content described by the text. The ERNIE-ViL model used in this module is composed of ERNIE 3.0 and a ViT image encoder.

For more details, please refer to the paper: [Diffusion Models Beat GANs on Image Synthesis](https://arxiv.org/abs/2105.05233)
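
To make the guidance loop concrete, below is a toy, self-contained sketch of CLIP-style guidance, not this module's actual implementation: the gradient of a text-image similarity score with respect to the current image nudges each denoising step toward the prompt. The encoders here are random stand-ins for ERNIE-ViL, and the diffusion denoiser itself is omitted.

```python
import paddle

# Stand-in "image tower": in the real module this is ERNIE-ViL's ViT encoder.
image_proj = paddle.nn.Linear(3 * 64 * 64, 128)
# Stand-in text embedding: in the real module this comes from ERNIE 3.0.
text_emb = paddle.randn([128])
text_emb = text_emb / paddle.linalg.norm(text_emb)

x = paddle.randn([1, 3, 64, 64])  # the current noisy image x_t
guidance_scale = 5000             # cf. the module's clip_guidance_scale parameter

for _ in range(10):
    x.stop_gradient = False
    img_emb = image_proj(x.flatten(start_axis=1))[0]
    img_emb = img_emb / paddle.linalg.norm(img_emb)
    similarity = (img_emb * text_emb).sum()  # cosine similarity to the prompt
    grad = paddle.grad(similarity, x)[0]     # direction that raises similarity
    # The real loop also applies the diffusion denoiser at every step;
    # only the guidance nudge is shown here.
    x = (x + 1e-6 * guidance_scale * grad).detach()
```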
## II. Installation
- ### 1. Environment Dependencies
  - paddlepaddle >= 2.0.0
  - paddlehub >= 2.2.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
- ### 2. Installation
  - ```shell
    $ hub install disco_diffusion_ernievil_base
    ```
  - If you run into trouble during installation, please refer to: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md) | [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md)
## III. Module API Prediction
- ### 1. Command line Prediction
  - ```shell
    $ hub run disco_diffusion_ernievil_base --text_prompts "孤舟蓑笠翁,独钓寒江雪。风格如齐白石所作。" --output_dir disco_diffusion_ernievil_base_out
    ```
- ### 2. Prediction Code Example
  - ```python
    import paddlehub as hub

    module = hub.Module(name="disco_diffusion_ernievil_base")
    text_prompts = ["孤舟蓑笠翁,独钓寒江雪。"]
    # Generate images; by default they are saved under the disco_diffusion_ernievil_base_out directory.
    # The returned da is a DocumentArray object holding all results, including the final image and the intermediate results of every iteration.
    # You can post-process, save, or analyze the generated images by operating on the DocumentArray.
    da = module.generate_image(text_prompts=text_prompts, artist='齐白石', output_dir='./disco_diffusion_ernievil_base_out/')
    # Manually save the final image to a given path.
    da[0].save_uri_to_file('disco_diffusion_ernievil_base_out-result.png')
    # Show all intermediate results.
    da[0].chunks.plot_image_sprites(skip_empty=True, show_index=True, keep_aspect_ratio=True)
    # Save the whole generation process as an animated gif.
    da[0].chunks.save_gif('disco_diffusion_ernievil_base_out-result.gif', show_index=True, inline_display=True, size_ratio=0.5)
    ```
- ### 3. API
  - ```python
    def generate_image(
            text_prompts,
            style: Optional[str] = None,
            artist: Optional[str] = None,
            width_height: Optional[List[int]] = [1280, 768],
            seed: Optional[int] = None,
            output_dir: Optional[str] = 'disco_diffusion_ernievil_base_out'):
    ```
  - Text-to-image generation API that generates an image depicting the content of the input text.
  - **Parameters**
    - text_prompts(str): the input sentence describing the desired image. A usually effective recipe is "a descriptive sentence" + "the name of an artist", e.g. "孤舟蓑笠翁,独钓寒江雪。风格如齐白石所作" (a lone old fisherman on a solitary boat, fishing alone in the cold river snow, in the style of Qi Baishi).
    - style(Optional[str]): the painting style, such as ink wash, oil, or watercolor. If not specified, the style is determined entirely by your prompt.
    - artist(Optional[str]): a specific artist, such as 齐白石 (Qi Baishi) or Greg Rutkowski; the image will be generated in that artist's style. If not specified, the style is determined entirely by your prompt. For artist styles, see [this reference](https://weirdwonderfulai.art/resources/disco-diffusion-70-plus-artist-studies/).
    - width_height(Optional[List[int]]): the width and height of the final output image; both must be multiples of 64. The larger the image, the longer the computation takes.
    - seed(Optional[int]): random seed. Since the default input is random Gaussian noise, different seeds produce different initial inputs and hence different results; set this parameter to obtain different output images.
    - output_dir(Optional[str]): directory for saving output images, "disco_diffusion_ernievil_base_out" by default.
  - **Return**
    - da(DocumentArray): a DocumentArray object containing `n_batches` Documents, each of which stores all intermediate results of the iteration. See the [DocumentArray documentation](https://docarray.jina.ai/fundamentals/documentarray/index.html) for details.
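## IV. Server Deployment
- The module defines a `@serving` method, so it can presumably be deployed as an online text-to-image service with PaddleHub Serving. The commands and endpoint below follow PaddleHub's standard serving conventions and are a hedged sketch rather than text from the original README.
- ### Step 1: Start the PaddleHub Serving
  - ```shell
    $ hub serving start -m disco_diffusion_ernievil_base
    ```
- ### Step 2: Send a predictive request
  - ```python
    import json

    import requests
    from docarray import Document

    # The serving method accepts a list of prompts and returns one
    # base64-serialized Document per prompt.
    data = {'text_prompts': ['孤舟蓑笠翁,独钓寒江雪。']}
    headers = {'Content-type': 'application/json'}
    url = 'http://127.0.0.1:8866/predict/disco_diffusion_ernievil_base'
    r = requests.post(url=url, headers=headers, data=json.dumps(data))

    # Restore the Document and save the generated image.
    doc = Document.from_base64(r.json()['results'][0])
    doc.save_uri_to_file('disco_diffusion_ernievil_base_serving-result.png')
    ```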
## V. Release Note
* 1.0.0

  First release

  ```shell
  $ hub install disco_diffusion_ernievil_base==1.0.0
  ```
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import ast
import os
import sys
from functools import partial
from typing import List
from typing import Optional
import paddle
from disco_diffusion_ernievil_base import resize_right
from disco_diffusion_ernievil_base.reverse_diffusion import create
from disco_diffusion_ernievil_base.vit_b_16x import ernievil2
import paddlehub as hub
from paddlehub.module.module import moduleinfo
from paddlehub.module.module import runnable
from paddlehub.module.module import serving
@moduleinfo(name="disco_diffusion_ernievil_base",
version="1.0.0",
type="image/text_to_image",
summary="",
author="paddlepaddle",
author_email="paddle-dev@baidu.com")
class DiscoDiffusionClip:
def generate_image(self,
text_prompts,
style: Optional[str] = None,
artist: Optional[str] = None,
init_image: Optional[str] = None,
width_height: Optional[List[int]] = [1280, 768],
skip_steps: Optional[int] = 0,
steps: Optional[int] = 250,
cut_ic_pow: Optional[int] = 1,
init_scale: Optional[int] = 1000,
clip_guidance_scale: Optional[int] = 5000,
tv_scale: Optional[int] = 0,
range_scale: Optional[int] = 0,
sat_scale: Optional[int] = 0,
cutn_batches: Optional[int] = 4,
diffusion_sampling_mode: Optional[str] = 'ddim',
perlin_init: Optional[bool] = False,
perlin_mode: Optional[str] = 'mixed',
seed: Optional[int] = None,
eta: Optional[float] = 0.8,
clamp_grad: Optional[bool] = True,
clamp_max: Optional[float] = 0.05,
randomize_class: Optional[bool] = True,
clip_denoised: Optional[bool] = False,
fuzzy_prompt: Optional[bool] = False,
rand_mag: Optional[float] = 0.05,
cut_overview: Optional[str] = '[12]*400+[4]*600',
cut_innercut: Optional[str] = '[4]*400+[12]*600',
cut_icgray_p: Optional[str] = '[0.2]*400+[0]*600',
display_rate: Optional[int] = 10,
n_batches: Optional[int] = 1,
batch_size: Optional[int] = 1,
batch_name: Optional[str] = '',
use_gpu: Optional[bool] = True,
output_dir: Optional[str] = 'disco_diffusion_ernievil_base_out'):
"""
Create Disco Diffusion artworks and save the result into a DocumentArray.
:param text_prompts: Phrase, sentence, or string of words and phrases describing what the image should look like. The words will be analyzed by the AI and will guide the diffusion process toward the image(s) you describe. These can include commas and weights to adjust the relative importance of each element. E.g. "A beautiful painting of a singular lighthouse, shining its light across a tumultuous sea of blood by greg rutkowski and thomas kinkade, Trending on artstation."Notice that this prompt loosely follows a structure: [subject], [prepositional details], [setting], [meta modifiers and artist]; this is a good starting point for your experiments. Developing text prompts takes practice and experience, and is not the subject of this guide. If you are a beginner to writing text prompts, a good place to start is on a simple AI art app like Night Cafe, starry ai or WOMBO prior to using DD, to get a feel for how text gets translated into images by GAN tools. These other apps use different technologies, but many of the same principles apply.
:param init_image: Recall that in the image sequence above, the first image shown is just noise. If an init_image is provided, diffusion will replace the noise with the init_image as its starting state. To use an init_image, upload the image to the Colab instance or your Google Drive, and enter the full image path here. If using an init_image, you may need to increase skip_steps to ~ 50% of total steps to retain the character of the init. See skip_steps above for further discussion.
:param style: Image style, such as oil paintings, if specified, style will be used to construct prompts.
:param artist: Artist style, if specified, style will be used to construct prompts.
:param width_height: Desired final image size, in pixels. You can have a square, wide, or tall image, but each edge length should be set to a multiple of 64px, and a minimum of 512px on the default CLIP model setting. If you forget to use multiples of 64px in your dimensions, DD will adjust the dimensions of your image to make it so.
:param skip_steps: Consider the chart shown here. Noise scheduling (denoise strength) starts very high and progressively gets lower and lower as diffusion steps progress. The noise levels in the first few steps are very high, so images change dramatically in early steps.As DD moves along the curve, noise levels (and thus the amount an image changes per step) declines, and image coherence from one step to the next increases.The first few steps of denoising are often so dramatic that some steps (maybe 10-15% of total) can be skipped without affecting the final image. You can experiment with this as a way to cut render times.If you skip too many steps, however, the remaining noise may not be high enough to generate new content, and thus may not have ‘time left’ to finish an image satisfactorily.Also, depending on your other settings, you may need to skip steps to prevent CLIP from overshooting your goal, resulting in ‘blown out’ colors (hyper saturated, solid white, or solid black regions) or otherwise poor image quality. Consider that the denoising process is at its strongest in the early steps, so skipping steps can sometimes mitigate other problems.Lastly, if using an init_image, you will need to skip ~50% of the diffusion steps to retain the shapes in the original init image. However, if you’re using an init_image, you can also adjust skip_steps up or down for creative reasons. With low skip_steps you can get a result "inspired by" the init_image which will retain the colors and rough layout and shapes but look quite different. With high skip_steps you can preserve most of the init_image contents and just do fine tuning of the texture.
:param steps: When creating an image, the denoising curve is subdivided into steps for processing. Each step (or iteration) involves the AI looking at subsets of the image called ‘cuts’ and calculating the ‘direction’ the image should be guided to be more like the prompt. Then it adjusts the image with the help of the diffusion denoiser, and moves to the next step.Increasing steps will provide more opportunities for the AI to adjust the image, and each adjustment will be smaller, and thus will yield a more precise, detailed image. Increasing steps comes at the expense of longer render times. Also, while increasing steps should generally increase image quality, there is a diminishing return on additional steps beyond 250 - 500 steps. However, some intricate images can take 1000, 2000, or more steps. It is really up to the user. Just know that the render time is directly related to the number of steps, and many other parameters have a major impact on image quality, without costing additional time.
:param cut_ic_pow: This sets the size of the border used for inner cuts. High cut_ic_pow values have larger borders, and therefore the cuts themselves will be smaller and provide finer details. If you have too many or too-small inner cuts, you may lose overall image coherency and/or it may cause an undesirable ‘mosaic’ effect. Low cut_ic_pow values will allow the inner cuts to be larger, helping image coherency while still helping with some details.
:param init_scale: This controls how strongly CLIP will try to match the init_image provided. This is balanced against the clip_guidance_scale (CGS) above. Too much init scale, and the image won’t change much during diffusion. Too much CGS and the init image will be lost.
:param clip_guidance_scale: CGS is one of the most important parameters you will use. It tells DD how strongly you want CLIP to move toward your prompt each timestep. Higher is generally better, but if CGS is too strong it will overshoot the goal and distort the image. So a happy medium is needed, and it takes experience to learn how to adjust CGS. Note that this parameter generally scales with image dimensions. In other words, if you increase your total dimensions by 50% (e.g. a change from 512 x 512 to 512 x 768), then to maintain the same effect on the image, you’d want to increase clip_guidance_scale from 5000 to 7500. Of the basic settings, clip_guidance_scale, steps and skip_steps are the most important contributors to image quality, so learn them well.
:param tv_scale: Total variance denoising. Optional, set to zero to turn off. Controls ‘smoothness’ of final output. If used, tv_scale will try to smooth out your final image to reduce overall noise. If your image is too ‘crunchy’, increase tv_scale. TV denoising is good at preserving edges while smoothing away noise in flat regions. See https://en.wikipedia.org/wiki/Total_variation_denoising
:param range_scale: Optional, set to zero to turn off. Used for adjustment of color contrast. Lower range_scale will increase contrast. Very low numbers create a reduced color palette, resulting in more vibrant or poster-like images. Higher range_scale will reduce contrast, for more muted images.
:param sat_scale: Saturation scale. Optional, set to zero to turn off. If used, sat_scale will help mitigate oversaturation. If your image is too saturated, increase sat_scale to reduce the saturation.
:param cutn_batches: Each iteration, the AI cuts the image into smaller pieces known as cuts, and compares each cut to the prompt to decide how to guide the next diffusion step. More cuts can generally lead to better images, since DD has more chances to fine-tune the image precision in each timestep. Additional cuts are memory intensive, however, and if DD tries to evaluate too many cuts at once, it can run out of memory. You can use cutn_batches to increase cuts per timestep without increasing memory usage. At the default settings, DD is scheduled to do 16 cuts per timestep. If cutn_batches is set to 1, there will indeed only be 16 cuts total per timestep. However, if cutn_batches is increased to 4, DD will do 64 cuts total in each timestep, divided into 4 sequential batches of 16 cuts each. Because the cuts are being evaluated only 16 at a time, DD uses the memory required for only 16 cuts, but gives you the quality benefit of 64 cuts. The tradeoff, of course, is that this will take ~4 times as long to render each image.So, (scheduled cuts) x (cutn_batches) = (total cuts per timestep). Increasing cutn_batches will increase render times, however, as the work is being done sequentially. DD’s default cut schedule is a good place to start, but the cut schedule can be adjusted in the Cutn Scheduling section, explained below.
:param diffusion_sampling_mode: Two alternate diffusion denoising algorithms. ddim has been around longer, and is more established and tested. plms is a newly added alternate method that promises good diffusion results in fewer steps, but has not been as fully tested and may have side effects. This new plms mode is actively being researched in the #settings-and-techniques channel in the DD Discord.
:param perlin_init: Normally, DD will use an image filled with random noise as a starting point for the diffusion curve. If perlin_init is selected, DD will instead use a Perlin noise model as an initial state. Perlin has very interesting characteristics, distinct from random noise, so it’s worth experimenting with this for your projects. Beyond perlin, you can, of course, generate your own noise images (such as with GIMP, etc) and use them as an init_image (without skipping steps). Choosing perlin_init does not affect the actual diffusion process, just the starting point for the diffusion. Please note that selecting a perlin_init will replace and override any init_image you may have specified. Further, because the 2D, 3D and video animation systems all rely on the init_image system, if you enable Perlin while using animation modes, the perlin_init will jump in front of any previous image or video input, and DD will NOT give you the expected sequence of coherent images. All of that said, using Perlin and animation modes together do make a very colorful rainbow effect, which can be used creatively.
:param perlin_mode: sets type of Perlin noise: colored, gray, or a mix of both, giving you additional options for noise types. Experiment to see what these do in your projects.
:param seed: Deep in the diffusion code, there is a random number ‘seed’ which is used as the basis for determining the initial state of the diffusion. By default, this is random, but you can also specify your own seed. This is useful if you like a particular result and would like to run more iterations that will be similar. After each run, the actual seed value used will be reported in the parameters report, and can be reused if desired by entering seed # here. If a specific numerical seed is used repeatedly, the resulting images will be quite similar but not identical.
:param eta: eta (greek letter η) is a diffusion model variable that mixes in a random amount of scaled noise into each timestep. 0 is no noise, 1.0 is more noise. As with most DD parameters, you can go below zero for eta, but it may give you unpredictable results. The steps parameter has a close relationship with the eta parameter. If you set eta to 0, then you can get decent output with only 50-75 steps. Setting eta to 1.0 favors higher step counts, ideally around 250 and up. eta has a subtle, unpredictable effect on image, so you’ll need to experiment to see how this affects your projects.
:param clamp_grad: As I understand it, clamp_grad is an internal limiter that stops DD from producing extreme results. Try your images with and without clamp_grad. If the image changes drastically with clamp_grad turned off, it probably means your clip_guidance_scale is too high and should be reduced.
:param clamp_max: Sets the value of the clamp_grad limitation. Default is 0.05, providing for smoother, more muted coloration in images, but setting higher values (0.15-0.3) can provide interesting contrast and vibrancy.
:param fuzzy_prompt: Controls whether to add multiple noisy prompts to the prompt losses. If True, can increase variability of image output. Experiment with this.
:param rand_mag: Affects only the fuzzy_prompt. Controls the magnitude of the random noise added by fuzzy_prompt.
:param cut_overview: The schedule of overview cuts
:param cut_innercut: The schedule of inner cuts
        :param cut_icgray_p: The schedule of the percentage of inner cuts rendered in grayscale, e.g. '[0.2]*400+[0]*600' renders 20% of the inner cuts in grayscale for the first 40% of the run and none afterwards. Grayscale cuts push the guidance toward structure rather than color.
:param display_rate: During a diffusion run, you can monitor the progress of each image being created with this variable. If display_rate is set to 50, DD will show you the in-progress image every 50 timesteps. Setting this to a lower value, like 5 or 10, is a good way to get an early peek at where your image is heading. If you don’t like the progression, just interrupt execution, change some settings, and re-run. If you are planning a long, unmonitored batch, it’s better to set display_rate equal to steps, because displaying interim images does slow Colab down slightly.
:param n_batches: This variable sets the number of still images you want DD to create. If you are using an animation mode (see below for details) DD will ignore n_batches and create a single set of animated frames based on the animation settings.
        :param batch_name: The name of the batch; the batch id will be named as "discoart-[batch_name]-seed". To avoid your artworks being overwritten by other users, please use a unique name.
:param use_gpu: whether to use gpu or not.
:return: a DocumentArray object that has `n_batches` Documents
"""
if use_gpu:
try:
_places = os.environ.get("CUDA_VISIBLE_DEVICES", None)
if _places:
paddle.device.set_device("gpu:{}".format(0))
            except:
                raise RuntimeError(
                    "Environment variable CUDA_VISIBLE_DEVICES is not set correctly. If you want to use GPU, please set CUDA_VISIBLE_DEVICES to the id of a usable CUDA device."
                )
else:
paddle.device.set_device("cpu")
paddle.disable_static()
if not os.path.exists(output_dir):
os.makedirs(output_dir, exist_ok=True)
if isinstance(text_prompts, str):
text_prompts = text_prompts.rstrip(',.,。')
if style is not None:
text_prompts += ",{}".format(style)
if artist is not None:
text_prompts += ",由{}所作".format(artist)
elif isinstance(text_prompts, list):
text_prompts[0] = text_prompts[0].rstrip(',.,。')
if style is not None:
text_prompts[0] += ",{}".format(style)
if artist is not None:
text_prompts[0] += ",由{}所作".format(artist)
return create(text_prompts=text_prompts,
init_image=init_image,
width_height=width_height,
skip_steps=skip_steps,
steps=steps,
cut_ic_pow=cut_ic_pow,
init_scale=init_scale,
clip_guidance_scale=clip_guidance_scale,
tv_scale=tv_scale,
range_scale=range_scale,
sat_scale=sat_scale,
cutn_batches=cutn_batches,
diffusion_sampling_mode=diffusion_sampling_mode,
perlin_init=perlin_init,
perlin_mode=perlin_mode,
seed=seed,
eta=eta,
clamp_grad=clamp_grad,
clamp_max=clamp_max,
randomize_class=randomize_class,
clip_denoised=clip_denoised,
fuzzy_prompt=fuzzy_prompt,
rand_mag=rand_mag,
cut_overview=cut_overview,
cut_innercut=cut_innercut,
cut_icgray_p=cut_icgray_p,
display_rate=display_rate,
n_batches=n_batches,
batch_size=batch_size,
batch_name=batch_name,
clip_models=['vit_b_16x'],
output_dir=output_dir)
@serving
def serving_method(self, text_prompts, **kwargs):
"""
Run as a service.
"""
results = []
for text_prompt in text_prompts:
result = self.generate_image(text_prompts=text_prompt, **kwargs)[0].to_base64()
results.append(result)
return results
@runnable
def run_cmd(self, argvs):
"""
Run as a command.
"""
self.parser = argparse.ArgumentParser(description="Run the {} module.".format(self.name),
prog='hub run {}'.format(self.name),
usage='%(prog)s',
add_help=True)
self.arg_input_group = self.parser.add_argument_group(title="Input options", description="Input data. Required")
self.arg_config_group = self.parser.add_argument_group(
title="Config options", description="Run configuration for controlling module behavior, not required.")
self.add_module_config_arg()
self.add_module_input_arg()
args = self.parser.parse_args(argvs)
results = self.generate_image(text_prompts=args.text_prompts,
style=args.style,
artist=args.artist,
init_image=args.init_image,
width_height=args.width_height,
skip_steps=args.skip_steps,
steps=args.steps,
cut_ic_pow=args.cut_ic_pow,
init_scale=args.init_scale,
clip_guidance_scale=args.clip_guidance_scale,
tv_scale=args.tv_scale,
range_scale=args.range_scale,
sat_scale=args.sat_scale,
cutn_batches=args.cutn_batches,
diffusion_sampling_mode=args.diffusion_sampling_mode,
perlin_init=args.perlin_init,
perlin_mode=args.perlin_mode,
seed=args.seed,
eta=args.eta,
clamp_grad=args.clamp_grad,
clamp_max=args.clamp_max,
randomize_class=args.randomize_class,
clip_denoised=args.clip_denoised,
fuzzy_prompt=args.fuzzy_prompt,
rand_mag=args.rand_mag,
cut_overview=args.cut_overview,
cut_innercut=args.cut_innercut,
cut_icgray_p=args.cut_icgray_p,
display_rate=args.display_rate,
n_batches=args.n_batches,
batch_size=args.batch_size,
batch_name=args.batch_name,
output_dir=args.output_dir)
return results
def add_module_config_arg(self):
"""
Add the command config options.
"""
self.arg_input_group.add_argument(
'--skip_steps',
type=int,
default=0,
help=
'Consider the chart shown here. Noise scheduling (denoise strength) starts very high and progressively gets lower and lower as diffusion steps progress. The noise levels in the first few steps are very high, so images change dramatically in early steps.As DD moves along the curve, noise levels (and thus the amount an image changes per step) declines, and image coherence from one step to the next increases.The first few steps of denoising are often so dramatic that some steps (maybe 10-15%% of total) can be skipped without affecting the final image. You can experiment with this as a way to cut render times.If you skip too many steps, however, the remaining noise may not be high enough to generate new content, and thus may not have ‘time left’ to finish an image satisfactorily.Also, depending on your other settings, you may need to skip steps to prevent CLIP from overshooting your goal, resulting in ‘blown out’ colors (hyper saturated, solid white, or solid black regions) or otherwise poor image quality. Consider that the denoising process is at its strongest in the early steps, so skipping steps can sometimes mitigate other problems.Lastly, if using an init_image, you will need to skip ~50%% of the diffusion steps to retain the shapes in the original init image. However, if you’re using an init_image, you can also adjust skip_steps up or down for creative reasons. With low skip_steps you can get a result "inspired by" the init_image which will retain the colors and rough layout and shapes but look quite different. With high skip_steps you can preserve most of the init_image contents and just do fine tuning of the texture'
)
self.arg_input_group.add_argument(
'--steps',
type=int,
default=250,
help=
"When creating an image, the denoising curve is subdivided into steps for processing. Each step (or iteration) involves the AI looking at subsets of the image called ‘cuts’ and calculating the ‘direction’ the image should be guided to be more like the prompt. Then it adjusts the image with the help of the diffusion denoiser, and moves to the next step.Increasing steps will provide more opportunities for the AI to adjust the image, and each adjustment will be smaller, and thus will yield a more precise, detailed image. Increasing steps comes at the expense of longer render times. Also, while increasing steps should generally increase image quality, there is a diminishing return on additional steps beyond 250 - 500 steps. However, some intricate images can take 1000, 2000, or more steps. It is really up to the user. Just know that the render time is directly related to the number of steps, and many other parameters have a major impact on image quality, without costing additional time."
)
self.arg_input_group.add_argument(
'--cut_ic_pow',
type=int,
default=1,
help=
"This sets the size of the border used for inner cuts. High cut_ic_pow values have larger borders, and therefore the cuts themselves will be smaller and provide finer details. If you have too many or too-small inner cuts, you may lose overall image coherency and/or it may cause an undesirable ‘mosaic’ effect. Low cut_ic_pow values will allow the inner cuts to be larger, helping image coherency while still helping with some details."
)
self.arg_input_group.add_argument(
'--init_scale',
type=int,
default=1000,
help=
"This controls how strongly CLIP will try to match the init_image provided. This is balanced against the clip_guidance_scale (CGS) above. Too much init scale, and the image won’t change much during diffusion. Too much CGS and the init image will be lost."
)
self.arg_input_group.add_argument(
'--clip_guidance_scale',
type=int,
default=5000,
help=
"CGS is one of the most important parameters you will use. It tells DD how strongly you want CLIP to move toward your prompt each timestep. Higher is generally better, but if CGS is too strong it will overshoot the goal and distort the image. So a happy medium is needed, and it takes experience to learn how to adjust CGS. Note that this parameter generally scales with image dimensions. In other words, if you increase your total dimensions by 50% (e.g. a change from 512 x 512 to 512 x 768), then to maintain the same effect on the image, you’d want to increase clip_guidance_scale from 5000 to 7500. Of the basic settings, clip_guidance_scale, steps and skip_steps are the most important contributors to image quality, so learn them well."
)
self.arg_input_group.add_argument(
'--tv_scale',
type=int,
default=0,
help=
"Total variance denoising. Optional, set to zero to turn off. Controls ‘smoothness’ of final output. If used, tv_scale will try to smooth out your final image to reduce overall noise. If your image is too ‘crunchy’, increase tv_scale. TV denoising is good at preserving edges while smoothing away noise in flat regions. See https://en.wikipedia.org/wiki/Total_variation_denoising"
)
self.arg_input_group.add_argument(
'--range_scale',
type=int,
default=0,
help=
"Optional, set to zero to turn off. Used for adjustment of color contrast. Lower range_scale will increase contrast. Very low numbers create a reduced color palette, resulting in more vibrant or poster-like images. Higher range_scale will reduce contrast, for more muted images."
)
self.arg_input_group.add_argument(
'--sat_scale',
type=int,
default=0,
help=
"Saturation scale. Optional, set to zero to turn off. If used, sat_scale will help mitigate oversaturation. If your image is too saturated, increase sat_scale to reduce the saturation."
)
self.arg_input_group.add_argument(
'--cutn_batches',
type=int,
default=4,
help=
"Each iteration, the AI cuts the image into smaller pieces known as cuts, and compares each cut to the prompt to decide how to guide the next diffusion step. More cuts can generally lead to better images, since DD has more chances to fine-tune the image precision in each timestep. Additional cuts are memory intensive, however, and if DD tries to evaluate too many cuts at once, it can run out of memory. You can use cutn_batches to increase cuts per timestep without increasing memory usage. At the default settings, DD is scheduled to do 16 cuts per timestep. If cutn_batches is set to 1, there will indeed only be 16 cuts total per timestep. However, if cutn_batches is increased to 4, DD will do 64 cuts total in each timestep, divided into 4 sequential batches of 16 cuts each. Because the cuts are being evaluated only 16 at a time, DD uses the memory required for only 16 cuts, but gives you the quality benefit of 64 cuts. The tradeoff, of course, is that this will take ~4 times as long to render each image.So, (scheduled cuts) x (cutn_batches) = (total cuts per timestep). Increasing cutn_batches will increase render times, however, as the work is being done sequentially. DD’s default cut schedule is a good place to start, but the cut schedule can be adjusted in the Cutn Scheduling section, explained below."
)
self.arg_input_group.add_argument(
'--diffusion_sampling_mode',
type=str,
default='ddim',
help=
"Two alternate diffusion denoising algorithms. ddim has been around longer, and is more established and tested. plms is a newly added alternate method that promises good diffusion results in fewer steps, but has not been as fully tested and may have side effects. This new plms mode is actively being researched in the #settings-and-techniques channel in the DD Discord."
)
self.arg_input_group.add_argument(
'--perlin_init',
type=bool,
default=False,
help=
"Normally, DD will use an image filled with random noise as a starting point for the diffusion curve. If perlin_init is selected, DD will instead use a Perlin noise model as an initial state. Perlin has very interesting characteristics, distinct from random noise, so it’s worth experimenting with this for your projects. Beyond perlin, you can, of course, generate your own noise images (such as with GIMP, etc) and use them as an init_image (without skipping steps). Choosing perlin_init does not affect the actual diffusion process, just the starting point for the diffusion. Please note that selecting a perlin_init will replace and override any init_image you may have specified. Further, because the 2D, 3D and video animation systems all rely on the init_image system, if you enable Perlin while using animation modes, the perlin_init will jump in front of any previous image or video input, and DD will NOT give you the expected sequence of coherent images. All of that said, using Perlin and animation modes together do make a very colorful rainbow effect, which can be used creatively."
)
self.arg_input_group.add_argument(
'--perlin_mode',
type=str,
default='mixed',
help=
"sets type of Perlin noise: colored, gray, or a mix of both, giving you additional options for noise types. Experiment to see what these do in your projects."
)
self.arg_input_group.add_argument(
'--seed',
type=int,
default=None,
help=
"Deep in the diffusion code, there is a random number ‘seed’ which is used as the basis for determining the initial state of the diffusion. By default, this is random, but you can also specify your own seed. This is useful if you like a particular result and would like to run more iterations that will be similar. After each run, the actual seed value used will be reported in the parameters report, and can be reused if desired by entering seed # here. If a specific numerical seed is used repeatedly, the resulting images will be quite similar but not identical."
)
self.arg_input_group.add_argument(
'--eta',
type=float,
default=0.8,
help=
"eta (greek letter η) is a diffusion model variable that mixes in a random amount of scaled noise into each timestep. 0 is no noise, 1.0 is more noise. As with most DD parameters, you can go below zero for eta, but it may give you unpredictable results. The steps parameter has a close relationship with the eta parameter. If you set eta to 0, then you can get decent output with only 50-75 steps. Setting eta to 1.0 favors higher step counts, ideally around 250 and up. eta has a subtle, unpredictable effect on image, so you’ll need to experiment to see how this affects your projects."
)
self.arg_input_group.add_argument(
'--clamp_grad',
type=bool,
default=True,
help=
"As I understand it, clamp_grad is an internal limiter that stops DD from producing extreme results. Try your images with and without clamp_grad. If the image changes drastically with clamp_grad turned off, it probably means your clip_guidance_scale is too high and should be reduced."
)
self.arg_input_group.add_argument(
'--clamp_max',
type=float,
default=0.05,
help=
"Sets the value of the clamp_grad limitation. Default is 0.05, providing for smoother, more muted coloration in images, but setting higher values (0.15-0.3) can provide interesting contrast and vibrancy."
)
        self.arg_input_group.add_argument('--randomize_class', type=bool, default=True, help="Whether to randomize the class label at each iteration.")
        self.arg_input_group.add_argument('--clip_denoised', type=bool, default=False, help="Whether to clip the denoised image to the valid range at each step.")
self.arg_input_group.add_argument(
'--fuzzy_prompt',
type=bool,
default=False,
help=
"Controls whether to add multiple noisy prompts to the prompt losses. If True, can increase variability of image output. Experiment with this."
)
        self.arg_input_group.add_argument(
            '--rand_mag',
            type=float,
            default=0.05,
            help="Affects only the fuzzy_prompt. Controls the magnitude of the random noise added by fuzzy_prompt.")
self.arg_input_group.add_argument('--cut_overview',
type=str,
default='[12]*400+[4]*600',
help="The schedule of overview cuts")
self.arg_input_group.add_argument('--cut_innercut',
type=str,
default='[4]*400+[12]*600',
help="The schedule of inner cuts")
        self.arg_input_group.add_argument(
            '--cut_icgray_p',
            type=str,
            default='[0.2]*400+[0]*600',
            help=
            "The schedule of the percentage of inner cuts rendered in grayscale, e.g. '[0.2]*400+[0]*600' renders 20%% of the inner cuts in grayscale for the first 40%% of the run and none afterwards."
        )
self.arg_input_group.add_argument(
'--display_rate',
type=int,
default=10,
help=
"During a diffusion run, you can monitor the progress of each image being created with this variable. If display_rate is set to 50, DD will show you the in-progress image every 50 timesteps. Setting this to a lower value, like 5 or 10, is a good way to get an early peek at where your image is heading. If you don’t like the progression, just interrupt execution, change some settings, and re-run. If you are planning a long, unmonitored batch, it’s better to set display_rate equal to steps, because displaying interim images does slow Colab down slightly."
)
self.arg_config_group.add_argument('--use_gpu',
type=ast.literal_eval,
default=True,
help="whether use GPU or not")
self.arg_config_group.add_argument('--output_dir',
type=str,
default='disco_diffusion_ernievil_base_out',
help='Output directory.')
def add_module_input_arg(self):
"""
Add the command input options.
"""
        self.arg_input_group.add_argument('--text_prompts', type=str, help="Input text prompts describing the image to generate.")
self.arg_input_group.add_argument(
'--style',
type=str,
default=None,
help='Image style, such as oil paintings, if specified, style will be used to construct prompts.')
self.arg_input_group.add_argument('--artist',
type=str,
default=None,
help='Artist style, if specified, style will be used to construct prompts.')
self.arg_input_group.add_argument(
'--init_image',
type=str,
default=None,
help=
"Recall that in the image sequence above, the first image shown is just noise. If an init_image is provided, diffusion will replace the noise with the init_image as its starting state. To use an init_image, upload the image to the Colab instance or your Google Drive, and enter the full image path here. If using an init_image, you may need to increase skip_steps to ~ 50% of total steps to retain the character of the init. See skip_steps above for further discussion."
)
self.arg_input_group.add_argument(
'--width_height',
type=ast.literal_eval,
default=[1280, 768],
help=
"Desired final image size, in pixels. You can have a square, wide, or tall image, but each edge length should be set to a multiple of 64px, and a minimum of 512px on the default CLIP model setting. If you forget to use multiples of 64px in your dimensions, DD will adjust the dimensions of your image to make it so."
)
self.arg_input_group.add_argument(
'--n_batches',
type=int,
default=1,
help=
"This variable sets the number of still images you want DD to create. If you are using an animation mode (see below for details) DD will ignore n_batches and create a single set of animated frames based on the animation settings."
)
self.arg_input_group.add_argument('--batch_size', type=int, default=1, help="Batch size.")
self.arg_input_group.add_argument(
'--batch_name',
type=str,
default='',
help=
'The name of the batch, the batch id will be named as "discoart-[batch_name]-seed". To avoid your artworks be overridden by other users, please use a unique name.'
)
numpy
paddle_lpips==0.1.2
ftfy
docarray>=0.13.29
pyyaml
regex
tqdm
ipywidgets
# ResizeRight (Paddle)
A fully differentiable resize function implemented in Paddle.
This module is based on [assafshocher/ResizeRight](https://github.com/assafshocher/ResizeRight).
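
A minimal usage sketch follows; the import path is an assumption based on how this package is referenced elsewhere in the module.

```python
import paddle

# Import path assumed from the vendored package layout.
from disco_diffusion_ernievil_base.resize_right import resize_right

x = paddle.rand([1, 3, 256, 256])
# Downscale by 2x with the default cubic kernel and antialiasing; gradients
# flow through the resize, so it can sit inside a guidance/training loop.
y = resize_right.resize(x, scale_factors=0.5)
print(y.shape)  # [1, 3, 128, 128]
```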
from math import pi
try:
import paddle
except ImportError:
paddle = None
try:
import numpy
import numpy as np
except ImportError:
numpy = None
if numpy is None and paddle is None:
raise ImportError("Must have either Numpy or PyTorch but both not found")
def set_framework_dependencies(x):
if type(x) is numpy.ndarray:
to_dtype = lambda a: a
fw = numpy
else:
to_dtype = lambda a: paddle.cast(a, x.dtype)
fw = paddle
# eps = fw.finfo(fw.float32).eps
eps = paddle.to_tensor(np.finfo(np.float32).eps)
return fw, to_dtype, eps
def support_sz(sz):
def wrapper(f):
f.support_sz = sz
return f
return wrapper
@support_sz(4)
def cubic(x):
fw, to_dtype, eps = set_framework_dependencies(x)
absx = fw.abs(x)
absx2 = absx**2
absx3 = absx**3
return ((1.5 * absx3 - 2.5 * absx2 + 1.) * to_dtype(absx <= 1.) +
(-0.5 * absx3 + 2.5 * absx2 - 4. * absx + 2.) * to_dtype((1. < absx) & (absx <= 2.)))
@support_sz(4)
def lanczos2(x):
fw, to_dtype, eps = set_framework_dependencies(x)
return (((fw.sin(pi * x) * fw.sin(pi * x / 2) + eps) / ((pi**2 * x**2 / 2) + eps)) * to_dtype(abs(x) < 2))
@support_sz(6)
def lanczos3(x):
fw, to_dtype, eps = set_framework_dependencies(x)
return (((fw.sin(pi * x) * fw.sin(pi * x / 3) + eps) / ((pi**2 * x**2 / 3) + eps)) * to_dtype(abs(x) < 3))
@support_sz(2)
def linear(x):
fw, to_dtype, eps = set_framework_dependencies(x)
return ((x + 1) * to_dtype((-1 <= x) & (x < 0)) + (1 - x) * to_dtype((0 <= x) & (x <= 1)))
@support_sz(1)
def box(x):
fw, to_dtype, eps = set_framework_dependencies(x)
return to_dtype((-1 <= x) & (x < 0)) + to_dtype((0 <= x) & (x <= 1))
import warnings
from fractions import Fraction
from math import ceil
from typing import Tuple
import disco_diffusion_ernievil_base.resize_right.interp_methods as interp_methods
class NoneClass:
pass
try:
import paddle
from paddle import nn
nnModuleWrapped = nn.Layer
except ImportError:
    warnings.warn('No Paddle found, will work only with Numpy')
paddle = None
nnModuleWrapped = NoneClass
try:
import numpy
import numpy as np
except ImportError:
    warnings.warn('No Numpy found, will work only with Paddle')
numpy = None
if numpy is None and paddle is None:
raise ImportError("Must have either Numpy or PyTorch but both not found")
def resize(input,
scale_factors=None,
out_shape=None,
interp_method=interp_methods.cubic,
support_sz=None,
antialiasing=True,
by_convs=False,
scale_tolerance=None,
max_numerator=10,
pad_mode='constant'):
# get properties of the input tensor
in_shape, n_dims = input.shape, input.ndim
# fw stands for framework that can be either numpy or paddle,
# determined by the input type
fw = numpy if type(input) is numpy.ndarray else paddle
eps = np.finfo(np.float32).eps if fw == numpy else paddle.to_tensor(np.finfo(np.float32).eps)
device = input.place if fw is paddle else None
    # set missing scale factors or output shape, one according to the other;
    # raise an error if both are missing. this is also where all the default
    # policies take place. also handling the by_convs attribute carefully.
scale_factors, out_shape, by_convs = set_scale_and_out_sz(in_shape, out_shape, scale_factors, by_convs,
scale_tolerance, max_numerator, eps, fw)
# sort indices of dimensions according to scale of each dimension.
# since we are going dim by dim this is efficient
sorted_filtered_dims_and_scales = [(dim, scale_factors[dim], by_convs[dim], in_shape[dim], out_shape[dim])
for dim in sorted(range(n_dims), key=lambda ind: scale_factors[ind])
if scale_factors[dim] != 1.]
# unless support size is specified by the user, it is an attribute
# of the interpolation method
if support_sz is None:
support_sz = interp_method.support_sz
# output begins identical to input and changes with each iteration
output = input
# iterate over dims
for (dim, scale_factor, dim_by_convs, in_sz, out_sz) in sorted_filtered_dims_and_scales:
# STEP 1- PROJECTED GRID: The non-integer locations of the projection
# of output pixel locations to the input tensor
projected_grid = get_projected_grid(in_sz, out_sz, scale_factor, fw, dim_by_convs, device)
# STEP 1.5: ANTIALIASING- If antialiasing is taking place, we modify
# the window size and the interpolation method (see inside function)
cur_interp_method, cur_support_sz = apply_antialiasing_if_needed(interp_method, support_sz, scale_factor,
antialiasing)
        # STEP 2- FIELDS OF VIEW: for each output pixel, map the input pixels
        # that influence it. Also calculate needed padding and update grid
        # accordingly
field_of_view = get_field_of_view(projected_grid, cur_support_sz, fw, eps, device)
# STEP 2.5- CALCULATE PAD AND UPDATE: according to the field of view,
# the input should be padded to handle the boundaries, coordinates
# should be updated. actual padding only occurs when weights are
        # applied (step 4). if using by_convs for this dim, then we need to
# calc right and left boundaries for each filter instead.
pad_sz, projected_grid, field_of_view = calc_pad_sz(in_sz, out_sz, field_of_view, projected_grid, scale_factor,
dim_by_convs, fw, device)
# STEP 3- CALCULATE WEIGHTS: Match a set of weights to the pixels in
# the field of view for each output pixel
weights = get_weights(cur_interp_method, projected_grid, field_of_view)
# STEP 4- APPLY WEIGHTS: Each output pixel is calculated by multiplying
# its set of weights with the pixel values in its field of view.
# We now multiply the fields of view with their matching weights.
# We do this by tensor multiplication and broadcasting.
# if by_convs is true for this dim, then we do this action by
# convolutions. this is equivalent but faster.
if not dim_by_convs:
output = apply_weights(output, field_of_view, weights, dim, n_dims, pad_sz, pad_mode, fw)
else:
output = apply_convs(output, scale_factor, in_sz, out_sz, weights, dim, pad_sz, pad_mode, fw)
return output
def get_projected_grid(in_sz, out_sz, scale_factor, fw, by_convs, device=None):
    # we start by having the output coordinates which are just integer locations
    # in the special case when using by_convs, we only need two cycles of grid
# points. the first and last.
grid_sz = out_sz if not by_convs else scale_factor.numerator
out_coordinates = fw_arange(grid_sz, fw, device)
    # This is projecting the output pixel locations in 1d to the input tensor,
    # as non-integer locations.
    # the following formula is derived in the paper
# "From Discrete to Continuous Convolutions" by Shocher et al.
return (out_coordinates / float(scale_factor) + (in_sz - 1) / 2 - (out_sz - 1) / (2 * float(scale_factor)))
def get_field_of_view(projected_grid, cur_support_sz, fw, eps, device):
# for each output pixel, map which input pixels influence it, in 1d.
# we start by calculating the leftmost neighbor, using half of the window
# size (eps is for when boundary is exact int)
left_boundaries = fw_ceil(projected_grid - cur_support_sz / 2 - eps, fw)
# then we simply take all the pixel centers in the field by counting
# window size pixels from the left boundary
ordinal_numbers = fw_arange(ceil(cur_support_sz - eps), fw, device)
return left_boundaries[:, None] + ordinal_numbers
def calc_pad_sz(in_sz, out_sz, field_of_view, projected_grid, scale_factor, dim_by_convs, fw, device):
if not dim_by_convs:
# determine padding according to neighbor coords out of bound.
# this is a generalized notion of padding, when pad<0 it means crop
pad_sz = [-field_of_view[0, 0].item(), field_of_view[-1, -1].item() - in_sz + 1]
# since input image will be changed by padding, coordinates of both
# field_of_view and projected_grid need to be updated
field_of_view += pad_sz[0]
projected_grid += pad_sz[0]
else:
        # only used for by_convs, to calc the boundaries of each filter. the
        # number of distinct convolutions is the numerator of the scale factor
        num_convs, stride = scale_factor.numerator, scale_factor.denominator
        # calculate left and right boundaries for each conv. left can also be
        # negative, right can be bigger than in_sz. such cases imply padding if
        # needed. however if both are in-bounds, it means we need to crop,
        # practically apply the conv only on part of the image.
left_pads = -field_of_view[:, 0]
# next calc is tricky, explanation by rows:
# 1) counting output pixels between the first position of each filter
# to the right boundary of the input
# 2) dividing it by number of filters to count how many 'jumps'
# each filter does
# 3) multiplying by the stride gives us the distance over the input
# coords done by all these jumps for each filter
# 4) to this distance we add the right boundary of the filter when
# placed in its leftmost position. so now we get the right boundary
# of that filter in input coord.
# 5) the padding size needed is obtained by subtracting the rightmost
# input coordinate. if the result is positive padding is needed. if
# negative then negative padding means shaving off pixel columns.
right_pads = (((out_sz - fw_arange(num_convs, fw, device) - 1) # (1)
// num_convs) # (2)
* stride # (3)
+ field_of_view[:, -1] # (4)
- in_sz + 1) # (5)
# in the by_convs case pad_sz is a list of left-right pairs. one per
# each filter
pad_sz = list(zip(left_pads, right_pads))
return pad_sz, projected_grid, field_of_view
def get_weights(interp_method, projected_grid, field_of_view):
# the set of weights per each output pixels is the result of the chosen
# interpolation method applied to the distances between projected grid
# locations and the pixel-centers in the field of view (distances are
# directed, can be positive or negative)
weights = interp_method(projected_grid[:, None] - field_of_view)
# we now carefully normalize the weights to sum to 1 per each output pixel
sum_weights = weights.sum(1, keepdim=True)
sum_weights[sum_weights == 0] = 1
return weights / sum_weights
def apply_weights(input, field_of_view, weights, dim, n_dims, pad_sz, pad_mode, fw):
# for this operation we assume the resized dim is the first one.
# so we transpose and will transpose back after multiplying
tmp_input = fw_swapaxes(input, dim, 0, fw)
# apply padding
tmp_input = fw_pad(tmp_input, fw, pad_sz, pad_mode)
# field_of_view is a tensor of order 2: for each output (1d location
# along cur dim)- a list of 1d neighbors locations.
# note that this whole operations is applied to each dim separately,
# this is why it is all in 1d.
# neighbors = tmp_input[field_of_view] is a tensor of order image_dims+1:
# for each output pixel (this time indicated in all dims), these are the
# values of the neighbors in the 1d field of view. note that we only
# consider neighbors along the current dim, but such set exists for every
# multi-dim location, hence the final tensor order is image_dims+1.
paddle.device.cuda.empty_cache()
neighbors = tmp_input[field_of_view]
# weights is an order 2 tensor: for each output location along 1d- a list
# of weights matching the field of view. we augment it with ones, for
# broadcasting, so that when multiplies some tensor the weights affect
# only its first dim.
tmp_weights = fw.reshape(weights, (*weights.shape, *[1] * (n_dims - 1)))
# now we simply multiply the weights with the neighbors, and then sum
# along the field of view, to get a single value per out pixel
tmp_output = (neighbors * tmp_weights).sum(1)
# we transpose back the resized dim to its original position
return fw_swapaxes(tmp_output, 0, dim, fw)
def apply_convs(input, scale_factor, in_sz, out_sz, weights, dim, pad_sz, pad_mode, fw):
# for this operations we assume the resized dim is the last one.
# so we transpose and will transpose back after multiplying
input = fw_swapaxes(input, dim, -1, fw)
# the stride for all convs is the denominator of the scale factor
stride, num_convs = scale_factor.denominator, scale_factor.numerator
# prepare an empty tensor for the output
tmp_out_shape = list(input.shape)
tmp_out_shape[-1] = out_sz
    tmp_output = fw_empty(tuple(tmp_out_shape), fw, input.place if fw is paddle else None)
# iterate over the conv operations. we have as many as the numerator
# of the scale-factor. for each we need boundaries and a filter.
for conv_ind, (pad_sz, filt) in enumerate(zip(pad_sz, weights)):
# apply padding (we pad last dim, padding can be negative)
pad_dim = input.ndim - 1
tmp_input = fw_pad(input, fw, pad_sz, pad_mode, dim=pad_dim)
# apply convolution over last dim. store in the output tensor with
        # positional strides so that when the loop is complete conv results are
        # interleaved
tmp_output[..., conv_ind::num_convs] = fw_conv(tmp_input, filt, stride)
return fw_swapaxes(tmp_output, -1, dim, fw)
def set_scale_and_out_sz(in_shape, out_shape, scale_factors, by_convs, scale_tolerance, max_numerator, eps, fw):
# eventually we must have both scale-factors and out-sizes for all in/out
# dims. however, we support many possible partial arguments
if scale_factors is None and out_shape is None:
raise ValueError("either scale_factors or out_shape should be "
"provided")
if out_shape is not None:
        # if out_shape has fewer dims than in_shape, we by default resize the
# first dims for numpy and last dims for paddle
out_shape = (list(out_shape) +
list(in_shape[len(out_shape):]) if fw is numpy else list(in_shape[:-len(out_shape)]) +
list(out_shape))
if scale_factors is None:
# if no scale given, we calculate it as the out to in ratio
            # (not recommended)
scale_factors = [out_sz / in_sz for out_sz, in_sz in zip(out_shape, in_shape)]
if scale_factors is not None:
# by default, if a single number is given as scale, we assume resizing
# two dims (most common are images with 2 spatial dims)
scale_factors = (scale_factors if isinstance(scale_factors, (list, tuple)) else [scale_factors, scale_factors])
        # if fewer scale_factors than in_shape dims, we by default resize the
# first dims for numpy and last dims for paddle
scale_factors = (list(scale_factors) + [1] * (len(in_shape) - len(scale_factors)) if fw is numpy else [1] *
(len(in_shape) - len(scale_factors)) + list(scale_factors))
if out_shape is None:
# when no out_shape given, it is calculated by multiplying the
            # scale by the in_shape (not recommended)
out_shape = [ceil(scale_factor * in_sz) for scale_factor, in_sz in zip(scale_factors, in_shape)]
# next part intentionally after out_shape determined for stability
# we fix by_convs to be a list of truth values in case it is not
if not isinstance(by_convs, (list, tuple)):
by_convs = [by_convs] * len(out_shape)
# next loop fixes the scale for each dim to be either frac or float.
# this is determined by by_convs and by tolerance for scale accuracy.
for ind, (sf, dim_by_convs) in enumerate(zip(scale_factors, by_convs)):
        # first we fractionalize
if dim_by_convs:
frac = Fraction(1 / sf).limit_denominator(max_numerator)
frac = Fraction(numerator=frac.denominator, denominator=frac.numerator)
# if accuracy is within tolerance scale will be frac. if not, then
# it will be float and the by_convs attr will be set false for
# this dim
if scale_tolerance is None:
scale_tolerance = eps
if dim_by_convs and abs(frac - sf) < scale_tolerance:
scale_factors[ind] = frac
else:
scale_factors[ind] = float(sf)
by_convs[ind] = False
return scale_factors, out_shape, by_convs
def apply_antialiasing_if_needed(interp_method, support_sz, scale_factor, antialiasing):
# antialiasing is "stretching" the field of view according to the scale
# factor (only for downscaling). this is low-pass filtering. this
# requires modifying both the interpolation (stretching the 1d
# function and multiplying by the scale-factor) and the window size.
scale_factor = float(scale_factor)
if scale_factor >= 1.0 or not antialiasing:
return interp_method, support_sz
cur_interp_method = (lambda arg: scale_factor * interp_method(scale_factor * arg))
cur_support_sz = support_sz / scale_factor
return cur_interp_method, cur_support_sz
def fw_ceil(x, fw):
if fw is numpy:
return fw.int_(fw.ceil(x))
else:
return paddle.cast(x.ceil(), dtype='int64')
def fw_floor(x, fw):
if fw is numpy:
return fw.int_(fw.floor(x))
else:
return paddle.cast(x.floor(), dtype='int64')
def fw_cat(x, fw):
if fw is numpy:
return fw.concatenate(x)
else:
return fw.concat(x)
def fw_swapaxes(x, ax_1, ax_2, fw):
if fw is numpy:
return fw.swapaxes(x, ax_1, ax_2)
else:
if ax_1 == -1:
ax_1 = len(x.shape) - 1
if ax_2 == -1:
ax_2 = len(x.shape) - 1
perm0 = list(range(len(x.shape)))
temp = ax_1
perm0[temp] = ax_2
perm0[ax_2] = temp
return fw.transpose(x, perm0)
def fw_pad(x, fw, pad_sz, pad_mode, dim=0):
if pad_sz == (0, 0):
return x
if fw is numpy:
pad_vec = [(0, 0)] * x.ndim
pad_vec[dim] = pad_sz
return fw.pad(x, pad_width=pad_vec, mode=pad_mode)
else:
if x.ndim < 3:
x = x[None, None, ...]
pad_vec = [0] * ((x.ndim - 2) * 2)
pad_vec[0:2] = pad_sz
return fw_swapaxes(fw.nn.functional.pad(fw_swapaxes(x, dim, -1, fw), pad=pad_vec, mode=pad_mode), dim, -1, fw)
def fw_conv(input, filter, stride):
# we want to apply 1d conv to any nd array. the way to do it is to reshape
    # the input to a 4D tensor. first two dims are singletons, 3rd dim stores
# all the spatial dims that we are not convolving along now. then we can
# apply conv2d with a 1xK filter. This convolves the same way all the other
    # dims stored in the 3rd dim. like depthwise conv over these.
# TODO: numpy support
reshaped_input = input.reshape(1, 1, -1, input.shape[-1])
reshaped_output = paddle.nn.functional.conv2d(reshaped_input, filter.view(1, 1, 1, -1), stride=(1, stride))
return reshaped_output.reshape(*input.shape[:-1], -1)
def fw_arange(upper_bound, fw, device):
if fw is numpy:
return fw.arange(upper_bound)
else:
return fw.arange(upper_bound)
def fw_empty(shape, fw, device):
if fw is numpy:
return fw.empty(shape)
else:
return fw.empty(shape=shape)
# Diffusion model (Paddle)
This module implements a diffusion model that accepts a text prompt and outputs images semantically close to the text. The code is rewritten in Paddle and mainly refers to two projects: [jina-ai/discoart](https://github.com/jina-ai/discoart) and [openai/guided-diffusion](https://github.com/openai/guided-diffusion). Thanks for their wonderful work.
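
A minimal usage sketch of the `create` entry point defined below; the import path is an assumption based on how this package is referenced elsewhere in the module, and the first call downloads model weights, which can take a while.

```python
# Hedged sketch: import path and runtime behavior are assumptions.
from disco_diffusion_ernievil_base.reverse_diffusion import create

da = create(text_prompts=['A beautiful painting of a singular lighthouse'],
            width_height=[1280, 768],
            steps=250,
            n_batches=1,
            output_dir='discoart_output')
# Each Document holds the final image plus per-iteration intermediates.
da[0].save_uri_to_file('result.png')
```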
'''
https://github.com/jina-ai/discoart/blob/main/discoart/__init__.py
'''
import os
import warnings
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
__all__ = ['create']
import sys
__resources_path__ = os.path.join(
os.path.dirname(sys.modules.get(__package__).__file__ if __package__ in sys.modules else __file__),
'resources',
)
import gc
# paddle will use the GPU automatically when one is available
import paddle
# download and load models, this will take some time on the first load
from .helper import load_all_models, load_diffusion_model, load_clip_models
model_config, secondary_model = load_all_models('512x512_diffusion_uncond_finetune_008100', use_secondary_model=True)
from typing import TYPE_CHECKING, overload, List, Optional
if TYPE_CHECKING:
from docarray import DocumentArray, Document
_clip_models_cache = {}
# begin_create_overload
@overload
def create(text_prompts: Optional[List[str]] = [
'A beautiful painting of a singular lighthouse, shining its light across a tumultuous sea of blood by greg rutkowski and thomas kinkade, Trending on artstation.',
'yellow color scheme',
],
init_image: Optional[str] = None,
width_height: Optional[List[int]] = [1280, 768],
skip_steps: Optional[int] = 10,
steps: Optional[int] = 250,
cut_ic_pow: Optional[int] = 1,
init_scale: Optional[int] = 1000,
clip_guidance_scale: Optional[int] = 5000,
tv_scale: Optional[int] = 0,
range_scale: Optional[int] = 150,
sat_scale: Optional[int] = 0,
cutn_batches: Optional[int] = 4,
diffusion_model: Optional[str] = '512x512_diffusion_uncond_finetune_008100',
use_secondary_model: Optional[bool] = True,
diffusion_sampling_mode: Optional[str] = 'ddim',
perlin_init: Optional[bool] = False,
perlin_mode: Optional[str] = 'mixed',
seed: Optional[int] = None,
eta: Optional[float] = 0.8,
clamp_grad: Optional[bool] = True,
clamp_max: Optional[float] = 0.05,
randomize_class: Optional[bool] = True,
clip_denoised: Optional[bool] = False,
fuzzy_prompt: Optional[bool] = False,
rand_mag: Optional[float] = 0.05,
cut_overview: Optional[str] = '[12]*400+[4]*600',
cut_innercut: Optional[str] = '[4]*400+[12]*600',
cut_icgray_p: Optional[str] = '[0.2]*400+[0]*600',
display_rate: Optional[int] = 10,
n_batches: Optional[int] = 4,
batch_size: Optional[int] = 1,
batch_name: Optional[str] = '',
clip_models: Optional[list] = ['ViTB32', 'ViTB16', 'RN50'],
output_dir: Optional[str] = 'discoart_output') -> 'DocumentArray':
"""
Create Disco Diffusion artworks and save the result into a DocumentArray.
:param text_prompts: Phrase, sentence, or string of words and phrases describing what the image should look like. The words will be analyzed by the AI and will guide the diffusion process toward the image(s) you describe. These can include commas and weights to adjust the relative importance of each element. E.g. "A beautiful painting of a singular lighthouse, shining its light across a tumultuous sea of blood by greg rutkowski and thomas kinkade, Trending on artstation." Notice that this prompt loosely follows a structure: [subject], [prepositional details], [setting], [meta modifiers and artist]; this is a good starting point for your experiments. Developing text prompts takes practice and experience, and is not the subject of this guide. If you are a beginner to writing text prompts, a good place to start is a simple AI art app like Night Cafe, starry ai or WOMBO prior to using DD, to get a feel for how text gets translated into images by GAN tools. These other apps use different technologies, but many of the same principles apply.
:param init_image: Recall that in the image sequence above, the first image shown is just noise. If an init_image is provided, diffusion will replace the noise with the init_image as its starting state. To use an init_image, upload the image to the Colab instance or your Google Drive, and enter the full image path here. If using an init_image, you may need to increase skip_steps to ~50% of total steps to retain the character of the init. See skip_steps below for further discussion.
:param width_height: Desired final image size, in pixels. You can have a square, wide, or tall image, but each edge length should be set to a multiple of 64px, and a minimum of 512px on the default CLIP model setting. If you forget to use multiples of 64px in your dimensions, DD will adjust the dimensions of your image to make it so.
:param skip_steps: Consider the chart shown here. Noise scheduling (denoise strength) starts very high and progressively gets lower and lower as diffusion steps progress. The noise levels in the first few steps are very high, so images change dramatically in early steps. As DD moves along the curve, noise levels (and thus the amount an image changes per step) decline, and image coherence from one step to the next increases. The first few steps of denoising are often so dramatic that some steps (maybe 10-15% of total) can be skipped without affecting the final image. You can experiment with this as a way to cut render times. If you skip too many steps, however, the remaining noise may not be high enough to generate new content, and thus may not have ‘time left’ to finish an image satisfactorily. Also, depending on your other settings, you may need to skip steps to prevent CLIP from overshooting your goal, resulting in ‘blown out’ colors (hyper saturated, solid white, or solid black regions) or otherwise poor image quality. Consider that the denoising process is at its strongest in the early steps, so skipping steps can sometimes mitigate other problems. Lastly, if using an init_image, you will need to skip ~50% of the diffusion steps to retain the shapes in the original init image. However, if you’re using an init_image, you can also adjust skip_steps up or down for creative reasons. With low skip_steps you can get a result "inspired by" the init_image which will retain the colors and rough layout and shapes but look quite different. With high skip_steps you can preserve most of the init_image contents and just do fine tuning of the texture.
:param steps: When creating an image, the denoising curve is subdivided into steps for processing. Each step (or iteration) involves the AI looking at subsets of the image called ‘cuts’ and calculating the ‘direction’ the image should be guided to be more like the prompt. Then it adjusts the image with the help of the diffusion denoiser, and moves to the next step. Increasing steps will provide more opportunities for the AI to adjust the image, and each adjustment will be smaller, and thus will yield a more precise, detailed image. Increasing steps comes at the expense of longer render times. Also, while increasing steps should generally increase image quality, there is a diminishing return on additional steps beyond 250-500 steps. However, some intricate images can take 1000, 2000, or more steps. It is really up to the user. Just know that the render time is directly related to the number of steps, and many other parameters have a major impact on image quality, without costing additional time.
:param cut_ic_pow: This sets the size of the border used for inner cuts. High cut_ic_pow values have larger borders, and therefore the cuts themselves will be smaller and provide finer details. If you have too many or too-small inner cuts, you may lose overall image coherency and/or it may cause an undesirable ‘mosaic’ effect. Low cut_ic_pow values will allow the inner cuts to be larger, helping image coherency while still helping with some details.
:param init_scale: This controls how strongly CLIP will try to match the init_image provided. This is balanced against the clip_guidance_scale (CGS) above. Too much init scale, and the image won’t change much during diffusion. Too much CGS and the init image will be lost.
:param clip_guidance_scale: CGS is one of the most important parameters you will use. It tells DD how strongly you want CLIP to move toward your prompt each timestep. Higher is generally better, but if CGS is too strong it will overshoot the goal and distort the image. So a happy medium is needed, and it takes experience to learn how to adjust CGS. Note that this parameter generally scales with image dimensions. In other words, if you increase your total dimensions by 50% (e.g. a change from 512 x 512 to 512 x 768), then to maintain the same effect on the image, you’d want to increase clip_guidance_scale from 5000 to 7500. Of the basic settings, clip_guidance_scale, steps and skip_steps are the most important contributors to image quality, so learn them well.
:param tv_scale: Total variance denoising. Optional, set to zero to turn off. Controls ‘smoothness’ of final output. If used, tv_scale will try to smooth out your final image to reduce overall noise. If your image is too ‘crunchy’, increase tv_scale. TV denoising is good at preserving edges while smoothing away noise in flat regions. See https://en.wikipedia.org/wiki/Total_variation_denoising
:param range_scale: Optional, set to zero to turn off. Used for adjustment of color contrast. Lower range_scale will increase contrast. Very low numbers create a reduced color palette, resulting in more vibrant or poster-like images. Higher range_scale will reduce contrast, for more muted images.
:param sat_scale: Saturation scale. Optional, set to zero to turn off. If used, sat_scale will help mitigate oversaturation. If your image is too saturated, increase sat_scale to reduce the saturation.
:param cutn_batches: Each iteration, the AI cuts the image into smaller pieces known as cuts, and compares each cut to the prompt to decide how to guide the next diffusion step. More cuts can generally lead to better images, since DD has more chances to fine-tune the image precision in each timestep. Additional cuts are memory intensive, however, and if DD tries to evaluate too many cuts at once, it can run out of memory. You can use cutn_batches to increase cuts per timestep without increasing memory usage. At the default settings, DD is scheduled to do 16 cuts per timestep. If cutn_batches is set to 1, there will indeed only be 16 cuts total per timestep. However, if cutn_batches is increased to 4, DD will do 64 cuts total in each timestep, divided into 4 sequential batches of 16 cuts each. Because the cuts are being evaluated only 16 at a time, DD uses the memory required for only 16 cuts, but gives you the quality benefit of 64 cuts. The tradeoff, of course, is that this will take ~4 times as long to render each image. So, (scheduled cuts) x (cutn_batches) = (total cuts per timestep). Increasing cutn_batches will increase render times, however, as the work is being done sequentially. DD’s default cut schedule is a good place to start, but the cut schedule can be adjusted in the Cutn Scheduling section, explained below.
:param diffusion_model: Diffusion model of choice.
:param use_secondary_model: Option to use a secondary purpose-made diffusion model to clean up interim diffusion images for CLIP evaluation. If this option is turned off, DD will use the regular (large) diffusion model. Using the secondary model is faster - one user reported a 50% improvement in render speed! However, the secondary model is much smaller, and may reduce image quality and detail. I suggest you experiment with this.
:param diffusion_sampling_mode: Two alternate diffusion denoising algorithms. ddim has been around longer, and is more established and tested. plms is a newly added alternate method that promises good diffusion results in fewer steps, but has not been as fully tested and may have side effects. This new plms mode is actively being researched in the #settings-and-techniques channel in the DD Discord.
:param perlin_init: Normally, DD will use an image filled with random noise as a starting point for the diffusion curve. If perlin_init is selected, DD will instead use a Perlin noise model as an initial state. Perlin has very interesting characteristics, distinct from random noise, so it’s worth experimenting with this for your projects. Beyond perlin, you can, of course, generate your own noise images (such as with GIMP, etc) and use them as an init_image (without skipping steps). Choosing perlin_init does not affect the actual diffusion process, just the starting point for the diffusion. Please note that selecting a perlin_init will replace and override any init_image you may have specified. Further, because the 2D, 3D and video animation systems all rely on the init_image system, if you enable Perlin while using animation modes, the perlin_init will jump in front of any previous image or video input, and DD will NOT give you the expected sequence of coherent images. All of that said, using Perlin and animation modes together does make a very colorful rainbow effect, which can be used creatively.
:param perlin_mode: sets type of Perlin noise: colored, gray, or a mix of both, giving you additional options for noise types. Experiment to see what these do in your projects.
:param seed: Deep in the diffusion code, there is a random number ‘seed’ which is used as the basis for determining the initial state of the diffusion. By default, this is random, but you can also specify your own seed. This is useful if you like a particular result and would like to run more iterations that will be similar. After each run, the actual seed value used will be reported in the parameters report, and can be reused if desired by entering seed # here. If a specific numerical seed is used repeatedly, the resulting images will be quite similar but not identical.
:param eta: eta (greek letter η) is a diffusion model variable that mixes in a random amount of scaled noise into each timestep. 0 is no noise, 1.0 is more noise. As with most DD parameters, you can go below zero for eta, but it may give you unpredictable results. The steps parameter has a close relationship with the eta parameter. If you set eta to 0, then you can get decent output with only 50-75 steps. Setting eta to 1.0 favors higher step counts, ideally around 250 and up. eta has a subtle, unpredictable effect on the image, so you’ll need to experiment to see how this affects your projects.
:param clamp_grad: As I understand it, clamp_grad is an internal limiter that stops DD from producing extreme results. Try your images with and without clamp_grad. If the image changes drastically with clamp_grad turned off, it probably means your clip_guidance_scale is too high and should be reduced.
:param clamp_max: Sets the value of the clamp_grad limitation. Default is 0.05, providing for smoother, more muted coloration in images, but setting higher values (0.15-0.3) can provide interesting contrast and vibrancy.
:param fuzzy_prompt: Controls whether to add multiple noisy prompts to the prompt losses. If True, can increase variability of image output. Experiment with this.
:param rand_mag: Affects only the fuzzy_prompt. Controls the magnitude of the random noise added by fuzzy_prompt.
:param cut_overview: The schedule of overview cuts
:param cut_innercut: The schedule of inner cuts
:param cut_icgray_p: The schedule for the portion of inner cuts that are rendered in grayscale. Evaluating some cuts in grayscale early in the run can encourage the model to attend to structure and composition before committing to color.
:param display_rate: During a diffusion run, you can monitor the progress of each image being created with this variable. If display_rate is set to 50, DD will show you the in-progress image every 50 timesteps. Setting this to a lower value, like 5 or 10, is a good way to get an early peek at where your image is heading. If you don’t like the progression, just interrupt execution, change some settings, and re-run. If you are planning a long, unmonitored batch, it’s better to set display_rate equal to steps, because displaying interim images does slow Colab down slightly.
:param n_batches: This variable sets the number of still images you want DD to create. If you are using an animation mode (see below for details) DD will ignore n_batches and create a single set of animated frames based on the animation settings.
:param batch_name: The name of the batch; the batch ID will be "discoart-[batch_name]-seed". To avoid your artworks being overwritten by other users, please use a unique name.
:param clip_models: CLIP Model selectors: ViTB32, ViTB16, ViTL14, RN101, RN50, RN50x4, RN50x16, RN50x64. These various CLIP models are available for you to use during image generation. Models have different styles or ‘flavors,’ so look around. You can mix in multiple models as well for different results. However, keep in mind that some models are extremely memory-hungry, and turning on additional models will take additional memory and may cause a crash. The rough order of speed/memory usage is (smallest/fastest to largest/slowest): ViTB32, RN50, RN101, ViTB16, RN50x4, RN50x16, RN50x64, ViTL14. For RN50x64 & ViTL14 you may need to use fewer cuts, depending on your VRAM.
:return: a DocumentArray object that has `n_batches` Documents
"""
# end_create_overload
@overload
def create(init_document: 'Document') -> 'DocumentArray':
"""
Create an artwork using a DocArray ``Document`` object as initial state.
:param init_document: its ``.tags`` will be used as parameters, ``.uri`` (if present) will be used as init image.
:return: a DocumentArray object that has `n_batches` Documents
"""
def create(**kwargs) -> 'DocumentArray':
from .config import load_config
from .runner import do_run
if 'init_document' in kwargs:
d = kwargs['init_document']
_kwargs = d.tags
if not _kwargs:
warnings.warn('init_document has no .tags, fallback to default config')
if d.uri:
_kwargs['init_image'] = kwargs['init_document'].uri
else:
warnings.warn('init_document has no .uri, fallback to no init image')
kwargs.pop('init_document')
        if kwargs:
            warnings.warn('extra kwargs were passed along with init_document; they will override matching values from its .tags')
_kwargs.update(kwargs)
_args = load_config(user_config=_kwargs)
else:
_args = load_config(user_config=kwargs)
model, diffusion = load_diffusion_model(model_config, _args.diffusion_model, steps=_args.steps)
clip_models = load_clip_models(enabled=_args.clip_models, clip_models=_clip_models_cache)
gc.collect()
paddle.device.cuda.empty_cache()
try:
return do_run(_args, (model, diffusion, clip_models, secondary_model))
except KeyboardInterrupt:
pass
'''
https://github.com/jina-ai/discoart/blob/main/discoart/config.py
'''
import copy
import random
import warnings
from types import SimpleNamespace
from typing import Dict
import yaml
from yaml import Loader
from . import __resources_path__
with open(f'{__resources_path__}/default.yml') as ymlfile:
default_args = yaml.load(ymlfile, Loader=Loader)
def load_config(user_config: Dict, ):
cfg = copy.deepcopy(default_args)
    if user_config:
        # warn about unknown keys before merging, then only merge known ones,
        # so the "ignored" in the warning is actually true
        for k in user_config.keys():
            if k not in default_args:
                warnings.warn(f'unknown argument {k}, ignored')
        cfg.update(**{k: v for k, v in user_config.items() if k in default_args})
for k, v in cfg.items():
if k in ('batch_size', 'display_rate', 'seed', 'skip_steps', 'steps', 'n_batches',
'cutn_batches') and isinstance(v, float):
cfg[k] = int(v)
if k == 'width_height':
cfg[k] = [int(vv) for vv in v]
cfg.update(**{
        'seed': cfg['seed'] or random.randint(0, 2**32 - 1),
})
if cfg['batch_name']:
da_name = f'{__package__}-{cfg["batch_name"]}-{cfg["seed"]}'
else:
da_name = f'{__package__}-{cfg["seed"]}'
        warnings.warn('you did not set `batch_name`; set it to give this session a unique ID')
cfg.update(**{'name_docarray': da_name})
print_args_table(cfg)
return SimpleNamespace(**cfg)
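# Illustrative usage (comments only): load_config merges user overrides into
# the defaults read from resources/default.yml and returns a SimpleNamespace:
#   args = load_config(user_config={'steps': 150, 'n_batches': 1})
#   args.steps   # -> 150
#   args.seed    # -> the given seed, or a freshly drawn random one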
def print_args_table(cfg):
from rich.table import Table
from rich import box
from rich.console import Console
console = Console()
param_str = Table(
title=cfg['name_docarray'],
box=box.ROUNDED,
highlight=True,
title_justify='left',
)
param_str.add_column('Argument', justify='right')
param_str.add_column('Value', justify='left')
for k, v in sorted(cfg.items()):
value = str(v)
        if default_args.get(k, None) != v:
value = f'[b]{value}[/]'
param_str.add_row(k, value)
console.print(param_str)
'''
This code is rewritten by Paddle based on Jina-ai/discoart.
https://github.com/jina-ai/discoart/blob/main/discoart/helper.py
'''
import hashlib
import logging
import os
import subprocess
import sys
from os.path import expanduser
from pathlib import Path
from typing import Any
from typing import Dict
from typing import List
import paddle
def _get_logger():
logger = logging.getLogger(__package__)
_log_level = os.environ.get('DISCOART_LOG_LEVEL', 'INFO')
logger.setLevel(_log_level)
ch = logging.StreamHandler()
ch.setLevel(_log_level)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)
return logger
logger = _get_logger()
def load_clip_models(enabled: List[str], clip_models: Dict[str, Any] = {}):
import disco_diffusion_ernievil_base.vit_b_16x.ernievil2 as ernievil2
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.utils.utils import build_model
# load enabled models
for k in enabled:
if k not in clip_models:
clip_models[k] = build_model(name=k)
clip_models[k].eval()
for parameter in clip_models[k].parameters():
parameter.stop_gradient = True
    # drop models that are not enabled to save memory
    # (iterate over a copy of the keys, since we mutate the dict)
    for k in list(clip_models.keys()):
        if k not in enabled:
            clip_models.pop(k)
return list(clip_models.values())
def load_all_models(diffusion_model, use_secondary_model):
from .model.script_util import (
model_and_diffusion_defaults, )
model_config = model_and_diffusion_defaults()
if diffusion_model == '512x512_diffusion_uncond_finetune_008100':
model_config.update({
'attention_resolutions': '32, 16, 8',
'class_cond': False,
'diffusion_steps': 1000, # No need to edit this, it is taken care of later.
'rescale_timesteps': True,
'timestep_respacing': 250, # No need to edit this, it is taken care of later.
'image_size': 512,
'learn_sigma': True,
'noise_schedule': 'linear',
'num_channels': 256,
'num_head_channels': 64,
'num_res_blocks': 2,
'resblock_updown': True,
'use_fp16': False,
'use_scale_shift_norm': True,
})
elif diffusion_model == '256x256_diffusion_uncond':
model_config.update({
'attention_resolutions': '32, 16, 8',
'class_cond': False,
'diffusion_steps': 1000, # No need to edit this, it is taken care of later.
'rescale_timesteps': True,
'timestep_respacing': 250, # No need to edit this, it is taken care of later.
'image_size': 256,
'learn_sigma': True,
'noise_schedule': 'linear',
'num_channels': 256,
'num_head_channels': 64,
'num_res_blocks': 2,
'resblock_updown': True,
'use_fp16': False,
'use_scale_shift_norm': True,
})
secondary_model = None
if use_secondary_model:
from .model.sec_diff import SecondaryDiffusionImageNet2
secondary_model = SecondaryDiffusionImageNet2()
model_dict = paddle.load(
os.path.join(os.path.dirname(__file__), 'pre_trained', 'secondary_model_imagenet_2.pdparams'))
secondary_model.set_state_dict(model_dict)
secondary_model.eval()
for parameter in secondary_model.parameters():
parameter.stop_gradient = True
return model_config, secondary_model
def load_diffusion_model(model_config, diffusion_model, steps):
from .model.script_util import (
create_model_and_diffusion, )
timestep_respacing = f'ddim{steps}'
diffusion_steps = (1000 // steps) * steps if steps < 1000 else steps
model_config.update({
'timestep_respacing': timestep_respacing,
'diffusion_steps': diffusion_steps,
})
model, diffusion = create_model_and_diffusion(**model_config)
model.set_state_dict(
paddle.load(os.path.join(os.path.dirname(__file__), 'pre_trained', f'{diffusion_model}.pdparams')))
model.eval()
for name, param in model.named_parameters():
param.stop_gradient = True
return model, diffusion
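# Example (comments only): steps == 250 gives timestep_respacing == 'ddim250'
# and diffusion_steps == (1000 // 250) * 250 == 1000, i.e. the full 1000-step
# training schedule respaced to 250 DDIM sampling steps.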
def parse_prompt(prompt):
if prompt.startswith('http://') or prompt.startswith('https://'):
vals = prompt.rsplit(':', 2)
vals = [vals[0] + ':' + vals[1], *vals[2:]]
else:
vals = prompt.rsplit(':', 1)
vals = vals + ['', '1'][len(vals):]
return vals[0], float(vals[1])
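# Examples (comments only):
#   parse_prompt('a lighthouse at dusk')      -> ('a lighthouse at dusk', 1.0)
#   parse_prompt('a lighthouse at dusk:2')    -> ('a lighthouse at dusk', 2.0)
#   parse_prompt('https://host/img.png:0.5')  -> ('https://host/img.png', 0.5)
# The rsplit(':', 2) branch keeps the '://' of URLs intact while still
# allowing an optional trailing ':weight'.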
"""
Codebase for "Improved Denoising Diffusion Probabilistic Models" implemented by Paddle.
"""
"""
Diffusion model implemented by Paddle.
This code is rewritten based on the PyTorch version of Ho et al's diffusion models:
https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/diffusion_utils_2.py
"""
import enum
import math
import numpy as np
import paddle
from .losses import discretized_gaussian_log_likelihood
from .losses import normal_kl
from .nn import mean_flat
def get_named_beta_schedule(schedule_name, num_diffusion_timesteps):
"""
Get a pre-defined beta schedule for the given name.
The beta schedule library consists of beta schedules which remain similar
in the limit of num_diffusion_timesteps.
Beta schedules may be added, but should not be removed or changed once
they are committed to maintain backwards compatibility.
"""
if schedule_name == "linear":
# Linear schedule from Ho et al, extended to work for any number of
# diffusion steps.
scale = 1000 / num_diffusion_timesteps
beta_start = scale * 0.0001
beta_end = scale * 0.02
return np.linspace(beta_start, beta_end, num_diffusion_timesteps, dtype=np.float64)
elif schedule_name == "cosine":
return betas_for_alpha_bar(
num_diffusion_timesteps,
lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2)**2,
)
else:
raise NotImplementedError(f"unknown beta schedule: {schedule_name}")
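# Illustrative values (comments only): with the "linear" schedule and
# num_diffusion_timesteps = 1000, scale == 1 and betas run linearly from
# 1e-4 to 2e-2; with 250 timesteps, scale == 4 and the endpoints become
# 4e-4 and 8e-2, keeping schedules comparable across step counts.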
def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
"""
Create a beta schedule that discretizes the given alpha_t_bar function,
which defines the cumulative product of (1-beta) over time from t = [0,1].
:param num_diffusion_timesteps: the number of betas to produce.
:param alpha_bar: a lambda that takes an argument t from 0 to 1 and
produces the cumulative product of (1-beta) up to that
part of the diffusion process.
:param max_beta: the maximum beta to use; use values lower than 1 to
prevent singularities.
"""
betas = []
for i in range(num_diffusion_timesteps):
t1 = i / num_diffusion_timesteps
t2 = (i + 1) / num_diffusion_timesteps
betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
return np.array(betas)
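# Example (comments only): with the cosine alpha_bar above and T = 1000, the
# first beta, 1 - alpha_bar(1/1000) / alpha_bar(0), is very small (order 1e-4
# or less), while near the end of the chain the ratio approaches zero and the
# betas are capped at max_beta = 0.999.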
class ModelMeanType(enum.Enum):
"""
Which type of output the model predicts.
"""
PREVIOUS_X = enum.auto() # the model predicts x_{t-1}
START_X = enum.auto() # the model predicts x_0
EPSILON = enum.auto() # the model predicts epsilon
class ModelVarType(enum.Enum):
"""
What is used as the model's output variance.
The LEARNED_RANGE option has been added to allow the model to predict
values between FIXED_SMALL and FIXED_LARGE, making its job easier.
"""
LEARNED = enum.auto()
FIXED_SMALL = enum.auto()
FIXED_LARGE = enum.auto()
LEARNED_RANGE = enum.auto()
class LossType(enum.Enum):
MSE = enum.auto() # use raw MSE loss (and KL when learning variances)
RESCALED_MSE = (enum.auto()) # use raw MSE loss (with RESCALED_KL when learning variances)
KL = enum.auto() # use the variational lower-bound
RESCALED_KL = enum.auto() # like KL, but rescale to estimate the full VLB
def is_vb(self):
return self == LossType.KL or self == LossType.RESCALED_KL
class GaussianDiffusion:
"""
Utilities for training and sampling diffusion models.
Ported directly from here, and then adapted over time to further experimentation.
https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/diffusion_utils_2.py#L42
:param betas: a 1-D numpy array of betas for each diffusion timestep,
starting at T and going to 1.
:param model_mean_type: a ModelMeanType determining what the model outputs.
:param model_var_type: a ModelVarType determining how variance is output.
:param loss_type: a LossType determining the loss function to use.
:param rescale_timesteps: if True, pass floating point timesteps into the
model so that they are always scaled like in the
original paper (0 to 1000).
"""
def __init__(
self,
*,
betas,
model_mean_type,
model_var_type,
loss_type,
rescale_timesteps=False,
):
self.model_mean_type = model_mean_type
self.model_var_type = model_var_type
self.loss_type = loss_type
self.rescale_timesteps = rescale_timesteps
# Use float64 for accuracy.
betas = np.array(betas, dtype=np.float64)
self.betas = betas
assert len(betas.shape) == 1, "betas must be 1-D"
assert (betas > 0).all() and (betas <= 1).all()
self.num_timesteps = int(betas.shape[0])
alphas = 1.0 - betas
self.alphas_cumprod = np.cumprod(alphas, axis=0)
self.alphas_cumprod_prev = np.append(1.0, self.alphas_cumprod[:-1])
self.alphas_cumprod_next = np.append(self.alphas_cumprod[1:], 0.0)
assert self.alphas_cumprod_prev.shape == (self.num_timesteps, )
# calculations for diffusion q(x_t | x_{t-1}) and others
self.sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod)
self.sqrt_one_minus_alphas_cumprod = np.sqrt(1.0 - self.alphas_cumprod)
self.log_one_minus_alphas_cumprod = np.log(1.0 - self.alphas_cumprod)
self.sqrt_recip_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod)
self.sqrt_recipm1_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod - 1)
# calculations for posterior q(x_{t-1} | x_t, x_0)
self.posterior_variance = (betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod))
# log calculation clipped because the posterior variance is 0 at the
# beginning of the diffusion chain.
self.posterior_log_variance_clipped = np.log(np.append(self.posterior_variance[1], self.posterior_variance[1:]))
self.posterior_mean_coef1 = (betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod))
self.posterior_mean_coef2 = ((1.0 - self.alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - self.alphas_cumprod))
def q_mean_variance(self, x_start, t):
"""
Get the distribution q(x_t | x_0).
:param x_start: the [N x C x ...] tensor of noiseless inputs.
:param t: the number of diffusion steps (minus 1). Here, 0 means one step.
:return: A tuple (mean, variance, log_variance), all of x_start's shape.
"""
mean = (_extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start)
variance = _extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape)
log_variance = _extract_into_tensor(self.log_one_minus_alphas_cumprod, t, x_start.shape)
return mean, variance, log_variance
def q_sample(self, x_start, t, noise=None):
"""
Diffuse the data for a given number of diffusion steps.
In other words, sample from q(x_t | x_0).
:param x_start: the initial data batch.
:param t: the number of diffusion steps (minus 1). Here, 0 means one step.
:param noise: if specified, the split-out normal noise.
:return: A noisy version of x_start.
"""
if noise is None:
# noise = th.randn_like(x_start)
noise = paddle.randn(x_start.shape, x_start.dtype)
assert noise.shape == x_start.shape
return (_extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
_extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise)
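    # Sanity note (comments only): this is the closed-form forward marginal
    #   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    # so small t returns an image close to x_start while large t returns
    # nearly pure Gaussian noise.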
def q_posterior_mean_variance(self, x_start, x_t, t):
"""
Compute the mean and variance of the diffusion posterior:
q(x_{t-1} | x_t, x_0)
"""
assert x_start.shape == x_t.shape
posterior_mean = (_extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start +
_extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t)
posterior_variance = _extract_into_tensor(self.posterior_variance, t, x_t.shape)
posterior_log_variance_clipped = _extract_into_tensor(self.posterior_log_variance_clipped, t, x_t.shape)
assert (posterior_mean.shape[0] == posterior_variance.shape[0] == posterior_log_variance_clipped.shape[0] ==
x_start.shape[0])
return posterior_mean, posterior_variance, posterior_log_variance_clipped
def p_mean_variance(self, model, x, t, clip_denoised=True, denoised_fn=None, model_kwargs=None):
"""
Apply the model to get p(x_{t-1} | x_t), as well as a prediction of
the initial x, x_0.
:param model: the model, which takes a signal and a batch of timesteps
as input.
:param x: the [N x C x ...] tensor at time t.
:param t: a 1-D Tensor of timesteps.
:param clip_denoised: if True, clip the denoised signal into [-1, 1].
:param denoised_fn: if not None, a function which applies to the
x_start prediction before it is used to sample. Applies before
clip_denoised.
:param model_kwargs: if not None, a dict of extra keyword arguments to
pass to the model. This can be used for conditioning.
:return: a dict with the following keys:
- 'mean': the model mean output.
- 'variance': the model variance output.
- 'log_variance': the log of 'variance'.
- 'pred_xstart': the prediction for x_0.
"""
if model_kwargs is None:
model_kwargs = {}
B, C = x.shape[:2]
assert t.shape == [B]
model_output = model(x, self._scale_timesteps(t), **model_kwargs)
if self.model_var_type in [ModelVarType.LEARNED, ModelVarType.LEARNED_RANGE]:
assert model_output.shape == [B, C * 2, *x.shape[2:]]
model_output, model_var_values = paddle.split(model_output, 2, axis=1)
if self.model_var_type == ModelVarType.LEARNED:
model_log_variance = model_var_values
model_variance = paddle.exp(model_log_variance)
else:
min_log = _extract_into_tensor(self.posterior_log_variance_clipped, t, x.shape)
max_log = _extract_into_tensor(np.log(self.betas), t, x.shape)
# The model_var_values is [-1, 1] for [min_var, max_var].
frac = (model_var_values + 1) / 2
model_log_variance = frac * max_log + (1 - frac) * min_log
model_variance = paddle.exp(model_log_variance)
else:
model_variance, model_log_variance = {
# for fixedlarge, we set the initial (log-)variance like so
# to get a better decoder log likelihood.
ModelVarType.FIXED_LARGE: (
np.append(self.posterior_variance[1], self.betas[1:]),
np.log(np.append(self.posterior_variance[1], self.betas[1:])),
),
ModelVarType.FIXED_SMALL: (
self.posterior_variance,
self.posterior_log_variance_clipped,
),
}[self.model_var_type]
model_variance = _extract_into_tensor(model_variance, t, x.shape)
model_log_variance = _extract_into_tensor(model_log_variance, t, x.shape)
def process_xstart(x):
if denoised_fn is not None:
x = denoised_fn(x)
if clip_denoised:
return x.clamp(-1, 1)
return x
if self.model_mean_type == ModelMeanType.PREVIOUS_X:
pred_xstart = process_xstart(self._predict_xstart_from_xprev(x_t=x, t=t, xprev=model_output))
model_mean = model_output
elif self.model_mean_type in [ModelMeanType.START_X, ModelMeanType.EPSILON]:
if self.model_mean_type == ModelMeanType.START_X:
pred_xstart = process_xstart(model_output)
else:
pred_xstart = process_xstart(self._predict_xstart_from_eps(x_t=x, t=t, eps=model_output))
model_mean, _, _ = self.q_posterior_mean_variance(x_start=pred_xstart, x_t=x, t=t)
else:
raise NotImplementedError(self.model_mean_type)
assert (model_mean.shape == model_log_variance.shape == pred_xstart.shape == x.shape)
return {
"mean": model_mean,
"variance": model_variance,
"log_variance": model_log_variance,
"pred_xstart": pred_xstart,
}
def _predict_xstart_from_eps(self, x_t, t, eps):
assert x_t.shape == eps.shape
return (_extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t -
_extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * eps)
def _predict_xstart_from_xprev(self, x_t, t, xprev):
assert x_t.shape == xprev.shape
return ( # (xprev - coef2*x_t) / coef1
_extract_into_tensor(1.0 / self.posterior_mean_coef1, t, x_t.shape) * xprev -
_extract_into_tensor(self.posterior_mean_coef2 / self.posterior_mean_coef1, t, x_t.shape) * x_t)
def _predict_eps_from_xstart(self, x_t, t, pred_xstart):
return (_extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t -
pred_xstart) / _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape)
def _scale_timesteps(self, t):
if self.rescale_timesteps:
return paddle.cast((t), 'float32') * (1000.0 / self.num_timesteps)
return t
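    # Example (comments only): with num_timesteps == 250 and
    # rescale_timesteps == True, t == 249 reaches the model as
    # 249 * (1000.0 / 250) == 996.0, matching the 0..1000 scale the
    # released checkpoints were trained on.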
def condition_mean(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
"""
Compute the mean for the previous step, given a function cond_fn that
computes the gradient of a conditional log probability with respect to
x. In particular, cond_fn computes grad(log(p(y|x))), and we want to
condition on y.
This uses the conditioning strategy from Sohl-Dickstein et al. (2015).
"""
gradient = cond_fn(x, self._scale_timesteps(t), **model_kwargs)
new_mean = (paddle.cast((p_mean_var["mean"]), 'float32') + p_mean_var["variance"] * paddle.cast(
(gradient), 'float32'))
return new_mean
def condition_mean_with_grad(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
"""
Compute the mean for the previous step, given a function cond_fn that
computes the gradient of a conditional log probability with respect to
x. In particular, cond_fn computes grad(log(p(y|x))), and we want to
condition on y.
This uses the conditioning strategy from Sohl-Dickstein et al. (2015).
"""
gradient = cond_fn(x, t, p_mean_var, **model_kwargs)
new_mean = (paddle.cast((p_mean_var["mean"]), 'float32') + p_mean_var["variance"] * paddle.cast(
(gradient), 'float32'))
return new_mean
def condition_score(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
"""
Compute what the p_mean_variance output would have been, should the
model's score function be conditioned by cond_fn.
See condition_mean() for details on cond_fn.
Unlike condition_mean(), this instead uses the conditioning strategy
from Song et al (2020).
"""
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
eps = self._predict_eps_from_xstart(x, t, p_mean_var["pred_xstart"])
eps = eps - (1 - alpha_bar).sqrt() * cond_fn(x, self._scale_timesteps(t), **model_kwargs)
out = p_mean_var.copy()
out["pred_xstart"] = self._predict_xstart_from_eps(x, t, eps)
out["mean"], _, _ = self.q_posterior_mean_variance(x_start=out["pred_xstart"], x_t=x, t=t)
return out
def condition_score_with_grad(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
"""
Compute what the p_mean_variance output would have been, should the
model's score function be conditioned by cond_fn.
See condition_mean() for details on cond_fn.
Unlike condition_mean(), this instead uses the conditioning strategy
from Song et al (2020).
"""
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
eps = self._predict_eps_from_xstart(x, t, p_mean_var["pred_xstart"])
eps = eps - (1 - alpha_bar).sqrt() * cond_fn(x, t, p_mean_var, **model_kwargs)
out = p_mean_var.copy()
out["pred_xstart"] = self._predict_xstart_from_eps(x, t, eps)
out["mean"], _, _ = self.q_posterior_mean_variance(x_start=out["pred_xstart"], x_t=x, t=t)
return out
def p_sample(
self,
model,
x,
t,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
):
"""
Sample x_{t-1} from the model at the given timestep.
:param model: the model to sample from.
:param x: the current tensor at x_{t-1}.
:param t: the value of t, starting at 0 for the first diffusion step.
:param clip_denoised: if True, clip the x_start prediction to [-1, 1].
:param denoised_fn: if not None, a function which applies to the
x_start prediction before it is used to sample.
:param cond_fn: if not None, this is a gradient function that acts
similarly to the model.
:param model_kwargs: if not None, a dict of extra keyword arguments to
pass to the model. This can be used for conditioning.
:return: a dict containing the following keys:
- 'sample': a random sample from the model.
- 'pred_xstart': a prediction of x_0.
"""
out = self.p_mean_variance(
model,
x,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
model_kwargs=model_kwargs,
)
# noise = th.randn_like(x)
noise = paddle.randn(x.shape, x.dtype)
nonzero_mask = (paddle.cast((t != 0), 'float32').reshape([-1,
*([1] * (len(x.shape) - 1))])) # no noise when t == 0
if cond_fn is not None:
out["mean"] = self.condition_mean(cond_fn, out, x, t, model_kwargs=model_kwargs)
sample = out["mean"] + nonzero_mask * paddle.exp(0.5 * out["log_variance"]) * noise
return {"sample": sample, "pred_xstart": out["pred_xstart"]}
def p_sample_with_grad(
self,
model,
x,
t,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
):
"""
Sample x_{t-1} from the model at the given timestep.
:param model: the model to sample from.
:param x: the current tensor at x_{t-1}.
:param t: the value of t, starting at 0 for the first diffusion step.
:param clip_denoised: if True, clip the x_start prediction to [-1, 1].
:param denoised_fn: if not None, a function which applies to the
x_start prediction before it is used to sample.
:param cond_fn: if not None, this is a gradient function that acts
similarly to the model.
:param model_kwargs: if not None, a dict of extra keyword arguments to
pass to the model. This can be used for conditioning.
:return: a dict containing the following keys:
- 'sample': a random sample from the model.
- 'pred_xstart': a prediction of x_0.
"""
# with th.enable_grad():
# x = x.detach().requires_grad_()
x = x.detach()
# x.stop_gradient = False
out = self.p_mean_variance(
model,
x,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
model_kwargs=model_kwargs,
)
# noise = th.randn_like(x)
noise = paddle.randn(x.shape, x.dtype)
nonzero_mask = (paddle.cast((t != 0), 'float32').reshape([-1,
*([1] * (len(x.shape) - 1))])) # no noise when t == 0
if cond_fn is not None:
out["mean"] = self.condition_mean_with_grad(cond_fn, out, x, t, model_kwargs=model_kwargs)
sample = out["mean"] + nonzero_mask * paddle.exp(0.5 * out["log_variance"]) * noise
return {"sample": sample, "pred_xstart": out["pred_xstart"].detach()}
def p_sample_loop(
self,
model,
shape,
noise=None,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
device=None,
progress=False,
skip_timesteps=0,
init_image=None,
randomize_class=False,
cond_fn_with_grad=False,
):
"""
Generate samples from the model.
:param model: the model module.
:param shape: the shape of the samples, (N, C, H, W).
:param noise: if specified, the noise from the encoder to sample.
Should be of the same shape as `shape`.
:param clip_denoised: if True, clip x_start predictions to [-1, 1].
:param denoised_fn: if not None, a function which applies to the
x_start prediction before it is used to sample.
:param cond_fn: if not None, this is a gradient function that acts
similarly to the model.
:param model_kwargs: if not None, a dict of extra keyword arguments to
pass to the model. This can be used for conditioning.
:param device: if specified, the device to create the samples on.
If not specified, use a model parameter's device.
:param progress: if True, show a tqdm progress bar.
:return: a non-differentiable batch of samples.
"""
final = None
for sample in self.p_sample_loop_progressive(
model,
shape,
noise=noise,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
cond_fn=cond_fn,
model_kwargs=model_kwargs,
device=device,
progress=progress,
skip_timesteps=skip_timesteps,
init_image=init_image,
randomize_class=randomize_class,
cond_fn_with_grad=cond_fn_with_grad,
):
final = sample
return final["sample"]
def p_sample_loop_progressive(
self,
model,
shape,
noise=None,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
device=None,
progress=False,
skip_timesteps=0,
init_image=None,
randomize_class=False,
cond_fn_with_grad=False,
):
"""
Generate samples from the model and yield intermediate samples from
each timestep of diffusion.
Arguments are the same as p_sample_loop().
Returns a generator over dicts, where each dict is the return value of
p_sample().
"""
if device is None:
device = model.parameters()[0].place
assert isinstance(shape, (tuple, list))
if noise is not None:
img = noise
else:
img = paddle.randn(shape)
if skip_timesteps and init_image is None:
init_image = paddle.zeros_like(img)
indices = list(range(self.num_timesteps - skip_timesteps))[::-1]
if init_image is not None:
my_t = paddle.ones([shape[0]], dtype='int64') * indices[0]
img = self.q_sample(init_image, my_t, img)
if progress:
# Lazy import so that we don't depend on tqdm.
from tqdm.auto import tqdm
indices = tqdm(indices)
for i in indices:
t = paddle.to_tensor([i] * shape[0], place=device)
if randomize_class and 'y' in model_kwargs:
model_kwargs['y'] = paddle.randint(low=0, high=model.num_classes, shape=model_kwargs['y'].shape)
# with paddle.no_grad():
sample_fn = self.p_sample_with_grad if cond_fn_with_grad else self.p_sample
out = sample_fn(
model,
img,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
cond_fn=cond_fn,
model_kwargs=model_kwargs,
)
yield out
img = out["sample"]
def ddim_sample(
self,
model,
x,
t,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
eta=0.0,
):
"""
Sample x_{t-1} from the model using DDIM.
Same usage as p_sample().
"""
out_orig = self.p_mean_variance(
model,
x,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
model_kwargs=model_kwargs,
)
if cond_fn is not None:
out = self.condition_score(cond_fn, out_orig, x, t, model_kwargs=model_kwargs)
else:
out = out_orig
# Usually our model outputs epsilon, but we re-derive it
# in case we used x_start or x_prev prediction.
eps = self._predict_eps_from_xstart(x, t, out["pred_xstart"])
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
alpha_bar_prev = _extract_into_tensor(self.alphas_cumprod_prev, t, x.shape)
sigma = (eta * paddle.sqrt(
(1 - alpha_bar_prev) / (1 - alpha_bar)) * paddle.sqrt(1 - alpha_bar / alpha_bar_prev))
# Equation 12.
# noise = th.randn_like(x)
noise = paddle.randn(x.shape, x.dtype)
mean_pred = (out["pred_xstart"] * paddle.sqrt(alpha_bar_prev) +
paddle.sqrt(1 - alpha_bar_prev - sigma**2) * eps)
nonzero_mask = (paddle.cast((t != 0), 'float32').reshape([-1,
*([1] * (len(x.shape) - 1))])) # no noise when t == 0
sample = mean_pred + nonzero_mask * sigma * noise
return {"sample": sample, "pred_xstart": out_orig["pred_xstart"]}
def ddim_sample_with_grad(
self,
model,
x,
t,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
eta=0.0,
):
"""
Sample x_{t-1} from the model using DDIM.
Same usage as p_sample().
"""
# with th.enable_grad():
# x = x.detach().requires_grad_()
x = x.detach()
# x.stop_gradient = False
out_orig = self.p_mean_variance(
model,
x,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
model_kwargs=model_kwargs,
)
if cond_fn is not None:
out = self.condition_score_with_grad(cond_fn, out_orig, x, t, model_kwargs=model_kwargs)
else:
out = out_orig
out["pred_xstart"] = out["pred_xstart"].detach()
# Usually our model outputs epsilon, but we re-derive it
# in case we used x_start or x_prev prediction.
eps = self._predict_eps_from_xstart(x, t, out["pred_xstart"])
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
alpha_bar_prev = _extract_into_tensor(self.alphas_cumprod_prev, t, x.shape)
sigma = (eta * paddle.sqrt(
(1 - alpha_bar_prev) / (1 - alpha_bar)) * paddle.sqrt(1 - alpha_bar / alpha_bar_prev))
# Equation 12.
# noise = th.randn_like(x)
noise = paddle.randn(x.shape, x.dtype)
mean_pred = (out["pred_xstart"] * paddle.sqrt(alpha_bar_prev) +
paddle.sqrt(1 - alpha_bar_prev - sigma**2) * eps)
nonzero_mask = (paddle.cast((t != 0), 'float32').reshape([-1,
*([1] * (len(x.shape) - 1))])) # no noise when t == 0
sample = mean_pred + nonzero_mask * sigma * noise
return {"sample": sample, "pred_xstart": out_orig["pred_xstart"].detach()}
def ddim_reverse_sample(
self,
model,
x,
t,
clip_denoised=True,
denoised_fn=None,
model_kwargs=None,
eta=0.0,
):
"""
Sample x_{t+1} from the model using DDIM reverse ODE.
"""
assert eta == 0.0, "Reverse ODE only for deterministic path"
out = self.p_mean_variance(
model,
x,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
model_kwargs=model_kwargs,
)
# Usually our model outputs epsilon, but we re-derive it
# in case we used x_start or x_prev prediction.
eps = (_extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x.shape) * x -
out["pred_xstart"]) / _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x.shape)
alpha_bar_next = _extract_into_tensor(self.alphas_cumprod_next, t, x.shape)
# Equation 12. reversed
mean_pred = (out["pred_xstart"] * paddle.sqrt(alpha_bar_next) + paddle.sqrt(1 - alpha_bar_next) * eps)
return {"sample": mean_pred, "pred_xstart": out["pred_xstart"]}
def ddim_sample_loop(
self,
model,
shape,
noise=None,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
device=None,
progress=False,
eta=0.0,
skip_timesteps=0,
init_image=None,
randomize_class=False,
cond_fn_with_grad=False,
):
"""
Generate samples from the model using DDIM.
Same usage as p_sample_loop().
"""
final = None
for sample in self.ddim_sample_loop_progressive(
model,
shape,
noise=noise,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
cond_fn=cond_fn,
model_kwargs=model_kwargs,
device=device,
progress=progress,
eta=eta,
skip_timesteps=skip_timesteps,
init_image=init_image,
randomize_class=randomize_class,
cond_fn_with_grad=cond_fn_with_grad,
):
final = sample
return final["sample"]
def ddim_sample_loop_progressive(
self,
model,
shape,
noise=None,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
device=None,
progress=False,
eta=0.0,
skip_timesteps=0,
init_image=None,
randomize_class=False,
cond_fn_with_grad=False,
):
"""
Use DDIM to sample from the model and yield intermediate samples from
each timestep of DDIM.
Same usage as p_sample_loop_progressive().
"""
# if device is None:
# device = next(model.parameters()).device
if device is None:
device = model.parameters()[0].place
assert isinstance(shape, (tuple, list))
if noise is not None:
img = noise
else:
img = paddle.randn(shape)
if skip_timesteps and init_image is None:
init_image = paddle.zeros_like(img)
indices = list(range(self.num_timesteps - skip_timesteps))[::-1]
if init_image is not None:
my_t = paddle.ones([shape[0]], dtype='int64') * indices[0]
img = self.q_sample(init_image, my_t, img)
if progress:
# Lazy import so that we don't depend on tqdm.
from tqdm.auto import tqdm
indices = tqdm(indices)
for i in indices:
t = paddle.to_tensor([i] * shape[0])
if randomize_class and 'y' in model_kwargs:
model_kwargs['y'] = paddle.randint(
low=0,
high=model.num_classes,
shape=model_kwargs['y'].shape,
)
sample_fn = self.ddim_sample_with_grad if cond_fn_with_grad else self.ddim_sample
out = sample_fn(
model,
img,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
cond_fn=cond_fn,
model_kwargs=model_kwargs,
eta=eta,
)
yield out
img = out["sample"]
def plms_sample(
self,
model,
x,
t,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
cond_fn_with_grad=False,
order=2,
old_out=None,
):
"""
Sample x_{t-1} from the model using Pseudo Linear Multistep.
Same usage as p_sample().
"""
        if order not in (1, 2, 3, 4):
            raise ValueError('order is invalid (should be an int from 1 to 4).')
def get_model_output(x, t):
with paddle.set_grad_enabled(cond_fn_with_grad and cond_fn is not None):
                # paddle tensors have no requires_grad_(); detach and re-enable grad instead
                if cond_fn_with_grad:
                    x = x.detach()
                    x.stop_gradient = False
out_orig = self.p_mean_variance(
model,
x,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
model_kwargs=model_kwargs,
)
if cond_fn is not None:
if cond_fn_with_grad:
out = self.condition_score_with_grad(cond_fn, out_orig, x, t, model_kwargs=model_kwargs)
x = x.detach()
else:
out = self.condition_score(cond_fn, out_orig, x, t, model_kwargs=model_kwargs)
else:
out = out_orig
# Usually our model outputs epsilon, but we re-derive it
# in case we used x_start or x_prev prediction.
eps = self._predict_eps_from_xstart(x, t, out["pred_xstart"])
return eps, out, out_orig
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
alpha_bar_prev = _extract_into_tensor(self.alphas_cumprod_prev, t, x.shape)
eps, out, out_orig = get_model_output(x, t)
if order > 1 and old_out is None:
# Pseudo Improved Euler
old_eps = [eps]
mean_pred = out["pred_xstart"] * paddle.sqrt(alpha_bar_prev) + paddle.sqrt(1 - alpha_bar_prev) * eps
eps_2, _, _ = get_model_output(mean_pred, t - 1)
eps_prime = (eps + eps_2) / 2
pred_prime = self._predict_xstart_from_eps(x, t, eps_prime)
mean_pred = pred_prime * paddle.sqrt(alpha_bar_prev) + paddle.sqrt(1 - alpha_bar_prev) * eps_prime
else:
# Pseudo Linear Multistep (Adams-Bashforth)
old_eps = old_out["old_eps"]
old_eps.append(eps)
cur_order = min(order, len(old_eps))
if cur_order == 1:
eps_prime = old_eps[-1]
elif cur_order == 2:
eps_prime = (3 * old_eps[-1] - old_eps[-2]) / 2
elif cur_order == 3:
eps_prime = (23 * old_eps[-1] - 16 * old_eps[-2] + 5 * old_eps[-3]) / 12
elif cur_order == 4:
eps_prime = (55 * old_eps[-1] - 59 * old_eps[-2] + 37 * old_eps[-3] - 9 * old_eps[-4]) / 24
else:
raise RuntimeError('cur_order is invalid.')
pred_prime = self._predict_xstart_from_eps(x, t, eps_prime)
mean_pred = pred_prime * paddle.sqrt(alpha_bar_prev) + paddle.sqrt(1 - alpha_bar_prev) * eps_prime
if len(old_eps) >= order:
old_eps.pop(0)
nonzero_mask = paddle.cast((t != 0), 'float32').reshape([-1, *([1] * (len(x.shape) - 1))])
sample = mean_pred * nonzero_mask + out["pred_xstart"] * (1 - nonzero_mask)
return {"sample": sample, "pred_xstart": out_orig["pred_xstart"], "old_eps": old_eps}
def plms_sample_loop(
self,
model,
shape,
noise=None,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
device=None,
progress=False,
skip_timesteps=0,
init_image=None,
randomize_class=False,
cond_fn_with_grad=False,
order=2,
):
"""
Generate samples from the model using Pseudo Linear Multistep.
Same usage as p_sample_loop().
"""
final = None
for sample in self.plms_sample_loop_progressive(
model,
shape,
noise=noise,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
cond_fn=cond_fn,
model_kwargs=model_kwargs,
device=device,
progress=progress,
skip_timesteps=skip_timesteps,
init_image=init_image,
randomize_class=randomize_class,
cond_fn_with_grad=cond_fn_with_grad,
order=order,
):
final = sample
return final["sample"]
def plms_sample_loop_progressive(
self,
model,
shape,
noise=None,
clip_denoised=True,
denoised_fn=None,
cond_fn=None,
model_kwargs=None,
device=None,
progress=False,
skip_timesteps=0,
init_image=None,
randomize_class=False,
cond_fn_with_grad=False,
order=2,
):
"""
Use PLMS to sample from the model and yield intermediate samples from each
timestep of PLMS.
Same usage as p_sample_loop_progressive().
"""
if device is None:
device = model.parameters()[0].place
assert isinstance(shape, (tuple, list))
if noise is not None:
img = noise
else:
img = paddle.randn(shape)
if skip_timesteps and init_image is None:
init_image = paddle.zeros_like(img)
indices = list(range(self.num_timesteps - skip_timesteps))[::-1]
if init_image is not None:
my_t = paddle.ones([shape[0]], dtype='int64') * indices[0]
img = self.q_sample(init_image, my_t, img)
if progress:
# Lazy import so that we don't depend on tqdm.
from tqdm.auto import tqdm
indices = tqdm(indices)
old_out = None
for i in indices:
t = paddle.to_tensor([i] * shape[0], place=device)
if randomize_class and 'y' in model_kwargs:
model_kwargs['y'] = paddle.randint(low=0, high=model.num_classes, shape=model_kwargs['y'].shape)
# with paddle.no_grad():
out = self.plms_sample(
model,
img,
t,
clip_denoised=clip_denoised,
denoised_fn=denoised_fn,
cond_fn=cond_fn,
model_kwargs=model_kwargs,
cond_fn_with_grad=cond_fn_with_grad,
order=order,
old_out=old_out,
)
yield out
old_out = out
img = out["sample"]
def _vb_terms_bpd(self, model, x_start, x_t, t, clip_denoised=True, model_kwargs=None):
"""
Get a term for the variational lower-bound.
The resulting units are bits (rather than nats, as one might expect).
This allows for comparison to other papers.
:return: a dict with the following keys:
- 'output': a shape [N] tensor of NLLs or KLs.
- 'pred_xstart': the x_0 predictions.
"""
true_mean, _, true_log_variance_clipped = self.q_posterior_mean_variance(x_start=x_start, x_t=x_t, t=t)
out = self.p_mean_variance(model, x_t, t, clip_denoised=clip_denoised, model_kwargs=model_kwargs)
kl = normal_kl(true_mean, true_log_variance_clipped, out["mean"], out["log_variance"])
kl = mean_flat(kl) / np.log(2.0)
decoder_nll = -discretized_gaussian_log_likelihood(
x_start, means=out["mean"], log_scales=0.5 * out["log_variance"])
assert decoder_nll.shape == x_start.shape
decoder_nll = mean_flat(decoder_nll) / np.log(2.0)
# At the first timestep return the decoder NLL,
# otherwise return KL(q(x_{t-1}|x_t,x_0) || p(x_{t-1}|x_t))
output = paddle.where((t == 0), decoder_nll, kl)
return {"output": output, "pred_xstart": out["pred_xstart"]}
def training_losses(self, model, x_start, t, model_kwargs=None, noise=None):
"""
Compute training losses for a single timestep.
:param model: the model to evaluate loss on.
:param x_start: the [N x C x ...] tensor of inputs.
:param t: a batch of timestep indices.
:param model_kwargs: if not None, a dict of extra keyword arguments to
pass to the model. This can be used for conditioning.
:param noise: if specified, the specific Gaussian noise to try to remove.
:return: a dict with the key "loss" containing a tensor of shape [N].
Some mean or variance settings may also have other keys.
"""
if model_kwargs is None:
model_kwargs = {}
if noise is None:
# noise = th.randn_like(x_start)
noise = paddle.randn(x_start.shape, x_start.dtype)
x_t = self.q_sample(x_start, t, noise=noise)
terms = {}
if self.loss_type == LossType.KL or self.loss_type == LossType.RESCALED_KL:
terms["loss"] = self._vb_terms_bpd(
model=model,
x_start=x_start,
x_t=x_t,
t=t,
clip_denoised=False,
model_kwargs=model_kwargs,
)["output"]
if self.loss_type == LossType.RESCALED_KL:
terms["loss"] *= self.num_timesteps
elif self.loss_type == LossType.MSE or self.loss_type == LossType.RESCALED_MSE:
model_output = model(x_t, self._scale_timesteps(t), **model_kwargs)
if self.model_var_type in [
ModelVarType.LEARNED,
ModelVarType.LEARNED_RANGE,
]:
B, C = x_t.shape[:2]
assert model_output.shape == [B, C * 2, *x_t.shape[2:]]
model_output, model_var_values = paddle.split(model_output, 2, axis=1)
# Learn the variance using the variational bound, but don't let
# it affect our mean prediction.
frozen_out = paddle.concat([model_output.detach(), model_var_values], axis=1)
terms["vb"] = self._vb_terms_bpd(
model=lambda *args, r=frozen_out: r,
x_start=x_start,
x_t=x_t,
t=t,
clip_denoised=False,
)["output"]
if self.loss_type == LossType.RESCALED_MSE:
# Divide by 1000 for equivalence with initial implementation.
# Without a factor of 1/1000, the VB term hurts the MSE term.
terms["vb"] *= self.num_timesteps / 1000.0
target = {
ModelMeanType.PREVIOUS_X: self.q_posterior_mean_variance(x_start=x_start, x_t=x_t, t=t)[0],
ModelMeanType.START_X: x_start,
ModelMeanType.EPSILON: noise,
}[self.model_mean_type]
assert model_output.shape == target.shape == x_start.shape
terms["mse"] = mean_flat((target - model_output)**2)
if "vb" in terms:
terms["loss"] = terms["mse"] + terms["vb"]
else:
terms["loss"] = terms["mse"]
else:
raise NotImplementedError(self.loss_type)
return terms
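# Example (illustrative sketch, not part of the original file; `diffusion`,
# `model`, `x_start` and `batch_size` are hypothetical placeholders): in the
# common epsilon-prediction MSE setup, the returned dict carries one loss per
# batch element:
#
#   t = paddle.randint(0, diffusion.num_timesteps, [batch_size])
#   losses = diffusion.training_losses(model, x_start, t)
#   loss = losses["loss"].mean()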
def _prior_bpd(self, x_start):
"""
Get the prior KL term for the variational lower-bound, measured in
bits-per-dim.
This term can't be optimized, as it only depends on the encoder.
:param x_start: the [N x C x ...] tensor of inputs.
:return: a batch of [N] KL values (in bits), one per batch element.
"""
batch_size = x_start.shape[0]
t = paddle.to_tensor([self.num_timesteps - 1] * batch_size, place=x_start.place)
qt_mean, _, qt_log_variance = self.q_mean_variance(x_start, t)
kl_prior = normal_kl(mean1=qt_mean, logvar1=qt_log_variance, mean2=0.0, logvar2=0.0)
return mean_flat(kl_prior) / np.log(2.0)
def calc_bpd_loop(self, model, x_start, clip_denoised=True, model_kwargs=None):
"""
Compute the entire variational lower-bound, measured in bits-per-dim,
as well as other related quantities.
:param model: the model to evaluate loss on.
:param x_start: the [N x C x ...] tensor of inputs.
:param clip_denoised: if True, clip denoised samples.
:param model_kwargs: if not None, a dict of extra keyword arguments to
pass to the model. This can be used for conditioning.
:return: a dict containing the following keys:
- total_bpd: the total variational lower-bound, per batch element.
- prior_bpd: the prior term in the lower-bound.
- vb: an [N x T] tensor of terms in the lower-bound.
- xstart_mse: an [N x T] tensor of x_0 MSEs for each timestep.
- mse: an [N x T] tensor of epsilon MSEs for each timestep.
"""
device = x_start.place
batch_size = x_start.shape[0]
vb = []
xstart_mse = []
mse = []
for t in list(range(self.num_timesteps))[::-1]:
t_batch = paddle.to_tensor([t] * batch_size, place=device)
# noise = th.randn_like(x_start)
noise = paddle.randn(x_start.shape, x_start.dtype)
x_t = self.q_sample(x_start=x_start, t=t_batch, noise=noise)
# Calculate VLB term at the current timestep
# with paddle.no_grad():
out = self._vb_terms_bpd(
model,
x_start=x_start,
x_t=x_t,
t=t_batch,
clip_denoised=clip_denoised,
model_kwargs=model_kwargs,
)
vb.append(out["output"])
xstart_mse.append(mean_flat((out["pred_xstart"] - x_start)**2))
eps = self._predict_eps_from_xstart(x_t, t_batch, out["pred_xstart"])
mse.append(mean_flat((eps - noise)**2))
vb = paddle.stack(vb, axis=1)
xstart_mse = paddle.stack(xstart_mse, axis=1)
mse = paddle.stack(mse, axis=1)
prior_bpd = self._prior_bpd(x_start)
total_bpd = vb.sum(axis=1) + prior_bpd
return {
"total_bpd": total_bpd,
"prior_bpd": prior_bpd,
"vb": vb,
"xstart_mse": xstart_mse,
"mse": mse,
}
def _extract_into_tensor(arr, timesteps, broadcast_shape):
"""
Extract values from a 1-D numpy array for a batch of indices.
:param arr: the 1-D numpy array.
:param timesteps: a tensor of indices into the array to extract.
:param broadcast_shape: a larger shape of K dimensions with the batch
dimension equal to the length of timesteps.
:return: a tensor of shape [batch_size, 1, ...] where the shape has K dims.
"""
res = paddle.to_tensor(arr, place=timesteps.place)[timesteps]
while len(res.shape) < len(broadcast_shape):
res = res[..., None]
return res.expand(broadcast_shape)
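# Example (illustrative sketch, not part of the original file): gather one
# schedule coefficient per batch element and broadcast it against an image
# batch:
#
#   betas = np.linspace(1e-4, 2e-2, 1000)              # [T] schedule
#   t = paddle.to_tensor([0, 999])                     # [N] timesteps
#   coef = _extract_into_tensor(betas, t, (2, 3, 64, 64))
#   # coef has shape [2, 3, 64, 64]; coef[i] equals betas[t[i]] everywhere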
"""
Helpers for various likelihood-based losses, implemented in Paddle. These are ported from the original
Ho et al. diffusion models codebase:
https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/utils.py
"""
import numpy as np
import paddle
import paddle.nn.functional as F
def normal_kl(mean1, logvar1, mean2, logvar2):
"""
Compute the KL divergence between two gaussians.
Shapes are automatically broadcasted, so batches can be compared to
scalars, among other use cases.
"""
tensor = None
for obj in (mean1, logvar1, mean2, logvar2):
if isinstance(obj, paddle.Tensor):
tensor = obj
break
assert tensor is not None, "at least one argument must be a Tensor"
# Force variances to be Tensors. Broadcasting helps convert scalars to
# Tensors, but it does not work for paddle.exp().
logvar1, logvar2 = [x if isinstance(x, paddle.Tensor) else paddle.to_tensor(x) for x in (logvar1, logvar2)]
return 0.5 * (-1.0 + logvar2 - logvar1 + paddle.exp(logvar1 - logvar2) +
((mean1 - mean2)**2) * paddle.exp(-logvar2))
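# Example (illustrative sketch, not part of the original file): the KL between
# identical Gaussians is zero, and broadcasting lets a batch be compared
# against the scalar parameters of a standard normal:
#
#   mean, logvar = paddle.zeros([4]), paddle.zeros([4])
#   kl = normal_kl(mean, logvar, 0.0, 0.0)   # shape [4], all zeros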
def approx_standard_normal_cdf(x):
"""
A fast approximation of the cumulative distribution function of the
standard normal.
"""
return 0.5 * (1.0 + paddle.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * paddle.pow(x, 3))))
def discretized_gaussian_log_likelihood(x, *, means, log_scales):
"""
Compute the log-likelihood of a Gaussian distribution discretizing to a
given image.
:param x: the target images. It is assumed that this was uint8 values,
rescaled to the range [-1, 1].
:param means: the Gaussian mean Tensor.
:param log_scales: the Gaussian log stddev Tensor.
:return: a tensor like x of log probabilities (in nats).
"""
assert x.shape == means.shape == log_scales.shape
centered_x = x - means
inv_stdv = paddle.exp(-log_scales)
plus_in = inv_stdv * (centered_x + 1.0 / 255.0)
cdf_plus = approx_standard_normal_cdf(plus_in)
min_in = inv_stdv * (centered_x - 1.0 / 255.0)
cdf_min = approx_standard_normal_cdf(min_in)
log_cdf_plus = paddle.log(cdf_plus.clip(min=1e-12))
log_one_minus_cdf_min = paddle.log((1.0 - cdf_min).clip(min=1e-12))
cdf_delta = cdf_plus - cdf_min
log_probs = paddle.where(
x < -0.999,
log_cdf_plus,
paddle.where(x > 0.999, log_one_minus_cdf_min, paddle.log(cdf_delta.clip(min=1e-12))),
)
assert log_probs.shape == x.shape
return log_probs
def spherical_dist_loss(x, y):
x = F.normalize(x, axis=-1)
y = F.normalize(y, axis=-1)
return (x - y).norm(axis=-1).divide(paddle.to_tensor(2.0)).asin().pow(2).multiply(paddle.to_tensor(2.0))
def tv_loss(input):
"""L2 total variation loss, as in Mahendran et al."""
input = F.pad(input, (0, 1, 0, 1), 'replicate')
x_diff = input[..., :-1, 1:] - input[..., :-1, :-1]
y_diff = input[..., 1:, :-1] - input[..., :-1, :-1]
return (x_diff**2 + y_diff**2).mean([1, 2, 3])
def range_loss(input):
return (input - input.clip(-1, 1)).pow(2).mean([1, 2, 3])
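# Example (illustrative sketch, not part of the original file): both
# regularizers reduce an [N, C, H, W] batch to one value per sample, and
# range_loss vanishes for images already inside [-1, 1]:
#
#   img = paddle.rand([2, 3, 64, 64]) * 2 - 1   # values in [-1, 1]
#   assert float(range_loss(img).sum()) == 0.0
#   tv = tv_loss(img)                            # shape [2], non-negative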
'''
This code is rewritten in Paddle based on jina-ai/discoart.
https://github.com/jina-ai/discoart/blob/main/discoart/nn/make_cutouts.py
'''
import math
import paddle
import paddle.nn as nn
from disco_diffusion_ernievil_base.resize_right.resize_right import resize
from paddle.nn import functional as F
from . import transforms as T
skip_augs = False # @param{type: 'boolean'}
def sinc(x):
return paddle.where(x != 0, paddle.sin(math.pi * x) / (math.pi * x), paddle.ones_like(x))
def lanczos(x, a):
cond = paddle.logical_and(-a < x, x < a)
out = paddle.where(cond, sinc(x) * sinc(x / a), paddle.zeros_like(x))
return out / out.sum()
def ramp(ratio, width):
n = math.ceil(width / ratio + 1)
out = paddle.empty([n])
cur = 0
for i in range(out.shape[0]):
out[i] = cur
cur += ratio
return paddle.concat([-out[1:].flip([0]), out])[1:-1]
class MakeCutouts(nn.Layer):
def __init__(self, cut_size, cutn, skip_augs=False):
super().__init__()
self.cut_size = cut_size
self.cutn = cutn
self.skip_augs = skip_augs
self.augs = nn.Sequential(*[
T.RandomHorizontalFlip(prob=0.5),
T.Lambda(lambda x: x + paddle.randn(x.shape) * 0.01),
T.RandomAffine(degrees=15, translate=(0.1, 0.1)),
T.Lambda(lambda x: x + paddle.randn(x.shape) * 0.01),
T.RandomPerspective(distortion_scale=0.4, p=0.7),
T.Lambda(lambda x: x + paddle.randn(x.shape) * 0.01),
T.RandomGrayscale(p=0.15),
T.Lambda(lambda x: x + paddle.randn(x.shape) * 0.01),
T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1),
])
def forward(self, input):
input = T.Pad(input.shape[2] // 4, fill=0)(input)
sideY, sideX = input.shape[2:4]
max_size = min(sideX, sideY)
cutouts = []
for ch in range(self.cutn):
if ch > self.cutn - self.cutn // 4:
cutout = input.clone()
else:
size = int(max_size *
           paddle.normal(mean=0.8, std=0.3, shape=[1]).clip(float(self.cut_size / max_size), 1.0))
offsetx = paddle.randint(0, abs(sideX - size + 1), ())
offsety = paddle.randint(0, abs(sideY - size + 1), ())
cutout = input[:, :, offsety:offsety + size, offsetx:offsetx + size]
if not self.skip_augs:
cutout = self.augs(cutout)
cutouts.append(resample(cutout, (self.cut_size, self.cut_size)))
del cutout
cutouts = paddle.concat(cutouts, axis=0)
return cutouts
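# Example usage (illustrative sketch, not part of the original file; assumes
# the transforms referenced in self.augs are available): cut an image batch
# into `cutn` random square crops resampled to the guidance model's input
# size:
#
#   cutter = MakeCutouts(cut_size=224, cutn=16)
#   cuts = cutter(paddle.rand([1, 3, 512, 512]))   # shape [16, 3, 224, 224]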
class MakeCutoutsDango(nn.Layer):
def __init__(self, cut_size, Overview=4, InnerCrop=0, IC_Size_Pow=0.5, IC_Grey_P=0.2):
super().__init__()
self.cut_size = cut_size
self.Overview = Overview
self.InnerCrop = InnerCrop
self.IC_Size_Pow = IC_Size_Pow
self.IC_Grey_P = IC_Grey_P
self.augs = nn.Sequential(*[
T.RandomHorizontalFlip(prob=0.5),
T.Lambda(lambda x: x + paddle.randn(x.shape) * 0.01),
T.RandomAffine(
degrees=10,
translate=(0.05, 0.05),
interpolation=T.InterpolationMode.BILINEAR,
),
T.Lambda(lambda x: x + paddle.randn(x.shape) * 0.01),
T.RandomGrayscale(p=0.1),
T.Lambda(lambda x: x + paddle.randn(x.shape) * 0.01),
T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1),
])
def forward(self, input):
cutouts = []
gray = T.Grayscale(3)
sideY, sideX = input.shape[2:4]
max_size = min(sideX, sideY)
min_size = min(sideX, sideY, self.cut_size)
output_shape = [1, 3, self.cut_size, self.cut_size]
pad_input = F.pad(
input,
(
(sideY - max_size) // 2,
(sideY - max_size) // 2,
(sideX - max_size) // 2,
(sideX - max_size) // 2,
),
**padargs,
)
cutout = resize(pad_input, out_shape=output_shape)
if self.Overview > 0:
if self.Overview <= 4:
if self.Overview >= 1:
cutouts.append(cutout)
if self.Overview >= 2:
cutouts.append(gray(cutout))
if self.Overview >= 3:
cutouts.append(cutout[:, :, :, ::-1])
if self.Overview == 4:
cutouts.append(gray(cutout[:, :, :, ::-1]))
else:
cutout = resize(pad_input, out_shape=output_shape)
for _ in range(self.Overview):
cutouts.append(cutout)
if self.InnerCrop > 0:
for i in range(self.InnerCrop):
size = int(paddle.rand([1])**self.IC_Size_Pow * (max_size - min_size) + min_size)
offsetx = paddle.randint(0, sideX - size + 1)
offsety = paddle.randint(0, sideY - size + 1)
cutout = input[:, :, offsety:offsety + size, offsetx:offsetx + size]
if i <= int(self.IC_Grey_P * self.InnerCrop):
cutout = gray(cutout)
cutout = resize(cutout, out_shape=output_shape)
cutouts.append(cutout)
cutouts = paddle.concat(cutouts)
if not skip_augs:
cutouts = self.augs(cutouts)
return cutouts
def resample(input, size, align_corners=True):
n, c, h, w = input.shape
dh, dw = size
input = input.reshape([n * c, 1, h, w])
if dh < h:
kernel_h = lanczos(ramp(dh / h, 2), 2).astype(input.dtype)
pad_h = (kernel_h.shape[0] - 1) // 2
input = F.pad(input, (0, 0, pad_h, pad_h), 'reflect')
input = F.conv2d(input, kernel_h[None, None, :, None])
if dw < w:
kernel_w = lanczos(ramp(dw / w, 2), 2).astype(input.dtype)
pad_w = (kernel_w.shape[0] - 1) // 2
input = F.pad(input, (pad_w, pad_w, 0, 0), 'reflect')
input = F.conv2d(input, kernel_w[None, None, None, :])
input = input.reshape([n, c, h, w])
return F.interpolate(input, size, mode='bicubic', align_corners=align_corners)
padargs = {}
"""
Various utilities for neural networks, implemented in Paddle. This code is rewritten based on:
https://github.com/openai/guided-diffusion/blob/main/guided_diffusion/nn.py
"""
import math
import paddle
import paddle.nn as nn
class SiLU(nn.Layer):
def forward(self, x):
return x * nn.functional.sigmoid(x)
class GroupNorm32(nn.GroupNorm):
def forward(self, x):
return super().forward(x)
def conv_nd(dims, *args, **kwargs):
"""
Create a 1D, 2D, or 3D convolution module.
"""
if dims == 1:
return nn.Conv1D(*args, **kwargs)
elif dims == 2:
return nn.Conv2D(*args, **kwargs)
elif dims == 3:
return nn.Conv3D(*args, **kwargs)
raise ValueError(f"unsupported dimensions: {dims}")
def linear(*args, **kwargs):
"""
Create a linear module.
"""
return nn.Linear(*args, **kwargs)
def avg_pool_nd(dims, *args, **kwargs):
"""
Create a 1D, 2D, or 3D average pooling module.
"""
if dims == 1:
return nn.AvgPool1D(*args, **kwargs)
elif dims == 2:
return nn.AvgPool2D(*args, **kwargs)
elif dims == 3:
return nn.AvgPool3D(*args, **kwargs)
raise ValueError(f"unsupported dimensions: {dims}")
def update_ema(target_params, source_params, rate=0.99):
"""
Update target parameters to be closer to those of source parameters using
an exponential moving average.
:param target_params: the target parameter sequence.
:param source_params: the source parameter sequence.
:param rate: the EMA rate (closer to 1 means slower).
"""
for targ, src in zip(target_params, source_params):
targ.detach().scale_(rate).add_(src.detach() * (1 - rate))
def zero_module(module):
"""
Zero out the parameters of a module and return it.
"""
for p in module.parameters():
p.detach().zero_()
return module
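# Example usage (illustrative sketch, not part of the original file): the UNet
# zero-initializes the last convolution of each residual branch so the block
# starts out as an identity map:
#
#   out_conv = zero_module(conv_nd(2, 128, 128, 3, padding=1))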
def scale_module(module, scale):
"""
Scale the parameters of a module and return it.
"""
for p in module.parameters():
p.detach().scale_(scale)
return module
def mean_flat(tensor):
"""
Take the mean over all non-batch dimensions.
"""
return tensor.mean(axis=list(range(1, len(tensor.shape))))
def normalization(channels):
"""
Make a standard normalization layer.
:param channels: number of input channels.
:return: an nn.Module for normalization.
"""
return GroupNorm32(32, channels)
def timestep_embedding(timesteps, dim, max_period=10000):
"""
Create sinusoidal timestep embeddings.
:param timesteps: a 1-D Tensor of N indices, one per batch element.
These may be fractional.
:param dim: the dimension of the output.
:param max_period: controls the minimum frequency of the embeddings.
:return: an [N x dim] Tensor of positional embeddings.
"""
half = dim // 2
freqs = paddle.exp(-math.log(max_period) * paddle.arange(start=0, end=half, dtype=paddle.float32) / half)
args = paddle.cast(timesteps[:, None], 'float32') * freqs[None]
embedding = paddle.concat([paddle.cos(args), paddle.sin(args)], axis=-1)
if dim % 2:
embedding = paddle.concat([embedding, paddle.zeros_like(embedding[:, :1])], axis=-1)
return embedding
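# Example (illustrative sketch, not part of the original file): embeddings
# have shape [N, dim]; nearby timesteps map to nearby vectors:
#
#   t = paddle.to_tensor([0, 10, 100])
#   emb = timestep_embedding(t, dim=128)   # shape [3, 128]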
def checkpoint(func, inputs, params, flag):
"""
Gradient checkpointing is disabled in this port; the call simply evaluates func(*inputs) directly.
"""
return func(*inputs)
'''
Perlin noise implementation in Paddle.
This code is rewritten based on:
https://github.com/jina-ai/discoart/blob/main/discoart/nn/perlin_noises.py
'''
import numpy as np
import paddle
import paddle.vision.transforms as TF
from PIL import Image
from PIL import ImageOps
def interp(t):
return 3 * t**2 - 2 * t**3
def perlin(width, height, scale=10):
gx, gy = paddle.randn([2, width + 1, height + 1, 1, 1])
xs = paddle.linspace(0, 1, scale + 1)[:-1, None]
ys = paddle.linspace(0, 1, scale + 1)[None, :-1]
wx = 1 - interp(xs)
wy = 1 - interp(ys)
dots = 0
dots += wx * wy * (gx[:-1, :-1] * xs + gy[:-1, :-1] * ys)
dots += (1 - wx) * wy * (-gx[1:, :-1] * (1 - xs) + gy[1:, :-1] * ys)
dots += wx * (1 - wy) * (gx[:-1, 1:] * xs - gy[:-1, 1:] * (1 - ys))
dots += (1 - wx) * (1 - wy) * (-gx[1:, 1:] * (1 - xs) - gy[1:, 1:] * (1 - ys))
return dots.transpose([0, 2, 1, 3]).reshape([width * scale, height * scale])
def perlin_ms(octaves, width, height, grayscale):
out_array = [0.5] if grayscale else [0.5, 0.5, 0.5]
# out_array = [0.0] if grayscale else [0.0, 0.0, 0.0]
for i in range(1 if grayscale else 3):
scale = 2**len(octaves)
oct_width = width
oct_height = height
for oct in octaves:
p = perlin(oct_width, oct_height, scale)
out_array[i] += p * oct
scale //= 2
oct_width *= 2
oct_height *= 2
return paddle.concat(out_array)
def create_perlin_noise(octaves, width, height, grayscale, side_y, side_x):
out = perlin_ms(octaves, width, height, grayscale)
if grayscale:
out = TF.resize(size=(side_y, side_x), img=out.numpy())
out = np.uint8(out.clip(0, 1) * 255)
out = Image.fromarray(out).convert('RGB')
else:
out = out.reshape([-1, 3, out.shape[0] // 3, out.shape[1]])
out = out.squeeze().transpose([1, 2, 0]).numpy()
out = TF.resize(size=(side_y, side_x), img=out)
out = out.clip(0, 1) * 255
out = np.uint8(out)
out = Image.fromarray(out)
out = ImageOps.autocontrast(out)
return out
def regen_perlin(perlin_mode, side_y, side_x, batch_size):
if perlin_mode == 'color':
init = create_perlin_noise([1.5**-i * 0.5 for i in range(12)], 1, 1, False, side_y, side_x)
init2 = create_perlin_noise([1.5**-i * 0.5 for i in range(8)], 4, 4, False, side_y, side_x)
elif perlin_mode == 'gray':
init = create_perlin_noise([1.5**-i * 0.5 for i in range(12)], 1, 1, True, side_y, side_x)
init2 = create_perlin_noise([1.5**-i * 0.5 for i in range(8)], 4, 4, True, side_y, side_x)
else:
init = create_perlin_noise([1.5**-i * 0.5 for i in range(12)], 1, 1, False, side_y, side_x)
init2 = create_perlin_noise([1.5**-i * 0.5 for i in range(8)], 4, 4, True, side_y, side_x)
init = (TF.to_tensor(init).add(TF.to_tensor(init2)).divide(paddle.to_tensor(2.0)).unsqueeze(0) * 2 - 1)
del init2
return init.expand([batch_size, -1, -1, -1])
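# Example usage (illustrative sketch, not part of the original file): build a
# batch of perlin init images in [-1, 1] to seed the diffusion instead of pure
# Gaussian noise (any mode other than 'color'/'gray' selects the mixed branch):
#
#   init = regen_perlin('mixed', side_y=256, side_x=256, batch_size=2)
#   # init has shape [2, 3, 256, 256]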
'''
This code is rewritten in Paddle based on
https://github.com/openai/guided-diffusion/blob/main/guided_diffusion/respace.py
'''
import numpy as np
import paddle
from .gaussian_diffusion import GaussianDiffusion
def space_timesteps(num_timesteps, section_counts):
"""
Create a list of timesteps to use from an original diffusion process,
given the number of timesteps we want to take from equally-sized portions
of the original process.
For example, if there are 300 timesteps and the section counts are [10, 15, 20],
then the first 100 timesteps are strided to be 10 timesteps, the second 100
are strided to be 15 timesteps, and the final 100 are strided to be 20.
If the stride is a string starting with "ddim", then the fixed striding
from the DDIM paper is used, and only one section is allowed.
:param num_timesteps: the number of diffusion steps in the original
process to divide up.
:param section_counts: either a list of numbers, or a string containing
comma-separated numbers, indicating the step count
per section. As a special case, use "ddimN" where N
is a number of steps to use the striding from the
DDIM paper.
:return: a set of diffusion steps from the original process to use.
"""
if isinstance(section_counts, str):
if section_counts.startswith("ddim"):
desired_count = int(section_counts[len("ddim"):])
for i in range(1, num_timesteps):
if len(range(0, num_timesteps, i)) == desired_count:
return set(range(0, num_timesteps, i))
raise ValueError(f"cannot create exactly {num_timesteps} steps with an integer stride")
section_counts = [int(x) for x in section_counts.split(",")]
size_per = num_timesteps // len(section_counts)
extra = num_timesteps % len(section_counts)
start_idx = 0
all_steps = []
for i, section_count in enumerate(section_counts):
size = size_per + (1 if i < extra else 0)
if size < section_count:
raise ValueError(f"cannot divide section of {size} steps into {section_count}")
if section_count <= 1:
frac_stride = 1
else:
frac_stride = (size - 1) / (section_count - 1)
cur_idx = 0.0
taken_steps = []
for _ in range(section_count):
taken_steps.append(start_idx + round(cur_idx))
cur_idx += frac_stride
all_steps += taken_steps
start_idx += size
return set(all_steps)
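# Example (illustrative sketch, not part of the original file): respacing 1000
# training steps down to a shorter sampling schedule:
#
#   steps = space_timesteps(1000, [250])      # 250 evenly spread steps
#   ddim = space_timesteps(1000, "ddim50")    # 50 steps with a fixed stride
#   assert len(ddim) == 50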
class SpacedDiffusion(GaussianDiffusion):
"""
A diffusion process which can skip steps in a base diffusion process.
:param use_timesteps: a collection (sequence or set) of timesteps from the
original diffusion process to retain.
:param kwargs: the kwargs to create the base diffusion process.
"""
def __init__(self, use_timesteps, **kwargs):
self.use_timesteps = set(use_timesteps)
self.timestep_map = []
self.original_num_steps = len(kwargs["betas"])
base_diffusion = GaussianDiffusion(**kwargs) # pylint: disable=missing-kwoa
last_alpha_cumprod = 1.0
new_betas = []
for i, alpha_cumprod in enumerate(base_diffusion.alphas_cumprod):
if i in self.use_timesteps:
new_betas.append(1 - alpha_cumprod / last_alpha_cumprod)
last_alpha_cumprod = alpha_cumprod
self.timestep_map.append(i)
kwargs["betas"] = np.array(new_betas)
super().__init__(**kwargs)
def p_mean_variance(self, model, *args, **kwargs): # pylint: disable=signature-differs
return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
def training_losses(self, model, *args, **kwargs): # pylint: disable=signature-differs
return super().training_losses(self._wrap_model(model), *args, **kwargs)
def condition_mean(self, cond_fn, *args, **kwargs):
return super().condition_mean(self._wrap_model(cond_fn), *args, **kwargs)
def condition_score(self, cond_fn, *args, **kwargs):
return super().condition_score(self._wrap_model(cond_fn), *args, **kwargs)
def _wrap_model(self, model):
if isinstance(model, _WrappedModel):
return model
return _WrappedModel(model, self.timestep_map, self.rescale_timesteps, self.original_num_steps)
def _scale_timesteps(self, t):
# Scaling is done by the wrapped model.
return t
class _WrappedModel:
def __init__(self, model, timestep_map, rescale_timesteps, original_num_steps):
self.model = model
self.timestep_map = timestep_map
self.rescale_timesteps = rescale_timesteps
self.original_num_steps = original_num_steps
def __call__(self, x, ts, **kwargs):
map_tensor = paddle.to_tensor(self.timestep_map, place=ts.place, dtype=ts.dtype)
new_ts = map_tensor[ts]
if self.rescale_timesteps:
new_ts = paddle.cast(new_ts, 'float32') * (1000.0 / self.original_num_steps)
return self.model(x, new_ts, **kwargs)
'''
This code is based on
https://github.com/openai/guided-diffusion/blob/main/guided_diffusion/script_util.py
'''
import argparse
import inspect
from . import gaussian_diffusion as gd
from .respace import space_timesteps
from .respace import SpacedDiffusion
from .unet import EncoderUNetModel
from .unet import SuperResModel
from .unet import UNetModel
NUM_CLASSES = 1000
def diffusion_defaults():
"""
Defaults for image and classifier training.
"""
return dict(
learn_sigma=False,
diffusion_steps=1000,
noise_schedule="linear",
timestep_respacing="",
use_kl=False,
predict_xstart=False,
rescale_timesteps=False,
rescale_learned_sigmas=False,
)
def model_and_diffusion_defaults():
"""
Defaults for image training.
"""
res = dict(
image_size=64,
num_channels=128,
num_res_blocks=2,
num_heads=4,
num_heads_upsample=-1,
num_head_channels=-1,
attention_resolutions="16,8",
channel_mult="",
dropout=0.0,
class_cond=False,
use_checkpoint=False,
use_scale_shift_norm=True,
resblock_updown=False,
use_fp16=False,
use_new_attention_order=False,
)
res.update(diffusion_defaults())
return res
def create_model_and_diffusion(
image_size,
class_cond,
learn_sigma,
num_channels,
num_res_blocks,
channel_mult,
num_heads,
num_head_channels,
num_heads_upsample,
attention_resolutions,
dropout,
diffusion_steps,
noise_schedule,
timestep_respacing,
use_kl,
predict_xstart,
rescale_timesteps,
rescale_learned_sigmas,
use_checkpoint,
use_scale_shift_norm,
resblock_updown,
use_fp16,
use_new_attention_order,
):
model = create_model(
image_size,
num_channels,
num_res_blocks,
channel_mult=channel_mult,
learn_sigma=learn_sigma,
class_cond=class_cond,
use_checkpoint=use_checkpoint,
attention_resolutions=attention_resolutions,
num_heads=num_heads,
num_head_channels=num_head_channels,
num_heads_upsample=num_heads_upsample,
use_scale_shift_norm=use_scale_shift_norm,
dropout=dropout,
resblock_updown=resblock_updown,
use_fp16=use_fp16,
use_new_attention_order=use_new_attention_order,
)
diffusion = create_gaussian_diffusion(
steps=diffusion_steps,
learn_sigma=learn_sigma,
noise_schedule=noise_schedule,
use_kl=use_kl,
predict_xstart=predict_xstart,
rescale_timesteps=rescale_timesteps,
rescale_learned_sigmas=rescale_learned_sigmas,
timestep_respacing=timestep_respacing,
)
return model, diffusion
def create_model(
image_size,
num_channels,
num_res_blocks,
channel_mult="",
learn_sigma=False,
class_cond=False,
use_checkpoint=False,
attention_resolutions="16",
num_heads=1,
num_head_channels=-1,
num_heads_upsample=-1,
use_scale_shift_norm=False,
dropout=0,
resblock_updown=False,
use_fp16=False,
use_new_attention_order=False,
):
if channel_mult == "":
if image_size == 512:
channel_mult = (0.5, 1, 1, 2, 2, 4, 4)
elif image_size == 256:
channel_mult = (1, 1, 2, 2, 4, 4)
elif image_size == 128:
channel_mult = (1, 1, 2, 3, 4)
elif image_size == 64:
channel_mult = (1, 2, 3, 4)
else:
raise ValueError(f"unsupported image size: {image_size}")
else:
channel_mult = tuple(int(ch_mult) for ch_mult in channel_mult.split(","))
attention_ds = []
for res in attention_resolutions.split(","):
attention_ds.append(image_size // int(res))
return UNetModel(
image_size=image_size,
in_channels=3,
model_channels=num_channels,
out_channels=(3 if not learn_sigma else 6),
num_res_blocks=num_res_blocks,
attention_resolutions=tuple(attention_ds),
dropout=dropout,
channel_mult=channel_mult,
num_classes=(NUM_CLASSES if class_cond else None),
use_checkpoint=use_checkpoint,
use_fp16=use_fp16,
num_heads=num_heads,
num_head_channels=num_head_channels,
num_heads_upsample=num_heads_upsample,
use_scale_shift_norm=use_scale_shift_norm,
resblock_updown=resblock_updown,
use_new_attention_order=use_new_attention_order,
)
def create_gaussian_diffusion(
*,
steps=1000,
learn_sigma=False,
sigma_small=False,
noise_schedule="linear",
use_kl=False,
predict_xstart=False,
rescale_timesteps=False,
rescale_learned_sigmas=False,
timestep_respacing="",
):
betas = gd.get_named_beta_schedule(noise_schedule, steps)
if use_kl:
loss_type = gd.LossType.RESCALED_KL
elif rescale_learned_sigmas:
loss_type = gd.LossType.RESCALED_MSE
else:
loss_type = gd.LossType.MSE
if not timestep_respacing:
timestep_respacing = [steps]
return SpacedDiffusion(
use_timesteps=space_timesteps(steps, timestep_respacing),
betas=betas,
model_mean_type=(gd.ModelMeanType.EPSILON if not predict_xstart else gd.ModelMeanType.START_X),
model_var_type=((gd.ModelVarType.FIXED_LARGE if not sigma_small else gd.ModelVarType.FIXED_SMALL)
if not learn_sigma else gd.ModelVarType.LEARNED_RANGE),
loss_type=loss_type,
rescale_timesteps=rescale_timesteps,
)
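# Example usage (illustrative sketch, not part of the original file): a
# 1000-step process with learned variances, respaced to 250 sampling steps as
# in the guided-diffusion checkpoints:
#
#   diffusion = create_gaussian_diffusion(
#       steps=1000,
#       learn_sigma=True,
#       noise_schedule="linear",
#       timestep_respacing="250",
#   )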
'''
This code is rewritten in Paddle based on
https://github.com/jina-ai/discoart/blob/main/discoart/nn/sec_diff.py
'''
import math
from dataclasses import dataclass
from functools import partial
import paddle
import paddle.nn as nn
@dataclass
class DiffusionOutput:
v: paddle.Tensor
pred: paddle.Tensor
eps: paddle.Tensor
class SkipBlock(nn.Layer):
def __init__(self, main, skip=None):
super().__init__()
self.main = nn.Sequential(*main)
self.skip = skip if skip else nn.Identity()
def forward(self, input):
return paddle.concat([self.main(input), self.skip(input)], axis=1)
def append_dims(x, n):
return x[(Ellipsis, *(None, ) * (n - x.ndim))]
def expand_to_planes(x, shape):
return paddle.tile(append_dims(x, len(shape)), [1, 1, *shape[2:]])
def alpha_sigma_to_t(alpha, sigma):
return paddle.atan2(sigma, alpha) * 2 / math.pi
def t_to_alpha_sigma(t):
return paddle.cos(t * math.pi / 2), paddle.sin(t * math.pi / 2)
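# Example (illustrative sketch, not part of the original file): t = 0 is all
# signal (alpha=1, sigma=0), t = 1 is all noise, and the two maps invert each
# other:
#
#   alpha, sigma = t_to_alpha_sigma(paddle.to_tensor(0.25))
#   t = alpha_sigma_to_t(alpha, sigma)   # recovers 0.25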
class SecondaryDiffusionImageNet2(nn.Layer):
def __init__(self):
super().__init__()
c = 64 # The base channel count
cs = [c, c * 2, c * 2, c * 4, c * 4, c * 8]
self.timestep_embed = FourierFeatures(1, 16)
self.down = nn.AvgPool2D(2)
self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
self.net = nn.Sequential(
ConvBlock(3 + 16, cs[0]),
ConvBlock(cs[0], cs[0]),
SkipBlock([
self.down,
ConvBlock(cs[0], cs[1]),
ConvBlock(cs[1], cs[1]),
SkipBlock([
self.down,
ConvBlock(cs[1], cs[2]),
ConvBlock(cs[2], cs[2]),
SkipBlock([
self.down,
ConvBlock(cs[2], cs[3]),
ConvBlock(cs[3], cs[3]),
SkipBlock([
self.down,
ConvBlock(cs[3], cs[4]),
ConvBlock(cs[4], cs[4]),
SkipBlock([
self.down,
ConvBlock(cs[4], cs[5]),
ConvBlock(cs[5], cs[5]),
ConvBlock(cs[5], cs[5]),
ConvBlock(cs[5], cs[4]),
self.up,
]),
ConvBlock(cs[4] * 2, cs[4]),
ConvBlock(cs[4], cs[3]),
self.up,
]),
ConvBlock(cs[3] * 2, cs[3]),
ConvBlock(cs[3], cs[2]),
self.up,
]),
ConvBlock(cs[2] * 2, cs[2]),
ConvBlock(cs[2], cs[1]),
self.up,
]),
ConvBlock(cs[1] * 2, cs[1]),
ConvBlock(cs[1], cs[0]),
self.up,
]),
ConvBlock(cs[0] * 2, cs[0]),
nn.Conv2D(cs[0], 3, 3, padding=1),
)
def forward(self, input, t):
timestep_embed = expand_to_planes(self.timestep_embed(t[:, None]), input.shape)
v = self.net(paddle.concat([input, timestep_embed], axis=1))
alphas, sigmas = map(partial(append_dims, n=v.ndim), t_to_alpha_sigma(t))
pred = input * alphas - v * sigmas
eps = input * sigmas + v * alphas
return DiffusionOutput(v, pred, eps)
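# Example usage (illustrative sketch, not part of the original file): the
# secondary model predicts v, from which the denoised image and the noise are
# recovered:
#
#   net = SecondaryDiffusionImageNet2()
#   out = net(paddle.rand([1, 3, 64, 64]), paddle.rand([1]))
#   # out.pred is the denoised estimate, out.eps the predicted noise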
class FourierFeatures(nn.Layer):
def __init__(self, in_features, out_features, std=1.0):
super().__init__()
assert out_features % 2 == 0
# self.weight = nn.Parameter(paddle.randn([out_features // 2, in_features]) * std)
self.weight = paddle.create_parameter([out_features // 2, in_features],
dtype='float32',
default_initializer=nn.initializer.Normal(mean=0.0, std=std))
def forward(self, input):
f = 2 * math.pi * input @ self.weight.T
return paddle.concat([f.cos(), f.sin()], axis=-1)
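# Example (illustrative sketch, not part of the original file): map a scalar
# timestep in [0, 1] to a random Fourier embedding:
#
#   ff = FourierFeatures(in_features=1, out_features=16)
#   emb = ff(paddle.rand([8, 1]))   # shape [8, 16]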
class ConvBlock(nn.Sequential):
def __init__(self, c_in, c_out):
super().__init__(
nn.Conv2D(c_in, c_out, 3, padding=1),
nn.ReLU(),
)
'''
This code is rewritten in Paddle based on
https://github.com/pytorch/vision/blob/main/torchvision/transforms/transforms.py
'''
import math
import numbers
import warnings
from enum import Enum
from typing import Any
from typing import Dict
from typing import List
from typing import Optional
from typing import Sequence
from typing import Tuple
from typing import Union
import paddle
import paddle.nn as nn
from paddle import Tensor
from paddle.nn import functional as F
from paddle.nn.functional import grid_sample
from paddle.vision import transforms as T
class Normalize(nn.Layer):
def __init__(self, mean, std):
super(Normalize, self).__init__()
self.mean = paddle.to_tensor(mean)
self.std = paddle.to_tensor(std)
def forward(self, tensor: Tensor):
dtype = tensor.dtype
mean = paddle.to_tensor(self.mean, dtype=dtype)
std = paddle.to_tensor(self.std, dtype=dtype)
mean = mean.reshape([1, -1, 1, 1])
std = std.reshape([1, -1, 1, 1])
result = tensor.subtract(mean).divide(std)
return result
class InterpolationMode(Enum):
"""Interpolation modes
Available interpolation methods are ``nearest``, ``bilinear``, ``bicubic``, ``box``, ``hamming``, and ``lanczos``.
"""
NEAREST = "nearest"
BILINEAR = "bilinear"
BICUBIC = "bicubic"
# For PIL compatibility
BOX = "box"
HAMMING = "hamming"
LANCZOS = "lanczos"
class Grayscale(nn.Layer):
def __init__(self, num_output_channels):
super(Grayscale, self).__init__()
self.num_output_channels = num_output_channels
def forward(self, x):
output = (0.2989 * x[:, 0:1, :, :] + 0.587 * x[:, 1:2, :, :] + 0.114 * x[:, 2:3, :, :])
if self.num_output_channels == 3:
return output.expand(x.shape)
return output
class Lambda(nn.Layer):
def __init__(self, func):
super(Lambda, self).__init__()
self.transform = func
def forward(self, x):
return self.transform(x)
class RandomGrayscale(nn.Layer):
def __init__(self, p):
super(RandomGrayscale, self).__init__()
self.prob = p
self.transform = Grayscale(3)
def forward(self, x):
if paddle.rand([1]) < self.prob:
return self.transform(x)
else:
return x
class RandomHorizontalFlip(nn.Layer):
def __init__(self, prob):
super(RandomHorizontalFlip, self).__init__()
self.prob = prob
def forward(self, x):
if paddle.rand([1]) < self.prob:
return x[:, :, :, ::-1]
else:
return x
def _blend(img1: Tensor, img2: Tensor, ratio: float) -> Tensor:
ratio = float(ratio)
bound = 1.0
return (ratio * img1 + (1.0 - ratio) * img2).clip(0, bound)
def trunc_div(a, b):
ipt = paddle.divide(a, b)
sign_ipt = paddle.sign(ipt)
abs_ipt = paddle.abs(ipt)
abs_ipt = paddle.floor(abs_ipt)
out = paddle.multiply(sign_ipt, abs_ipt)
return out
def fmod(a, b):
return a - trunc_div(a, b) * b
def _rgb2hsv(img: Tensor) -> Tensor:
r, g, b = img.unbind(axis=-3)
# Implementation is based on https://github.com/python-pillow/Pillow/blob/4174d4267616897df3746d315d5a2d0f82c656ee/
# src/libImaging/Convert.c#L330
maxc = paddle.max(img, axis=-3)
minc = paddle.min(img, axis=-3)
# The algorithm erases S and H channel where `maxc = minc`. This avoids NaN
# from happening in the results, because
# + S channel has division by `maxc`, which is zero only if `maxc = minc`
# + H channel has division by `(maxc - minc)`.
#
# Instead of overwriting NaN afterwards, we just prevent it from occurring so
# we don't need to deal with it in case we save the NaN in a buffer in
# backprop, if it is ever supported, but it doesn't hurt to do so.
eqc = maxc == minc
cr = maxc - minc
# Since `eqc => cr = 0`, replacing denominator with 1 when `eqc` is fine.
ones = paddle.ones_like(maxc)
s = cr / paddle.where(eqc, ones, maxc)
# Note that `eqc => maxc = minc = r = g = b`. So the following calculation
# of `h` would reduce to `bc - gc + 2 + rc - bc + 4 + rc - bc = 6` so it
# would not matter what values `rc`, `gc`, and `bc` have here, and thus
# replacing denominator with 1 when `eqc` is fine.
cr_divisor = paddle.where(eqc, ones, cr)
rc = (maxc - r) / cr_divisor
gc = (maxc - g) / cr_divisor
bc = (maxc - b) / cr_divisor
hr = (maxc == r).cast('float32') * (bc - gc)
hg = ((maxc == g) & (maxc != r)).cast('float32') * (2.0 + rc - bc)
hb = ((maxc != g) & (maxc != r)).cast('float32') * (4.0 + gc - rc)
h = hr + hg + hb
h = fmod((h / 6.0 + 1.0), paddle.to_tensor(1.0))
return paddle.stack((h, s, maxc), axis=-3)
def _hsv2rgb(img: Tensor) -> Tensor:
h, s, v = img.unbind(axis=-3)
i = paddle.floor(h * 6.0)
f = (h * 6.0) - i
i = i.cast(dtype='int32')
p = paddle.clip((v * (1.0 - s)), 0.0, 1.0)
q = paddle.clip((v * (1.0 - s * f)), 0.0, 1.0)
t = paddle.clip((v * (1.0 - s * (1.0 - f))), 0.0, 1.0)
i = i % 6
mask = i.unsqueeze(axis=-3) == paddle.arange(6).reshape([-1, 1, 1])
a1 = paddle.stack((v, q, p, p, t, v), axis=-3)
a2 = paddle.stack((t, v, v, q, p, p), axis=-3)
a3 = paddle.stack((p, p, t, v, v, q), axis=-3)
a4 = paddle.stack((a1, a2, a3), axis=-4)
return paddle.einsum("...ijk, ...xijk -> ...xjk", mask.cast(dtype=img.dtype), a4)
def adjust_brightness(img: Tensor, brightness_factor: float) -> Tensor:
if brightness_factor < 0:
raise ValueError(f"brightness_factor ({brightness_factor}) is not non-negative.")
return _blend(img, paddle.zeros_like(img), brightness_factor)
def adjust_contrast(img: Tensor, contrast_factor: float) -> Tensor:
if contrast_factor < 0:
raise ValueError(f"contrast_factor ({contrast_factor}) is not non-negative.")
c = img.shape[1]
if c == 3:
output = (0.2989 * img[:, 0:1, :, :] + 0.587 * img[:, 1:2, :, :] + 0.114 * img[:, 2:3, :, :])
mean = paddle.mean(output, axis=(-3, -2, -1), keepdim=True)
else:
mean = paddle.mean(img, axis=(-3, -2, -1), keepdim=True)
return _blend(img, mean, contrast_factor)
def adjust_hue(img: Tensor, hue_factor: float) -> Tensor:
if not (-0.5 <= hue_factor <= 0.5):
raise ValueError(f"hue_factor ({hue_factor}) is not in [-0.5, 0.5].")
img = _rgb2hsv(img)
h, s, v = img.unbind(axis=-3)
h = fmod(h + hue_factor, paddle.to_tensor(1.0))
img = paddle.stack((h, s, v), axis=-3)
img_hue_adj = _hsv2rgb(img)
return img_hue_adj
def adjust_saturation(img: Tensor, saturation_factor: float) -> Tensor:
if saturation_factor < 0:
raise ValueError(f"saturation_factor ({saturation_factor}) is not non-negative.")
output = (0.2989 * img[:, 0:1, :, :] + 0.587 * img[:, 1:2, :, :] + 0.114 * img[:, 2:3, :, :])
return _blend(img, output, saturation_factor)
class ColorJitter(nn.Layer):
def __init__(self, brightness=0, contrast=0, saturation=0, hue=0):
super(ColorJitter, self).__init__()
self.brightness = self._check_input(brightness, "brightness")
self.contrast = self._check_input(contrast, "contrast")
self.saturation = self._check_input(saturation, "saturation")
self.hue = self._check_input(hue, "hue", center=0, bound=(-0.5, 0.5), clip_first_on_zero=False)
def _check_input(self, value, name, center=1, bound=(0, float("inf")), clip_first_on_zero=True):
if isinstance(value, numbers.Number):
if value < 0:
raise ValueError(f"If {name} is a single number, it must be non negative.")
value = [center - float(value), center + float(value)]
if clip_first_on_zero:
value[0] = max(value[0], 0.0)
elif isinstance(value, (tuple, list)) and len(value) == 2:
if not bound[0] <= value[0] <= value[1] <= bound[1]:
raise ValueError(f"{name} values should be between {bound}")
else:
raise TypeError(f"{name} should be a single number or a list/tuple with length 2.")
# if value is 0 or (1., 1.) for brightness/contrast/saturation
# or (0., 0.) for hue, do nothing
if value[0] == value[1] == center:
value = None
return value
@staticmethod
def get_params(
brightness: Optional[List[float]],
contrast: Optional[List[float]],
saturation: Optional[List[float]],
hue: Optional[List[float]],
) -> Tuple[Tensor, Optional[float], Optional[float], Optional[float], Optional[float]]:
"""Get the parameters for the randomized transform to be applied on image.
Args:
brightness (tuple of float (min, max), optional): The range from which the brightness_factor is chosen
uniformly. Pass None to turn off the transformation.
contrast (tuple of float (min, max), optional): The range from which the contrast_factor is chosen
uniformly. Pass None to turn off the transformation.
saturation (tuple of float (min, max), optional): The range from which the saturation_factor is chosen
uniformly. Pass None to turn off the transformation.
hue (tuple of float (min, max), optional): The range from which the hue_factor is chosen uniformly.
Pass None to turn off the transformation.
Returns:
tuple: The parameters used to apply the randomized transform
along with their random order.
"""
fn_idx = paddle.randperm(4)
b = None if brightness is None else paddle.empty([1]).uniform_(brightness[0], brightness[1])
c = None if contrast is None else paddle.empty([1]).uniform_(contrast[0], contrast[1])
s = None if saturation is None else paddle.empty([1]).uniform_(saturation[0], saturation[1])
h = None if hue is None else paddle.empty([1]).uniform_(hue[0], hue[1])
return fn_idx, b, c, s, h
def forward(self, img):
"""
Args:
img (PIL Image or Tensor): Input image.
Returns:
PIL Image or Tensor: Color jittered image.
"""
fn_idx, brightness_factor, contrast_factor, saturation_factor, hue_factor = self.get_params(
self.brightness, self.contrast, self.saturation, self.hue)
for fn_id in fn_idx:
if fn_id == 0 and brightness_factor is not None:
img = adjust_brightness(img, brightness_factor)
elif fn_id == 1 and contrast_factor is not None:
img = adjust_contrast(img, contrast_factor)
elif fn_id == 2 and saturation_factor is not None:
img = adjust_saturation(img, saturation_factor)
elif fn_id == 3 and hue_factor is not None:
img = adjust_hue(img, hue_factor)
return img
def __repr__(self) -> str:
s = (f"{self.__class__.__name__}("
f"brightness={self.brightness}"
f", contrast={self.contrast}"
f", saturation={self.saturation}"
f", hue={self.hue})")
return s
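# Example usage (illustrative sketch, not part of the original file): jitter a
# batched tensor image in [0, 1]; factors are re-sampled on every call:
#
#   jitter = ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1)
#   out = jitter(paddle.rand([4, 3, 224, 224]))   # same shape, values in [0, 1]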
def _apply_grid_transform(img: Tensor, grid: Tensor, mode: str, fill: Optional[List[float]]) -> Tensor:
if img.shape[0] > 1:
# Apply same grid to a batch of images
grid = grid.expand([img.shape[0], grid.shape[1], grid.shape[2], grid.shape[3]])
# Append a dummy mask for customized fill colors, should be faster than grid_sample() twice
if fill is not None:
dummy = paddle.ones((img.shape[0], 1, img.shape[2], img.shape[3]), dtype=img.dtype)
img = paddle.concat((img, dummy), axis=1)
img = grid_sample(img, grid, mode=mode, padding_mode="zeros", align_corners=False)
# Fill with required color
if fill is not None:
mask = img[:, -1:, :, :] # N * 1 * H * W
img = img[:, :-1, :, :] # N * C * H * W
mask = mask.expand_as(img)
len_fill = len(fill) if isinstance(fill, (tuple, list)) else 1
fill_img = paddle.to_tensor(fill, dtype=img.dtype).reshape([1, len_fill, 1, 1]).expand_as(img)
if mode == "nearest":
mask = mask < 0.5
img[mask] = fill_img[mask]
else: # 'bilinear'
img = img * mask + (1.0 - mask) * fill_img
return img
def _gen_affine_grid(
theta: Tensor,
w: int,
h: int,
ow: int,
oh: int,
) -> Tensor:
# https://github.com/pytorch/pytorch/blob/74b65c32be68b15dc7c9e8bb62459efbfbde33d8/aten/src/ATen/native/
# AffineGridGenerator.cpp#L18
# Difference with AffineGridGenerator is that:
# 1) we normalize grid values after applying theta
# 2) we can normalize by other image size, such that it covers "extend" option like in PIL.Image.rotate
d = 0.5
base_grid = paddle.empty([1, oh, ow, 3], dtype=theta.dtype)
x_grid = paddle.linspace(-ow * 0.5 + d, ow * 0.5 + d - 1, num=ow)
base_grid[..., 0] = (x_grid)
y_grid = paddle.linspace(-oh * 0.5 + d, oh * 0.5 + d - 1, num=oh).unsqueeze_(-1)
base_grid[..., 1] = (y_grid)
base_grid[..., 2] = 1.0
rescaled_theta = theta.transpose([0, 2, 1]) / paddle.to_tensor([0.5 * w, 0.5 * h], dtype=theta.dtype)
output_grid = base_grid.reshape([1, oh * ow, 3]).bmm(rescaled_theta)
return output_grid.reshape([1, oh, ow, 2])
def affine_impl(img: Tensor,
matrix: List[float],
interpolation: str = "nearest",
fill: Optional[List[float]] = None) -> Tensor:
theta = paddle.to_tensor(matrix, dtype=img.dtype).reshape([1, 2, 3])
shape = img.shape
# grid will be generated on the same device as theta and img
grid = _gen_affine_grid(theta, w=shape[-1], h=shape[-2], ow=shape[-1], oh=shape[-2])
return _apply_grid_transform(img, grid, interpolation, fill=fill)
def _get_inverse_affine_matrix(center: List[float],
angle: float,
translate: List[float],
scale: float,
shear: List[float],
inverted: bool = True) -> List[float]:
# Helper method to compute inverse matrix for affine transformation
# Pillow requires inverse affine transformation matrix:
# Affine matrix is : M = T * C * RotateScaleShear * C^-1
#
# where T is translation matrix: [1, 0, tx | 0, 1, ty | 0, 0, 1]
# C is translation matrix to keep center: [1, 0, cx | 0, 1, cy | 0, 0, 1]
# RotateScaleShear is rotation with scale and shear matrix
#
# RotateScaleShear(a, s, (sx, sy)) =
# = R(a) * S(s) * SHy(sy) * SHx(sx)
# = [ s*cos(a - sy)/cos(sy), s*(-cos(a - sy)*tan(sx)/cos(sy) - sin(a)), 0 ]
# [ s*sin(a + sy)/cos(sy), s*(-sin(a - sy)*tan(sx)/cos(sy) + cos(a)), 0 ]
# [ 0 , 0 , 1 ]
# where R is a rotation matrix, S is a scaling matrix, and SHx and SHy are the shears:
# SHx(s) = [1, -tan(s)] and SHy(s) = [1 , 0]
# [0, 1 ] [-tan(s), 1]
#
# Thus, the inverse is M^-1 = C * RotateScaleShear^-1 * C^-1 * T^-1
rot = math.radians(angle)
sx = math.radians(shear[0])
sy = math.radians(shear[1])
cx, cy = center
tx, ty = translate
# RSS without scaling
a = math.cos(rot - sy) / math.cos(sy)
b = -math.cos(rot - sy) * math.tan(sx) / math.cos(sy) - math.sin(rot)
c = math.sin(rot - sy) / math.cos(sy)
d = -math.sin(rot - sy) * math.tan(sx) / math.cos(sy) + math.cos(rot)
if inverted:
# Inverted rotation matrix with scale and shear
# det([[a, b], [c, d]]) == 1, since det(rotation) = 1 and det(shear) = 1
matrix = [d, -b, 0.0, -c, a, 0.0]
matrix = [x / scale for x in matrix]
# Apply inverse of translation and of center translation: RSS^-1 * C^-1 * T^-1
matrix[2] += matrix[0] * (-cx - tx) + matrix[1] * (-cy - ty)
matrix[5] += matrix[3] * (-cx - tx) + matrix[4] * (-cy - ty)
# Apply center translation: C * RSS^-1 * C^-1 * T^-1
matrix[2] += cx
matrix[5] += cy
else:
matrix = [a, b, 0.0, c, d, 0.0]
matrix = [x * scale for x in matrix]
# Apply inverse of center translation: RSS * C^-1
matrix[2] += matrix[0] * (-cx) + matrix[1] * (-cy)
matrix[5] += matrix[3] * (-cx) + matrix[4] * (-cy)
# Apply translation and center : T * C * RSS * C^-1
matrix[2] += cx + tx
matrix[5] += cy + ty
return matrix
def affine(
img: Tensor,
angle: float,
translate: List[int],
scale: float,
shear: List[float],
interpolation: InterpolationMode = InterpolationMode.NEAREST,
fill: Optional[List[float]] = None,
resample: Optional[int] = None,
fillcolor: Optional[List[float]] = None,
center: Optional[List[int]] = None,
) -> Tensor:
"""Apply affine transformation on the image keeping image center invariant.
If the image is paddle Tensor, it is expected
to have [..., H, W] shape, where ... means an arbitrary number of leading dimensions.
Args:
img (PIL Image or Tensor): image to transform.
angle (number): rotation angle in degrees between -180 and 180, clockwise direction.
translate (sequence of integers): horizontal and vertical translations (post-rotation translation)
scale (float): overall scale
shear (float or sequence): shear angle value in degrees between -180 to 180, clockwise direction.
If a sequence is specified, the first value corresponds to a shear parallel to the x axis, while
the second value corresponds to a shear parallel to the y axis.
interpolation (InterpolationMode): Desired interpolation enum defined by
:class:`torchvision.transforms.InterpolationMode`. Default is ``InterpolationMode.NEAREST``.
If input is Tensor, only ``InterpolationMode.NEAREST``, ``InterpolationMode.BILINEAR`` are supported.
For backward compatibility integer values (e.g. ``PIL.Image[.Resampling].NEAREST``) are still accepted,
but deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
fill (sequence or number, optional): Pixel fill value for the area outside the transformed
image. If given a number, the value is used for all bands respectively.
.. note::
In torchscript mode single int/float value is not supported, please use a sequence
of length 1: ``[value, ]``.
fillcolor (sequence or number, optional):
.. warning::
This parameter was deprecated in ``0.12`` and will be removed in ``0.14``. Please use ``fill`` instead.
resample (int, optional):
.. warning::
This parameter was deprecated in ``0.12`` and will be removed in ``0.14``. Please use ``interpolation``
instead.
center (sequence, optional): Optional center of rotation. Origin is the upper left corner.
Default is the center of the image.
Returns:
PIL Image or Tensor: Transformed image.
"""
# Backward compatibility with integer value
if isinstance(interpolation, int):
warnings.warn("Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. "
"Please use InterpolationMode enum.")
interpolation = _interpolation_modes_from_int(interpolation)
if fillcolor is not None:
warnings.warn("The parameter 'fillcolor' is deprecated since 0.12 and will be removed in 0.14. "
"Please use 'fill' instead.")
fill = fillcolor
if not isinstance(angle, (int, float)):
raise TypeError("Argument angle should be int or float")
if not isinstance(translate, (list, tuple)):
raise TypeError("Argument translate should be a sequence")
if len(translate) != 2:
raise ValueError("Argument translate should be a sequence of length 2")
if scale <= 0.0:
raise ValueError("Argument scale should be positive")
if not isinstance(shear, (numbers.Number, (list, tuple))):
raise TypeError("Shear should be either a single value or a sequence of two values")
if not isinstance(interpolation, InterpolationMode):
raise TypeError("Argument interpolation should be a InterpolationMode")
if isinstance(angle, int):
angle = float(angle)
if isinstance(translate, tuple):
translate = list(translate)
if isinstance(shear, numbers.Number):
shear = [shear, 0.0]
if isinstance(shear, tuple):
shear = list(shear)
if len(shear) == 1:
shear = [shear[0], shear[0]]
if len(shear) != 2:
raise ValueError(f"Shear should be a sequence containing two values. Got {shear}")
if center is not None and not isinstance(center, (list, tuple)):
raise TypeError("Argument center should be a sequence")
center_f = [0.0, 0.0]
if center is not None:
height, width = img.shape[-2], img.shape[-1]
# Center values should be in pixel coordinates but translated such that (0, 0) corresponds to image center.
center_f = [1.0 * (c - s * 0.5) for c, s in zip(center, [width, height])]
translate_f = [1.0 * t for t in translate]
matrix = _get_inverse_affine_matrix(center_f, angle, translate_f, scale, shear)
return affine_impl(img, matrix=matrix, interpolation=interpolation.value, fill=fill)
def _interpolation_modes_from_int(i: int) -> InterpolationMode:
inverse_modes_mapping = {
0: InterpolationMode.NEAREST,
2: InterpolationMode.BILINEAR,
3: InterpolationMode.BICUBIC,
4: InterpolationMode.BOX,
5: InterpolationMode.HAMMING,
1: InterpolationMode.LANCZOS,
}
return inverse_modes_mapping[i]
def _check_sequence_input(x, name, req_sizes):
msg = req_sizes[0] if len(req_sizes) < 2 else " or ".join([str(s) for s in req_sizes])
if not isinstance(x, Sequence):
raise TypeError(f"{name} should be a sequence of length {msg}.")
if len(x) not in req_sizes:
raise ValueError(f"{name} should be sequence of length {msg}.")
def _setup_angle(x, name, req_sizes=(2, )):
if isinstance(x, numbers.Number):
if x < 0:
raise ValueError(f"If {name} is a single number, it must be positive.")
x = [-x, x]
else:
_check_sequence_input(x, name, req_sizes)
return [float(d) for d in x]
class RandomAffine(nn.Layer):
"""Random affine transformation of the image keeping center invariant.
If the image is paddle Tensor, it is expected
to have [..., H, W] shape, where ... means an arbitrary number of leading dimensions.
Args:
degrees (sequence or number): Range of degrees to select from.
If degrees is a number instead of sequence like (min, max), the range of degrees
will be (-degrees, +degrees). Set to 0 to deactivate rotations.
translate (tuple, optional): tuple of maximum absolute fraction for horizontal
and vertical translations. For example translate=(a, b), then horizontal shift
is randomly sampled in the range -img_width * a < dx < img_width * a and vertical shift is
randomly sampled in the range -img_height * b < dy < img_height * b. Will not translate by default.
scale (tuple, optional): scaling factor interval, e.g (a, b), then scale is
randomly sampled from the range a <= scale <= b. Will keep original scale by default.
shear (sequence or number, optional): Range of degrees to select from.
If shear is a number, a shear parallel to the x axis in the range (-shear, +shear)
will be applied. Else if shear is a sequence of 2 values a shear parallel to the x axis in the
range (shear[0], shear[1]) will be applied. Else if shear is a sequence of 4 values,
a x-axis shear in (shear[0], shear[1]) and y-axis shear in (shear[2], shear[3]) will be applied.
Will not apply shear by default.
interpolation (InterpolationMode): Desired interpolation enum defined by
:class:`torchvision.transforms.InterpolationMode`. Default is ``InterpolationMode.NEAREST``.
If input is Tensor, only ``InterpolationMode.NEAREST``, ``InterpolationMode.BILINEAR`` are supported.
For backward compatibility integer values (e.g. ``PIL.Image[.Resampling].NEAREST``) are still accepted,
but deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
fill (sequence or number): Pixel fill value for the area outside the transformed
image. Default is ``0``. If given a number, the value is used for all bands respectively.
fillcolor (sequence or number, optional):
.. warning::
This parameter was deprecated in ``0.12`` and will be removed in ``0.14``. Please use ``fill`` instead.
resample (int, optional):
.. warning::
This parameter was deprecated in ``0.12`` and will be removed in ``0.14``. Please use ``interpolation``
instead.
center (sequence, optional): Optional center of rotation, (x, y). Origin is the upper left corner.
Default is the center of the image.
.. _filters: https://pillow.readthedocs.io/en/latest/handbook/concepts.html#filters
"""
def __init__(
self,
degrees,
translate=None,
scale=None,
shear=None,
interpolation=InterpolationMode.NEAREST,
fill=0,
fillcolor=None,
resample=None,
center=None,
):
super(RandomAffine, self).__init__()
if resample is not None:
warnings.warn("The parameter 'resample' is deprecated since 0.12 and will be removed in 0.14. "
"Please use 'interpolation' instead.")
interpolation = _interpolation_modes_from_int(resample)
# Backward compatibility with integer value
if isinstance(interpolation, int):
warnings.warn("Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. "
"Please use InterpolationMode enum.")
interpolation = _interpolation_modes_from_int(interpolation)
if fillcolor is not None:
warnings.warn("The parameter 'fillcolor' is deprecated since 0.12 and will be removed in 0.14. "
"Please use 'fill' instead.")
fill = fillcolor
self.degrees = _setup_angle(degrees, name="degrees", req_sizes=(2, ))
if translate is not None:
_check_sequence_input(translate, "translate", req_sizes=(2, ))
for t in translate:
if not (0.0 <= t <= 1.0):
raise ValueError("translation values should be between 0 and 1")
self.translate = translate
if scale is not None:
_check_sequence_input(scale, "scale", req_sizes=(2, ))
for s in scale:
if s <= 0:
raise ValueError("scale values should be positive")
self.scale = scale
if shear is not None:
self.shear = _setup_angle(shear, name="shear", req_sizes=(2, 4))
else:
self.shear = shear
self.resample = self.interpolation = interpolation
if fill is None:
fill = 0
elif not isinstance(fill, (Sequence, numbers.Number)):
raise TypeError("Fill should be either a sequence or a number.")
self.fillcolor = self.fill = fill
if center is not None:
_check_sequence_input(center, "center", req_sizes=(2, ))
self.center = center
@staticmethod
def get_params(
degrees: List[float],
translate: Optional[List[float]],
scale_ranges: Optional[List[float]],
shears: Optional[List[float]],
img_size: List[int],
) -> Tuple[float, Tuple[int, int], float, Tuple[float, float]]:
"""Get parameters for affine transformation
Returns:
params to be passed to the affine transformation
"""
angle = float(paddle.empty([1]).uniform_(float(degrees[0]), float(degrees[1])))
if translate is not None:
max_dx = float(translate[0] * img_size[0])
max_dy = float(translate[1] * img_size[1])
tx = int(float(paddle.empty([1]).uniform_(-max_dx, max_dx)))
ty = int(float(paddle.empty([1]).uniform_(-max_dy, max_dy)))
translations = (tx, ty)
else:
translations = (0, 0)
if scale_ranges is not None:
scale = float(paddle.empty([1]).uniform_(scale_ranges[0], scale_ranges[1]))
else:
scale = 1.0
shear_x = shear_y = 0.0
if shears is not None:
shear_x = float(paddle.empty([1]).uniform_(shears[0], shears[1]))
if len(shears) == 4:
shear_y = float(paddle.empty([1]).uniform_(shears[2], shears[3]))
shear = (shear_x, shear_y)
return angle, translations, scale, shear
def forward(self, img):
fill = self.fill
channels, height, width = img.shape[1], img.shape[2], img.shape[3]
if isinstance(fill, (int, float)):
fill = [float(fill)] * channels
else:
fill = [float(f) for f in fill]
img_size = [width, height] # flip for keeping BC on get_params call
ret = self.get_params(self.degrees, self.translate, self.scale, self.shear, img_size)
return affine(img, *ret, interpolation=self.interpolation, fill=fill, center=self.center)
def __repr__(self) -> str:
s = f"{self.__class__.__name__}(degrees={self.degrees}"
s += f", translate={self.translate}" if self.translate is not None else ""
s += f", scale={self.scale}" if self.scale is not None else ""
s += f", shear={self.shear}" if self.shear is not None else ""
s += f", interpolation={self.interpolation.value}" if self.interpolation != InterpolationMode.NEAREST else ""
s += f", fill={self.fill}" if self.fill != 0 else ""
s += f", center={self.center}" if self.center is not None else ""
s += ")"
return s
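# Example usage (illustrative sketch, not part of the original file): random
# rotation and translation on a batched tensor image, mirroring how the cutout
# augmentations above are constructed:
#
#   aug = RandomAffine(degrees=10, translate=(0.05, 0.05),
#                      interpolation=InterpolationMode.BILINEAR)
#   out = aug(paddle.rand([1, 3, 224, 224]))   # shape preserved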
'''
This code is rewritten in Paddle based on
https://github.com/openai/guided-diffusion/blob/main/guided_diffusion/unet.py
'''
import math
from abc import abstractmethod
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from .nn import avg_pool_nd
from .nn import checkpoint
from .nn import conv_nd
from .nn import linear
from .nn import normalization
from .nn import SiLU
from .nn import timestep_embedding
from .nn import zero_module
class AttentionPool2d(nn.Layer):
"""
Adapted from CLIP: https://github.com/openai/CLIP/blob/main/clip/model.py
"""
def __init__(
self,
spacial_dim: int,
embed_dim: int,
num_heads_channels: int,
output_dim: int = None,
):
super().__init__()
# self.positional_embedding = nn.Parameter(
# th.randn(embed_dim, spacial_dim ** 2 + 1) / embed_dim ** 0.5
# )
        positional_embedding = self.create_parameter(
            shape=[embed_dim, spacial_dim**2 + 1],
            default_initializer=nn.initializer.Assign(
                paddle.randn([embed_dim, spacial_dim**2 + 1]) / embed_dim**0.5))
self.add_parameter("positional_embedding", positional_embedding)
self.qkv_proj = conv_nd(1, embed_dim, 3 * embed_dim, 1)
self.c_proj = conv_nd(1, embed_dim, output_dim or embed_dim, 1)
self.num_heads = embed_dim // num_heads_channels
self.attention = QKVAttention(self.num_heads)
def forward(self, x):
b, c, *_spatial = x.shape
# x = x.reshape(b, c, -1) # NC(HW)
x = paddle.reshape(x, [b, c, -1])
        x = paddle.concat([x.mean(axis=-1, keepdim=True), x], axis=-1)  # NC(HW+1)
x = x + paddle.cast(self.positional_embedding[None, :, :], x.dtype) # NC(HW+1)
x = self.qkv_proj(x)
x = self.attention(x)
x = self.c_proj(x)
return x[:, :, 0]
class TimestepBlock(nn.Layer):
"""
Any module where forward() takes timestep embeddings as a second argument.
"""
@abstractmethod
def forward(self, x, emb):
"""
Apply the module to `x` given `emb` timestep embeddings.
"""
class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
"""
A sequential module that passes timestep embeddings to the children that
support it as an extra input.
"""
def forward(self, x, emb):
for layer in self:
if isinstance(layer, TimestepBlock):
x = layer(x, emb)
else:
x = layer(x)
return x
class Upsample(nn.Layer):
"""
An upsampling layer with an optional convolution.
:param channels: channels in the inputs and outputs.
:param use_conv: a bool determining if a convolution is applied.
:param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
upsampling occurs in the inner-two dimensions.
"""
def __init__(self, channels, use_conv, dims=2, out_channels=None):
super().__init__()
self.channels = channels
self.out_channels = out_channels or channels
self.use_conv = use_conv
self.dims = dims
if use_conv:
self.conv = conv_nd(dims, self.channels, self.out_channels, 3, padding=1)
def forward(self, x):
assert x.shape[1] == self.channels
if self.dims == 3:
x = F.interpolate(x, (x.shape[2], x.shape[3] * 2, x.shape[4] * 2), mode="nearest")
else:
x = F.interpolate(x, scale_factor=2, mode="nearest")
if self.use_conv:
x = self.conv(x)
return x
class Downsample(nn.Layer):
"""
A downsampling layer with an optional convolution.
:param channels: channels in the inputs and outputs.
:param use_conv: a bool determining if a convolution is applied.
:param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
downsampling occurs in the inner-two dimensions.
"""
def __init__(self, channels, use_conv, dims=2, out_channels=None):
super().__init__()
self.channels = channels
self.out_channels = out_channels or channels
self.use_conv = use_conv
self.dims = dims
stride = 2 if dims != 3 else (1, 2, 2)
if use_conv:
self.op = conv_nd(dims, self.channels, self.out_channels, 3, stride=stride, padding=1)
else:
assert self.channels == self.out_channels
self.op = avg_pool_nd(dims, kernel_size=stride, stride=stride)
def forward(self, x):
assert x.shape[1] == self.channels
return self.op(x)
class ResBlock(TimestepBlock):
"""
A residual block that can optionally change the number of channels.
:param channels: the number of input channels.
:param emb_channels: the number of timestep embedding channels.
:param dropout: the rate of dropout.
:param out_channels: if specified, the number of out channels.
:param use_conv: if True and out_channels is specified, use a spatial
convolution instead of a smaller 1x1 convolution to change the
channels in the skip connection.
:param dims: determines if the signal is 1D, 2D, or 3D.
:param use_checkpoint: if True, use gradient checkpointing on this module.
:param up: if True, use this block for upsampling.
:param down: if True, use this block for downsampling.
"""
def __init__(
self,
channels,
emb_channels,
dropout,
out_channels=None,
use_conv=False,
use_scale_shift_norm=False,
dims=2,
use_checkpoint=False,
up=False,
down=False,
):
super().__init__()
self.channels = channels
self.emb_channels = emb_channels
self.dropout = dropout
self.out_channels = out_channels or channels
self.use_conv = use_conv
self.use_checkpoint = use_checkpoint
self.use_scale_shift_norm = use_scale_shift_norm
self.in_layers = nn.Sequential(
normalization(channels),
SiLU(),
conv_nd(dims, channels, self.out_channels, 3, padding=1),
)
self.updown = up or down
if up:
self.h_upd = Upsample(channels, False, dims)
self.x_upd = Upsample(channels, False, dims)
elif down:
self.h_upd = Downsample(channels, False, dims)
self.x_upd = Downsample(channels, False, dims)
else:
self.h_upd = self.x_upd = nn.Identity()
self.emb_layers = nn.Sequential(
SiLU(),
linear(
emb_channels,
2 * self.out_channels if use_scale_shift_norm else self.out_channels,
),
)
self.out_layers = nn.Sequential(
normalization(self.out_channels),
SiLU(),
nn.Dropout(p=dropout),
zero_module(conv_nd(dims, self.out_channels, self.out_channels, 3, padding=1)),
)
if self.out_channels == channels:
self.skip_connection = nn.Identity()
elif use_conv:
self.skip_connection = conv_nd(dims, channels, self.out_channels, 3, padding=1)
else:
self.skip_connection = conv_nd(dims, channels, self.out_channels, 1)
def forward(self, x, emb):
"""
Apply the block to a Tensor, conditioned on a timestep embedding.
:param x: an [N x C x ...] Tensor of features.
:param emb: an [N x emb_channels] Tensor of timestep embeddings.
:return: an [N x C x ...] Tensor of outputs.
"""
return checkpoint(self._forward, (x, emb), self.parameters(), self.use_checkpoint)
def _forward(self, x, emb):
if self.updown:
in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
h = in_rest(x)
h = self.h_upd(h)
x = self.x_upd(x)
h = in_conv(h)
else:
h = self.in_layers(x)
emb_out = self.emb_layers(emb)
emb_out = paddle.cast(emb_out, h.dtype)
while len(emb_out.shape) < len(h.shape):
emb_out = emb_out[..., None]
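        # Scale-shift (FiLM-style) conditioning: the embedding predicts a per-channel
        # (scale, shift) pair that modulates the normalized features; otherwise the
        # embedding is simply added to the features.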
if self.use_scale_shift_norm:
out_norm, out_rest = self.out_layers[0], self.out_layers[1:]
scale, shift = paddle.chunk(emb_out, 2, axis=1)
h = out_norm(h) * (1 + scale) + shift
h = out_rest(h)
else:
h = h + emb_out
h = self.out_layers(h)
return self.skip_connection(x) + h
class AttentionBlock(nn.Layer):
"""
An attention block that allows spatial positions to attend to each other.
Originally ported from here, but adapted to the N-d case.
https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66.
"""
def __init__(
self,
channels,
num_heads=1,
num_head_channels=-1,
use_checkpoint=False,
use_new_attention_order=False,
):
super().__init__()
self.channels = channels
if num_head_channels == -1:
self.num_heads = num_heads
else:
assert (channels % num_head_channels == 0
), f"q,k,v channels {channels} is not divisible by num_head_channels {num_head_channels}"
self.num_heads = channels // num_head_channels
self.use_checkpoint = use_checkpoint
self.norm = normalization(channels)
self.qkv = conv_nd(1, channels, channels * 3, 1)
if use_new_attention_order:
# split qkv before split heads
self.attention = QKVAttention(self.num_heads)
else:
# split heads before split qkv
self.attention = QKVAttentionLegacy(self.num_heads)
self.proj_out = zero_module(conv_nd(1, channels, channels, 1))
def forward(self, x):
return checkpoint(self._forward, (x, ), self.parameters(), self.use_checkpoint)
def _forward(self, x):
b, c, *spatial = x.shape
# x = x.reshape(b, c, -1)
x = paddle.reshape(x, [b, c, -1])
qkv = self.qkv(self.norm(x))
h = self.attention(qkv)
h = self.proj_out(h)
# return (x + h).reshape(b, c, *spatial)
return paddle.reshape(x + h, [b, c, *spatial])
def count_flops_attn(model, _x, y):
"""
A counter for the `thop` package to count the operations in an
attention operation.
Meant to be used like:
macs, params = thop.profile(
model,
inputs=(inputs, timestamps),
custom_ops={QKVAttention: QKVAttention.count_flops},
)
"""
b, c, *spatial = y[0].shape
num_spatial = int(np.prod(spatial))
# We perform two matmuls with the same number of ops.
# The first computes the weight matrix, the second computes
# the combination of the value vectors.
matmul_ops = 2 * b * (num_spatial**2) * c
model.total_ops += paddle.to_tensor([matmul_ops], dtype='float64')
class QKVAttentionLegacy(nn.Layer):
"""
    A module which performs QKV attention. Matches legacy QKVAttention + input/output heads shaping
"""
def __init__(self, n_heads):
super().__init__()
self.n_heads = n_heads
def forward(self, qkv):
"""
Apply QKV attention.
:param qkv: an [N x (H * 3 * C) x T] tensor of Qs, Ks, and Vs.
:return: an [N x (H * C) x T] tensor after attention.
"""
bs, width, length = qkv.shape
assert width % (3 * self.n_heads) == 0
ch = width // (3 * self.n_heads)
# q, k, v = qkv.reshape(bs * self.n_heads, ch * 3, length).split(ch, dim=1)
q, k, v = paddle.reshape(qkv, [bs * self.n_heads, ch * 3, length]).split(3, axis=1)
scale = 1 / math.sqrt(math.sqrt(ch))
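        # Applying 1/ch**0.25 to both q and k is equivalent to dividing the attention
        # logits by sqrt(ch), but keeps intermediate values small enough for fp16.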
weight = paddle.einsum("bct,bcs->bts", q * scale, k * scale) # More stable with f16 than dividing afterwards
weight = paddle.cast(nn.functional.softmax(paddle.cast(weight, 'float32'), axis=-1), weight.dtype)
a = paddle.einsum("bts,bcs->bct", weight, v)
# return a.reshape(bs, -1, length)
return paddle.reshape(a, [bs, -1, length])
@staticmethod
def count_flops(model, _x, y):
return count_flops_attn(model, _x, y)
class QKVAttention(nn.Layer):
"""
A module which performs QKV attention and splits in a different order.
"""
def __init__(self, n_heads):
super().__init__()
self.n_heads = n_heads
def forward(self, qkv):
"""
Apply QKV attention.
:param qkv: an [N x (3 * H * C) x T] tensor of Qs, Ks, and Vs.
:return: an [N x (H * C) x T] tensor after attention.
"""
bs, width, length = qkv.shape
assert width % (3 * self.n_heads) == 0
ch = width // (3 * self.n_heads)
q, k, v = qkv.chunk(3, axis=1)
scale = 1 / math.sqrt(math.sqrt(ch))
        weight = paddle.einsum(
            "bct,bcs->bts",
            paddle.reshape(q * scale, [bs * self.n_heads, ch, length]),
            paddle.reshape(k * scale, [bs * self.n_heads, ch, length]),
        )  # More stable with f16 than dividing afterwards
        weight = paddle.cast(nn.functional.softmax(paddle.cast(weight, 'float32'), axis=-1), weight.dtype)
        a = paddle.einsum("bts,bcs->bct", weight, paddle.reshape(v, [bs * self.n_heads, ch, length]))
# return a.reshape(bs, -1, length)
return paddle.reshape(a, [bs, -1, length])
@staticmethod
def count_flops(model, _x, y):
return count_flops_attn(model, _x, y)
class UNetModel(nn.Layer):
"""
The full UNet model with attention and timestep embedding.
:param in_channels: channels in the input Tensor.
:param model_channels: base channel count for the model.
:param out_channels: channels in the output Tensor.
:param num_res_blocks: number of residual blocks per downsample.
:param attention_resolutions: a collection of downsample rates at which
attention will take place. May be a set, list, or tuple.
For example, if this contains 4, then at 4x downsampling, attention
will be used.
:param dropout: the dropout probability.
:param channel_mult: channel multiplier for each level of the UNet.
:param conv_resample: if True, use learned convolutions for upsampling and
downsampling.
:param dims: determines if the signal is 1D, 2D, or 3D.
:param num_classes: if specified (as an int), then this model will be
class-conditional with `num_classes` classes.
:param use_checkpoint: use gradient checkpointing to reduce memory usage.
:param num_heads: the number of attention heads in each attention layer.
    :param num_head_channels: if specified, ignore num_heads and instead use
a fixed channel width per attention head.
:param num_heads_upsample: works with num_heads to set a different number
of heads for upsampling. Deprecated.
:param use_scale_shift_norm: use a FiLM-like conditioning mechanism.
:param resblock_updown: use residual blocks for up/downsampling.
:param use_new_attention_order: use a different attention pattern for potentially
increased efficiency.
"""
def __init__(
self,
image_size,
in_channels,
model_channels,
out_channels,
num_res_blocks,
attention_resolutions,
dropout=0,
channel_mult=(1, 2, 4, 8),
conv_resample=True,
dims=2,
num_classes=None,
use_checkpoint=False,
use_fp16=False,
num_heads=1,
num_head_channels=-1,
num_heads_upsample=-1,
use_scale_shift_norm=False,
resblock_updown=False,
use_new_attention_order=False,
):
super().__init__()
if num_heads_upsample == -1:
num_heads_upsample = num_heads
self.image_size = image_size
self.in_channels = in_channels
self.model_channels = model_channels
self.out_channels = out_channels
self.num_res_blocks = num_res_blocks
self.attention_resolutions = attention_resolutions
self.dropout = dropout
self.channel_mult = channel_mult
self.conv_resample = conv_resample
self.num_classes = num_classes
self.use_checkpoint = use_checkpoint
self.dtype = paddle.float16 if use_fp16 else paddle.float32
self.num_heads = num_heads
self.num_head_channels = num_head_channels
self.num_heads_upsample = num_heads_upsample
time_embed_dim = model_channels * 4
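        # `timestep_embedding` produces a sinusoidal embedding of the timestep;
        # this 2-layer MLP projects it to `time_embed_dim`, and the result then
        # conditions every ResBlock through its emb_layers.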
self.time_embed = nn.Sequential(
linear(model_channels, time_embed_dim),
SiLU(),
linear(time_embed_dim, time_embed_dim),
)
if self.num_classes is not None:
self.label_emb = nn.Embedding(num_classes, time_embed_dim)
ch = input_ch = int(channel_mult[0] * model_channels)
self.input_blocks = nn.LayerList([TimestepEmbedSequential(conv_nd(dims, in_channels, ch, 3, padding=1))])
self._feature_size = ch
input_block_chans = [ch]
ds = 1
for level, mult in enumerate(channel_mult):
for _ in range(num_res_blocks):
layers = [
ResBlock(
ch,
time_embed_dim,
dropout,
out_channels=int(mult * model_channels),
dims=dims,
use_checkpoint=use_checkpoint,
use_scale_shift_norm=use_scale_shift_norm,
)
]
ch = int(mult * model_channels)
if ds in attention_resolutions:
layers.append(
AttentionBlock(
ch,
use_checkpoint=use_checkpoint,
num_heads=num_heads,
num_head_channels=num_head_channels,
use_new_attention_order=use_new_attention_order,
))
self.input_blocks.append(TimestepEmbedSequential(*layers))
self._feature_size += ch
input_block_chans.append(ch)
if level != len(channel_mult) - 1:
out_ch = ch
self.input_blocks.append(
TimestepEmbedSequential(
ResBlock(
ch,
time_embed_dim,
dropout,
out_channels=out_ch,
dims=dims,
use_checkpoint=use_checkpoint,
use_scale_shift_norm=use_scale_shift_norm,
down=True,
) if resblock_updown else Downsample(ch, conv_resample, dims=dims, out_channels=out_ch)))
ch = out_ch
input_block_chans.append(ch)
ds *= 2
self._feature_size += ch
self.middle_block = TimestepEmbedSequential(
ResBlock(
ch,
time_embed_dim,
dropout,
dims=dims,
use_checkpoint=use_checkpoint,
use_scale_shift_norm=use_scale_shift_norm,
),
AttentionBlock(
ch,
use_checkpoint=use_checkpoint,
num_heads=num_heads,
num_head_channels=num_head_channels,
use_new_attention_order=use_new_attention_order,
),
ResBlock(
ch,
time_embed_dim,
dropout,
dims=dims,
use_checkpoint=use_checkpoint,
use_scale_shift_norm=use_scale_shift_norm,
),
)
self._feature_size += ch
self.output_blocks = nn.LayerList([])
for level, mult in list(enumerate(channel_mult))[::-1]:
for i in range(num_res_blocks + 1):
ich = input_block_chans.pop()
layers = [
ResBlock(
ch + ich,
time_embed_dim,
dropout,
out_channels=int(model_channels * mult),
dims=dims,
use_checkpoint=use_checkpoint,
use_scale_shift_norm=use_scale_shift_norm,
)
]
ch = int(model_channels * mult)
if ds in attention_resolutions:
layers.append(
AttentionBlock(
ch,
use_checkpoint=use_checkpoint,
num_heads=num_heads_upsample,
num_head_channels=num_head_channels,
use_new_attention_order=use_new_attention_order,
))
if level and i == num_res_blocks:
out_ch = ch
layers.append(
ResBlock(
ch,
time_embed_dim,
dropout,
out_channels=out_ch,
dims=dims,
use_checkpoint=use_checkpoint,
use_scale_shift_norm=use_scale_shift_norm,
up=True,
) if resblock_updown else Upsample(ch, conv_resample, dims=dims, out_channels=out_ch))
ds //= 2
self.output_blocks.append(TimestepEmbedSequential(*layers))
self._feature_size += ch
self.out = nn.Sequential(
normalization(ch),
SiLU(),
zero_module(conv_nd(dims, input_ch, out_channels, 3, padding=1)),
)
def forward(self, x, timesteps, y=None):
"""
Apply the model to an input batch.
:param x: an [N x C x ...] Tensor of inputs.
:param timesteps: a 1-D batch of timesteps.
:param y: an [N] Tensor of labels, if class-conditional.
:return: an [N x C x ...] Tensor of outputs.
"""
assert (y is not None) == (self.num_classes
is not None), "must specify y if and only if the model is class-conditional"
hs = []
emb = self.time_embed(timestep_embedding(timesteps, self.model_channels))
if self.num_classes is not None:
assert y.shape == (x.shape[0], )
emb = emb + self.label_emb(y)
h = paddle.cast(x, self.dtype)
for module in self.input_blocks:
h = module(h, emb)
hs.append(h)
h = self.middle_block(h, emb)
for module in self.output_blocks:
h = paddle.concat([h, hs.pop()], axis=1)
h = module(h, emb)
# h = paddle.cast(h, x.dtype)
return self.out(h)
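# A minimal, hypothetical usage sketch (these channel/resolution settings are
# illustrative only, not the configuration shipped with this module): build a
# tiny UNetModel and run one denoising step on random noise.
#
#     unet = UNetModel(image_size=64, in_channels=3, model_channels=32,
#                      out_channels=3, num_res_blocks=1,
#                      attention_resolutions={4}, channel_mult=(1, 2))
#     x = paddle.randn([2, 3, 64, 64])
#     t = paddle.randint(0, 1000, [2])
#     eps = unet(x, t)  # same shape as the input: [2, 3, 64, 64]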
class SuperResModel(UNetModel):
"""
A UNetModel that performs super-resolution.
Expects an extra kwarg `low_res` to condition on a low-resolution image.
"""
def __init__(self, image_size, in_channels, *args, **kwargs):
super().__init__(image_size, in_channels * 2, *args, **kwargs)
def forward(self, x, timesteps, low_res=None, **kwargs):
_, _, new_height, new_width = x.shape
upsampled = F.interpolate(low_res, (new_height, new_width), mode="bilinear")
x = paddle.concat([x, upsampled], axis=1)
return super().forward(x, timesteps, **kwargs)
class EncoderUNetModel(nn.Layer):
"""
The half UNet model with attention and timestep embedding.
For usage, see UNet.
"""
def __init__(
self,
image_size,
in_channels,
model_channels,
out_channels,
num_res_blocks,
attention_resolutions,
dropout=0,
channel_mult=(1, 2, 4, 8),
conv_resample=True,
dims=2,
use_checkpoint=False,
use_fp16=False,
num_heads=1,
num_head_channels=-1,
num_heads_upsample=-1,
use_scale_shift_norm=False,
resblock_updown=False,
use_new_attention_order=False,
pool="adaptive",
):
super().__init__()
if num_heads_upsample == -1:
num_heads_upsample = num_heads
self.in_channels = in_channels
self.model_channels = model_channels
self.out_channels = out_channels
self.num_res_blocks = num_res_blocks
self.attention_resolutions = attention_resolutions
self.dropout = dropout
self.channel_mult = channel_mult
self.conv_resample = conv_resample
self.use_checkpoint = use_checkpoint
self.dtype = paddle.float16 if use_fp16 else paddle.float32
self.num_heads = num_heads
self.num_head_channels = num_head_channels
self.num_heads_upsample = num_heads_upsample
time_embed_dim = model_channels * 4
self.time_embed = nn.Sequential(
linear(model_channels, time_embed_dim),
SiLU(),
linear(time_embed_dim, time_embed_dim),
)
ch = int(channel_mult[0] * model_channels)
self.input_blocks = nn.LayerList([TimestepEmbedSequential(conv_nd(dims, in_channels, ch, 3, padding=1))])
self._feature_size = ch
input_block_chans = [ch]
ds = 1
for level, mult in enumerate(channel_mult):
for _ in range(num_res_blocks):
layers = [
ResBlock(
ch,
time_embed_dim,
dropout,
out_channels=int(mult * model_channels),
dims=dims,
use_checkpoint=use_checkpoint,
use_scale_shift_norm=use_scale_shift_norm,
)
]
ch = int(mult * model_channels)
if ds in attention_resolutions:
layers.append(
AttentionBlock(
ch,
use_checkpoint=use_checkpoint,
num_heads=num_heads,
num_head_channels=num_head_channels,
use_new_attention_order=use_new_attention_order,
))
self.input_blocks.append(TimestepEmbedSequential(*layers))
self._feature_size += ch
input_block_chans.append(ch)
if level != len(channel_mult) - 1:
out_ch = ch
self.input_blocks.append(
TimestepEmbedSequential(
ResBlock(
ch,
time_embed_dim,
dropout,
out_channels=out_ch,
dims=dims,
use_checkpoint=use_checkpoint,
use_scale_shift_norm=use_scale_shift_norm,
down=True,
) if resblock_updown else Downsample(ch, conv_resample, dims=dims, out_channels=out_ch)))
ch = out_ch
input_block_chans.append(ch)
ds *= 2
self._feature_size += ch
self.middle_block = TimestepEmbedSequential(
ResBlock(
ch,
time_embed_dim,
dropout,
dims=dims,
use_checkpoint=use_checkpoint,
use_scale_shift_norm=use_scale_shift_norm,
),
AttentionBlock(
ch,
use_checkpoint=use_checkpoint,
num_heads=num_heads,
num_head_channels=num_head_channels,
use_new_attention_order=use_new_attention_order,
),
ResBlock(
ch,
time_embed_dim,
dropout,
dims=dims,
use_checkpoint=use_checkpoint,
use_scale_shift_norm=use_scale_shift_norm,
),
)
self._feature_size += ch
self.pool = pool
if pool == "adaptive":
self.out = nn.Sequential(
normalization(ch),
SiLU(),
nn.AdaptiveAvgPool2D((1, 1)),
zero_module(conv_nd(dims, ch, out_channels, 1)),
nn.Flatten(),
)
elif pool == "attention":
assert num_head_channels != -1
self.out = nn.Sequential(
normalization(ch),
SiLU(),
AttentionPool2d((image_size // ds), ch, num_head_channels, out_channels),
)
elif pool == "spatial":
self.out = nn.Sequential(
nn.Linear(self._feature_size, 2048),
nn.ReLU(),
nn.Linear(2048, self.out_channels),
)
elif pool == "spatial_v2":
self.out = nn.Sequential(
nn.Linear(self._feature_size, 2048),
normalization(2048),
SiLU(),
nn.Linear(2048, self.out_channels),
)
else:
raise NotImplementedError(f"Unexpected {pool} pooling")
def forward(self, x, timesteps):
"""
Apply the model to an input batch.
:param x: an [N x C x ...] Tensor of inputs.
:param timesteps: a 1-D batch of timesteps.
:return: an [N x K] Tensor of outputs.
"""
emb = self.time_embed(timestep_embedding(timesteps, self.model_channels))
results = []
# h = x.type(self.dtype)
h = paddle.cast(x, self.dtype)
for module in self.input_blocks:
h = module(h, emb)
if self.pool.startswith("spatial"):
# results.append(h.type(x.dtype).mean(axis=(2, 3)))
results.append(paddle.cast(h, x.dtype).mean(axis=(2, 3)))
h = self.middle_block(h, emb)
if self.pool.startswith("spatial"):
results.append(paddle.cast(h, x.dtype).mean(axis=(2, 3)))
h = paddle.concat(results, axis=-1)
return self.out(h)
else:
# h = h.type(x.dtype)
h = paddle.cast(h, x.dtype)
return self.out(h)
text_prompts:
- greg rutkowski和thomas kinkade在artstation上的一幅美丽的画,一个独特的灯塔,照耀着它的光穿过喧嚣的血海。
init_image:
width_height: [1280, 768]
skip_steps: 10
steps: 250
cut_ic_pow: 1
init_scale: 1000
clip_guidance_scale: 5000
tv_scale: 0
range_scale: 150
sat_scale: 0
cutn_batches: 4
diffusion_model: 512x512_diffusion_uncond_finetune_008100
use_secondary_model: True
diffusion_sampling_mode: ddim
perlin_init: False
perlin_mode: mixed
seed: 445467575
eta: 0.8
clamp_grad: True
clamp_max: 0.05
randomize_class: True
clip_denoised: False
fuzzy_prompt: False
rand_mag: 0.05
cut_overview: "[12]*400+[4]*600"
cut_innercut: "[4]*400+[12]*600"
cut_icgray_p: "[0.2]*400+[0]*600"
display_rate: 10
n_batches: 1
batch_size: 1
batch_name: ''
clip_models:
- ViTB16
output_dir: "./"
text_prompts: |
Phrase, sentence, or string of words and phrases describing what the image should look like. The words will be analyzed by the AI and will guide the diffusion process toward the image(s) you describe. These can include commas and weights to adjust the relative importance of each element. E.g. "A beautiful painting of a singular lighthouse, shining its light across a tumultuous sea of blood by greg rutkowski and thomas kinkade, Trending on artstation."
Notice that this prompt loosely follows a structure: [subject], [prepositional details], [setting], [meta modifiers and artist]; this is a good starting point for your experiments.
Developing text prompts takes practice and experience, and is not the subject of this guide. If you are a beginner to writing text prompts, a good place to start is on a simple AI art app like Night Cafe, starry ai or WOMBO prior to using DD, to get a feel for how text gets translated into images by GAN tools. These other apps use different technologies, but many of the same principles apply.
init_image: |
Recall that in the image sequence above, the first image shown is just noise. If an init_image is provided, diffusion will replace the noise with the init_image as its starting state. To use an init_image, upload the image to the Colab instance or your Google Drive, and enter the full image path here.
If using an init_image, you may need to increase skip_steps to ~ 50% of total steps to retain the character of the init. See skip_steps above for further discussion.
width_height: |
Desired final image size, in pixels. You can have a square, wide, or tall image, but each edge length should be set to a multiple of 64px, and a minimum of 512px on the default CLIP model setting. If you forget to use multiples of 64px in your dimensions, DD will adjust the dimensions of your image to make it so.
skip_steps: |
Consider the chart shown here. Noise scheduling (denoise strength) starts very high and progressively gets lower and lower as diffusion steps progress. The noise levels in the first few steps are very high, so images change dramatically in early steps.
As DD moves along the curve, noise levels (and thus the amount an image changes per step) declines, and image coherence from one step to the next increases.
The first few steps of denoising are often so dramatic that some steps (maybe 10-15% of total) can be skipped without affecting the final image. You can experiment with this as a way to cut render times.
If you skip too many steps, however, the remaining noise may not be high enough to generate new content, and thus may not have ‘time left’ to finish an image satisfactorily.
Also, depending on your other settings, you may need to skip steps to prevent CLIP from overshooting your goal, resulting in ‘blown out’ colors (hyper saturated, solid white, or solid black regions) or otherwise poor image quality. Consider that the denoising process is at its strongest in the early steps, so skipping steps can sometimes mitigate other problems.
Lastly, if using an init_image, you will need to skip ~50% of the diffusion steps to retain the shapes in the original init image.
However, if you’re using an init_image, you can also adjust skip_steps up or down for creative reasons. With low skip_steps you can get a result "inspired by" the init_image which will retain the colors and rough layout and shapes but look quite different. With high skip_steps you can preserve most of the init_image contents and just do fine tuning of the texture.
steps: |
When creating an image, the denoising curve is subdivided into steps for processing. Each step (or iteration) involves the AI looking at subsets of the image called ‘cuts’ and calculating the ‘direction’ the image should be guided to be more like the prompt. Then it adjusts the image with the help of the diffusion denoiser, and moves to the next step.
Increasing steps will provide more opportunities for the AI to adjust the image, and each adjustment will be smaller, and thus will yield a more precise, detailed image. Increasing steps comes at the expense of longer render times. Also, while increasing steps should generally increase image quality, there is a diminishing return on additional steps beyond 250 - 500 steps. However, some intricate images can take 1000, 2000, or more steps. It is really up to the user.
Just know that the render time is directly related to the number of steps, and many other parameters have a major impact on image quality, without costing additional time.
cut_ic_pow: |
This sets the size of the border used for inner cuts. High cut_ic_pow values have larger borders, and therefore the cuts themselves will be smaller and provide finer details. If you have too many or too-small inner cuts, you may lose overall image coherency and/or it may cause an undesirable ‘mosaic’ effect. Low cut_ic_pow values will allow the inner cuts to be larger, helping image coherency while still helping with some details.
init_scale: |
This controls how strongly CLIP will try to match the init_image provided. This is balanced against the clip_guidance_scale (CGS) above. Too much init scale, and the image won’t change much during diffusion. Too much CGS and the init image will be lost.
clip_guidance_scale: |
CGS is one of the most important parameters you will use. It tells DD how strongly you want CLIP to move toward your prompt each timestep. Higher is generally better, but if CGS is too strong it will overshoot the goal and distort the image. So a happy medium is needed, and it takes experience to learn how to adjust CGS.
Note that this parameter generally scales with image dimensions. In other words, if you increase your total dimensions by 50% (e.g. a change from 512 x 512 to 512 x 768), then to maintain the same effect on the image, you’d want to increase clip_guidance_scale from 5000 to 7500.
Of the basic settings, clip_guidance_scale, steps and skip_steps are the most important contributors to image quality, so learn them well.
tv_scale: |
Total variance denoising. Optional, set to zero to turn off. Controls ‘smoothness’ of final output. If used, tv_scale will try to smooth out your final image to reduce overall noise. If your image is too ‘crunchy’, increase tv_scale. TV denoising is good at preserving edges while smoothing away noise in flat regions. See https://en.wikipedia.org/wiki/Total_variation_denoising
range_scale: |
Optional, set to zero to turn off. Used for adjustment of color contrast. Lower range_scale will increase contrast. Very low numbers create a reduced color palette, resulting in more vibrant or poster-like images. Higher range_scale will reduce contrast, for more muted images.
sat_scale: |
Saturation scale. Optional, set to zero to turn off. If used, sat_scale will help mitigate oversaturation. If your image is too saturated, increase sat_scale to reduce the saturation.
cutn_batches: |
Each iteration, the AI cuts the image into smaller pieces known as cuts, and compares each cut to the prompt to decide how to guide the next diffusion step. More cuts can generally lead to better images, since DD has more chances to fine-tune the image precision in each timestep.
Additional cuts are memory intensive, however, and if DD tries to evaluate too many cuts at once, it can run out of memory. You can use cutn_batches to increase cuts per timestep without increasing memory usage.
At the default settings, DD is scheduled to do 16 cuts per timestep. If cutn_batches is set to 1, there will indeed only be 16 cuts total per timestep.
However, if cutn_batches is increased to 4, DD will do 64 cuts total in each timestep, divided into 4 sequential batches of 16 cuts each. Because the cuts are being evaluated only 16 at a time, DD uses the memory required for only 16 cuts, but gives you the quality benefit of 64 cuts. The tradeoff, of course, is that this will take ~4 times as long to render each image.
So, (scheduled cuts) x (cutn_batches) = (total cuts per timestep). Increasing cutn_batches will increase render times, however, as the work is being done sequentially. DD’s default cut schedule is a good place to start, but the cut schedule can be adjusted in the Cutn Scheduling section, explained below.
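# Worked example with the shipped defaults: cut_overview "[12]*400+[4]*600" plus
# cut_innercut "[4]*400+[12]*600" schedule 16 cuts per timestep (12+4 early,
# 4+12 late); with cutn_batches: 4 that is 64 cuts per timestep, evaluated as
# 4 sequential batches of 16.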
diffusion_model: Diffusion_model of choice.
use_secondary_model: |
Option to use a secondary purpose-made diffusion model to clean up interim diffusion images for CLIP evaluation. If this option is turned off, DD will use the regular (large) diffusion model. Using the secondary model is faster - one user reported a 50% improvement in render speed! However, the secondary model is much smaller, and may reduce image quality and detail. I suggest you experiment with this.
diffusion_sampling_mode: |
Two alternate diffusion denoising algorithms. ddim has been around longer, and is more established and tested. plms is a newly added alternate method that promises good diffusion results in fewer steps, but has not been as fully tested and may have side effects. This new plms mode is actively being researched in the #settings-and-techniques channel in the DD Discord.
perlin_init: |
Normally, DD will use an image filled with random noise as a starting point for the diffusion curve. If perlin_init is selected, DD will instead use a Perlin noise model as an initial state. Perlin has very interesting characteristics, distinct from random noise, so it’s worth experimenting with this for your projects. Beyond perlin, you can, of course, generate your own noise images (such as with GIMP, etc) and use them as an init_image (without skipping steps).
  Choosing perlin_init does not affect the actual diffusion process, just the starting point for the diffusion. Please note that selecting a perlin_init will replace and override any init_image you may have specified. Further, because the 2D, 3D and video animation systems all rely on the init_image system, if you enable Perlin while using animation modes, the perlin_init will jump in front of any previous image or video input, and DD will NOT give you the expected sequence of coherent images. All of that said, using Perlin and animation modes together does make a very colorful rainbow effect, which can be used creatively.
perlin_mode: |
  Sets the type of Perlin noise: colored, gray, or a mix of both, giving you additional options for noise types. Experiment to see what these do in your projects.
seed: |
Deep in the diffusion code, there is a random number ‘seed’ which is used as the basis for determining the initial state of the diffusion. By default, this is random, but you can also specify your own seed. This is useful if you like a particular result and would like to run more iterations that will be similar.
After each run, the actual seed value used will be reported in the parameters report, and can be reused if desired by entering seed # here. If a specific numerical seed is used repeatedly, the resulting images will be quite similar but not identical.
eta: |
eta (greek letter η) is a diffusion model variable that mixes in a random amount of scaled noise into each timestep. 0 is no noise, 1.0 is more noise. As with most DD parameters, you can go below zero for eta, but it may give you unpredictable results.
  The steps parameter has a close relationship with the eta parameter. If you set eta to 0, then you can get decent output with only 50-75 steps. Setting eta to 1.0 favors higher step counts, ideally around 250 and up. eta has a subtle, unpredictable effect on the image, so you’ll need to experiment to see how this affects your projects.
clamp_grad: |
As I understand it, clamp_grad is an internal limiter that stops DD from producing extreme results. Try your images with and without clamp_grad. If the image changes drastically with clamp_grad turned off, it probably means your clip_guidance_scale is too high and should be reduced.
clamp_max: |
Sets the value of the clamp_grad limitation. Default is 0.05, providing for smoother, more muted coloration in images, but setting higher values (0.15-0.3) can provide interesting contrast and vibrancy.
randomize_class: |
  Controls whether the ImageNet class used for conditioning is randomly re-sampled at each step when sampling from class-conditional diffusion models, which adds variety to the guidance.
clip_denoised: |
  Determines whether the model's denoised prediction is clipped into the valid [-1, 1] image range at each step.
fuzzy_prompt: |
Controls whether to add multiple noisy prompts to the prompt losses. If True, can increase variability of image output. Experiment with this.
rand_mag: |
Affects only the fuzzy_prompt. Controls the magnitude of the random noise added by fuzzy_prompt.
cut_overview: The schedule of overview cuts
cut_innercut: The schedule of inner cuts
cut_icgray_p: The schedule for the portion of inner cuts that are converted to grayscale before evaluation. Grayscale cuts emphasize shape and composition over color, which can help overall image structure, especially early in a run.
display_rate: |
During a diffusion run, you can monitor the progress of each image being created with this variable. If display_rate is set to 50, DD will show you the in-progress image every 50 timesteps. Setting this to a lower value, like 5 or 10, is a good way to get an early peek at where your image is heading. If you don’t like the progression, just interrupt execution, change some settings, and re-run. If you are planning a long, unmonitored batch, it’s better to set display_rate equal to steps, because displaying interim images does slow Colab down slightly.
n_batches: |
This variable sets the number of still images you want DD to create. If you are using an animation mode (see below for details) DD will ignore n_batches and create a single set of animated frames based on the animation settings.
batch_name: |
  The name of the batch. The batch id will be named as "discoart-[batch_name]-seed". To avoid your artworks being overwritten by other users, please use a unique name.
clip_models: |
CLIP Model selectors. ViT-B/32, ViT-B/16, ViT-L/14, RN101, RN50, RN50x4, RN50x16, RN50x64.
These various CLIP models are available for you to use during image generation. Models have different styles or ‘flavors,’ so look around.
You can mix in multiple models as well for different results. However, keep in mind that some models are extremely memory-hungry, and turning on additional models will take additional memory and may cause a crash.
The rough order of speed/mem usage is (smallest/fastest to largest/slowest):
ViT-B/32
RN50
RN101
ViT-B/16
RN50x4
RN50x16
RN50x64
ViT-L/14
For RN50x64 & ViTL14 you may need to use fewer cuts, depending on your VRAM.
'''
This code is rewritten by Paddle based on Jina-ai/discoart.
https://github.com/jina-ai/discoart/blob/main/discoart/runner.py
'''
import gc
import os
import random
from threading import Thread
import numpy as np
import paddle
import paddle.vision.transforms as T
import paddle_lpips as lpips
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.utils.utils import tokenize
from docarray import Document
from docarray import DocumentArray
from IPython import display
from ipywidgets import Output
from PIL import Image
from .helper import logger
from .helper import parse_prompt
from .model.losses import range_loss
from .model.losses import spherical_dist_loss
from .model.losses import tv_loss
from .model.make_cutouts import MakeCutoutsDango
from .model.sec_diff import alpha_sigma_to_t
from .model.sec_diff import SecondaryDiffusionImageNet2
from .model.transforms import Normalize
def do_run(args, models) -> 'DocumentArray':
logger.info('preparing models...')
model, diffusion, clip_models, secondary_model = models
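    # Standard ImageNet mean/std; cutouts are normalized with these statistics
    # before being fed to the image encoder.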
normalize = Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225],
)
lpips_model = lpips.LPIPS(net='vgg')
for parameter in lpips_model.parameters():
parameter.stop_gradient = True
side_x = (args.width_height[0] // 64) * 64
side_y = (args.width_height[1] // 64) * 64
cut_overview = eval(args.cut_overview)
cut_innercut = eval(args.cut_innercut)
cut_icgray_p = eval(args.cut_icgray_p)
from .model.perlin_noises import create_perlin_noise, regen_perlin
seed = args.seed
skip_steps = args.skip_steps
loss_values = []
if seed is not None:
np.random.seed(seed)
random.seed(seed)
paddle.seed(seed)
model_stats = []
for clip_model in clip_models:
model_stat = {
'clip_model': None,
'target_embeds': [],
'make_cutouts': None,
'weights': [],
}
model_stat['clip_model'] = clip_model
if isinstance(args.text_prompts, str):
args.text_prompts = [args.text_prompts]
for prompt in args.text_prompts:
txt, weight = parse_prompt(prompt)
txt = clip_model.encode_text(tokenize(prompt))
if args.fuzzy_prompt:
for i in range(25):
model_stat['target_embeds'].append((txt + paddle.randn(txt.shape) * args.rand_mag).clip(0, 1))
model_stat['weights'].append(weight)
else:
model_stat['target_embeds'].append(txt)
model_stat['weights'].append(weight)
model_stat['target_embeds'] = paddle.concat(model_stat['target_embeds'])
model_stat['weights'] = paddle.to_tensor(model_stat['weights'])
if model_stat['weights'].sum().abs() < 1e-3:
raise RuntimeError('The weights must not sum to 0.')
model_stat['weights'] /= model_stat['weights'].sum().abs()
model_stats.append(model_stat)
init = None
if args.init_image:
d = Document(uri=args.init_image).load_uri_to_image_tensor(side_x, side_y)
init = T.to_tensor(d.tensor).unsqueeze(0) * 2 - 1
if args.perlin_init:
if args.perlin_mode == 'color':
init = create_perlin_noise([1.5**-i * 0.5 for i in range(12)], 1, 1, False, side_y, side_x)
init2 = create_perlin_noise([1.5**-i * 0.5 for i in range(8)], 4, 4, False, side_y, side_x)
elif args.perlin_mode == 'gray':
init = create_perlin_noise([1.5**-i * 0.5 for i in range(12)], 1, 1, True, side_y, side_x)
init2 = create_perlin_noise([1.5**-i * 0.5 for i in range(8)], 4, 4, True, side_y, side_x)
else:
init = create_perlin_noise([1.5**-i * 0.5 for i in range(12)], 1, 1, False, side_y, side_x)
init2 = create_perlin_noise([1.5**-i * 0.5 for i in range(8)], 4, 4, True, side_y, side_x)
init = (T.to_tensor(init).add(T.to_tensor(init2)).divide(paddle.to_tensor(2.0)).unsqueeze(0) * 2 - 1)
del init2
cur_t = None
def cond_fn(x, t, y=None):
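        # Guidance function passed to the diffusion sampler: it estimates the denoised
        # image, scores cutouts of it against the text embeddings, and returns the
        # (negated, optionally clamped) gradient used to steer the next sampling step.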
x_is_NaN = False
n = x.shape[0]
if secondary_model:
alpha = paddle.to_tensor(diffusion.sqrt_alphas_cumprod[cur_t], dtype='float32')
sigma = paddle.to_tensor(diffusion.sqrt_one_minus_alphas_cumprod[cur_t], dtype='float32')
cosine_t = alpha_sigma_to_t(alpha, sigma)
x = paddle.to_tensor(x.detach(), dtype='float32')
x.stop_gradient = False
cosine_t = paddle.tile(paddle.to_tensor(cosine_t.detach().cpu().numpy()), [n])
cosine_t.stop_gradient = False
out = secondary_model(x, cosine_t).pred
fac = diffusion.sqrt_one_minus_alphas_cumprod[cur_t]
x_in_d = out * fac + x * (1 - fac)
x_in = x_in_d.detach()
x_in.stop_gradient = False
x_in_grad = paddle.zeros_like(x_in, dtype='float32')
else:
t = paddle.ones([n], dtype='int64') * cur_t
out = diffusion.p_mean_variance(model, x, t, clip_denoised=False, model_kwargs={'y': y})
fac = diffusion.sqrt_one_minus_alphas_cumprod[cur_t]
x_in_d = out['pred_xstart'] * fac + x * (1 - fac)
x_in = x_in_d.detach()
x_in.stop_gradient = False
x_in_grad = paddle.zeros_like(x_in, dtype='float32')
for model_stat in model_stats:
for i in range(args.cutn_batches):
t_int = (int(t.item()) + 1) # errors on last step without +1, need to find source
# when using SLIP Base model the dimensions need to be hard coded to avoid AttributeError: 'VisionTransformer' object has no attribute 'input_resolution'
                try:
                    input_resolution = model_stat['clip_model'].visual.input_resolution
                except AttributeError:
                    input_resolution = 224
cuts = MakeCutoutsDango(
input_resolution,
Overview=cut_overview[1000 - t_int],
InnerCrop=cut_innercut[1000 - t_int],
IC_Size_Pow=args.cut_ic_pow,
IC_Grey_P=cut_icgray_p[1000 - t_int],
)
clip_in = normalize(cuts(x_in.add(paddle.to_tensor(1.0)).divide(paddle.to_tensor(2.0))))
image_embeds = (model_stat['clip_model'].encode_image(clip_in))
dists = spherical_dist_loss(
image_embeds.unsqueeze(1),
model_stat['target_embeds'].unsqueeze(0),
)
dists = dists.reshape([
cut_overview[1000 - t_int] + cut_innercut[1000 - t_int],
n,
-1,
])
losses = dists.multiply(model_stat['weights']).sum(2).mean(0)
loss_values.append(losses.sum().item()) # log loss, probably shouldn't do per cutn_batch
x_in_grad += ((paddle.grad(losses.sum() * args.clip_guidance_scale, x_in)[0]) / args.cutn_batches)
tv_losses = tv_loss(x_in)
range_losses = range_loss(x_in)
sat_losses = paddle.abs(x_in - x_in.clip(min=-1, max=1)).mean()
loss = (tv_losses.sum() * args.tv_scale + range_losses.sum() * args.range_scale +
sat_losses.sum() * args.sat_scale)
if init is not None and args.init_scale:
init_losses = lpips_model(x_in, init)
loss = loss + init_losses.sum() * args.init_scale
x_in_grad += paddle.grad(loss, x_in)[0]
if not paddle.isnan(x_in_grad).any():
grad = -paddle.grad(x_in_d, x, x_in_grad)[0]
else:
x_is_NaN = True
grad = paddle.zeros_like(x)
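        # Normalize the guidance gradient by its RMS magnitude and cap it at clamp_max
        # (see the clamp_grad / clamp_max settings docs) to avoid blown-out updates.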
if args.clamp_grad and not x_is_NaN:
magnitude = grad.square().mean().sqrt()
return (grad * magnitude.clip(max=args.clamp_max) / magnitude)
return grad
if args.diffusion_sampling_mode == 'ddim':
sample_fn = diffusion.ddim_sample_loop_progressive
else:
sample_fn = diffusion.plms_sample_loop_progressive
logger.info('creating artwork...')
image_display = Output()
da_batches = DocumentArray()
for _nb in range(args.n_batches):
display.clear_output(wait=True)
display.display(args.name_docarray, image_display)
gc.collect()
paddle.device.cuda.empty_cache()
d = Document(tags=vars(args))
da_batches.append(d)
cur_t = diffusion.num_timesteps - skip_steps - 1
if args.perlin_init:
init = regen_perlin(args.perlin_mode, side_y, side_x, args.batch_size)
if args.diffusion_sampling_mode == 'ddim':
samples = sample_fn(
model,
(args.batch_size, 3, side_y, side_x),
clip_denoised=args.clip_denoised,
model_kwargs={},
cond_fn=cond_fn,
progress=True,
skip_timesteps=skip_steps,
init_image=init,
randomize_class=args.randomize_class,
eta=args.eta,
)
else:
samples = sample_fn(
model,
(args.batch_size, 3, side_y, side_x),
clip_denoised=args.clip_denoised,
model_kwargs={},
cond_fn=cond_fn,
progress=True,
skip_timesteps=skip_steps,
init_image=init,
randomize_class=args.randomize_class,
order=2,
)
threads = []
for j, sample in enumerate(samples):
cur_t -= 1
with image_display:
if j % args.display_rate == 0 or cur_t == -1:
for _, image in enumerate(sample['pred_xstart']):
image = (image + 1) / 2
image = image.clip(0, 1).squeeze().transpose([1, 2, 0]).numpy() * 255
image = np.uint8(image)
image = Image.fromarray(image)
image.save(os.path.join(args.output_dir, 'progress-{}.png'.format(_nb)))
c = Document(tags={'cur_t': cur_t})
c.load_pil_image_to_datauri(image)
d.chunks.append(c)
display.clear_output(wait=True)
display.display(display.Image(os.path.join(args.output_dir, 'progress-{}.png'.format(_nb))))
d.chunks.plot_image_sprites(os.path.join(args.output_dir,
f'{args.name_docarray}-progress-{_nb}.png'),
show_index=True)
t = Thread(
target=_silent_push,
args=(
da_batches,
args.name_docarray,
),
)
threads.append(t)
t.start()
if cur_t == -1:
d.load_pil_image_to_datauri(image)
for t in threads:
t.join()
display.clear_output(wait=True)
logger.info(f'done! {args.name_docarray}')
da_batches.plot_image_sprites(skip_empty=True, show_index=True, keep_aspect_ratio=True)
return da_batches
def _silent_push(da_batches: DocumentArray, name: str) -> None:
try:
da_batches.push(name)
except Exception as ex:
logger.debug(f'push failed: {ex}')
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
__version__ = '2.0.0' # Maybe dev is better
from . import transformers
from . import utils
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import sys
import warnings
from functools import partial
from functools import reduce
import paddle
from paddle.fluid import core
from paddle.fluid.data_feeder import check_dtype
from paddle.fluid.data_feeder import check_type
from paddle.fluid.data_feeder import check_variable_and_dtype
from paddle.fluid.data_feeder import convert_dtype
from paddle.fluid.framework import default_main_program
from paddle.fluid.framework import in_dygraph_mode
from paddle.fluid.framework import Variable
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.layers import control_flow
from paddle.fluid.layers import nn
from paddle.fluid.layers import sequence_lod
from paddle.fluid.layers import tensor
from paddle.fluid.layers import utils
from paddle.fluid.layers.utils import *
from paddle.fluid.param_attr import ParamAttr
from paddle.utils import deprecated
#import paddle.nn as nn
class ArrayWrapper(object):
def __init__(self, x):
self.array = [x]
def append(self, x):
self.array.append(x)
return self
def __getitem__(self, item):
return self.array.__getitem__(item)
def _maybe_copy(state, new_state, step_mask):
"""update rnn state or just pass the old state through"""
new_state = nn.elementwise_mul(new_state, step_mask, axis=0) \
+ nn.elementwise_mul(state, (1 - step_mask), axis=0)
return new_state
def _transpose_batch_time(x):
perm = [1, 0] + list(range(2, len(x.shape)))
return nn.transpose(x, perm)
class Decoder(object):
"""
:api_attr: Static Graph
Decoder is the base class for any decoder instance used in `dynamic_decode`.
It provides interface for output generation for one time step, which can be
used to generate sequences.
The key abstraction provided by Decoder is:
1. :code:`(initial_input, initial_state, finished) = initialize(inits)` ,
which generates the input and state for the first decoding step, and gives the
initial status telling whether each sequence in the batch is finished.
It would be called once before the decoding iterations.
2. :code:`(output, next_state, next_input, finished) = step(time, input, state)` ,
which transforms the input and state to the output and new state, generates
input for the next decoding step, and emits the flag indicating finished status.
It is the main part for each decoding iteration.
3. :code:`(final_outputs, final_state) = finalize(outputs, final_state, sequence_lengths)` ,
which revises the outputs(stack of all time steps' output) and final state(state from the
last decoding step) to get the counterpart for special usage.
Not necessary to be implemented if no need to revise the stacked outputs and
state from the last decoding step. If implemented, it would be called after
the decoding iterations.
Decoder is more general compared to RNNCell, since the returned `next_input`
and `finished` make it can determine the input and when to finish by itself
when used in dynamic decoding. Decoder always wraps a RNNCell instance though
not necessary.
"""
def initialize(self, inits):
r"""
Called once before the decoding iterations.
Parameters:
inits: Argument provided by the caller.
Returns:
tuple: A tuple( :code:`(initial_inputs, initial_states, finished)` ). \
`initial_inputs` and `initial_states` both are a (possibly nested \
structure of) tensor variable[s], and `finished` is a tensor with \
bool data type.
"""
raise NotImplementedError
def step(self, time, inputs, states, **kwargs):
r"""
Called per step of decoding.
Parameters:
time(Variable): A Tensor with shape :math:`[1]` provided by the caller.
The data type is int64.
inputs(Variable): A (possibly nested structure of) tensor variable[s].
states(Variable): A (possibly nested structure of) tensor variable[s].
**kwargs: Additional keyword arguments, provided by the caller.
Returns:
            tuple: A tuple( :code:`(outputs, next_states, next_inputs, finished)` ). \
`next_inputs` and `next_states` both are a (possibly nested \
structure of) tensor variable[s], and the structure, shape and \
data type must be same as the counterpart from input arguments. \
`outputs` is a (possibly nested structure of) tensor variable[s]. \
`finished` is a Tensor with bool data type.
"""
raise NotImplementedError
def finalize(self, outputs, final_states, sequence_lengths):
r"""
Called once after the decoding iterations if implemented.
Parameters:
outputs(Variable): A (possibly nested structure of) tensor variable[s].
                The structure and data type are the same as `output_dtype`.
The tensor stacks all time steps' output thus has shape
:math:`[time\_step, batch\_size, ...]` , which is done by the caller.
final_states(Variable): A (possibly nested structure of) tensor variable[s].
It is the `next_states` returned by `decoder.step` at last decoding step,
thus has the same structure, shape and data type with states at any time
step.
Returns:
tuple: A tuple( :code:`(final_outputs, final_states)` ). \
`final_outputs` and `final_states` both are a (possibly nested \
structure of) tensor variable[s].
"""
raise NotImplementedError
@property
def tracks_own_finished(self):
"""
Describes whether the Decoder keeps track of finished states by itself.
        `decoder.step()` emits a bool `finished` value at each decoding step.
        The emitted `finished` can either be used directly to determine whether
        every batch entry is finished, or it can be combined with the finished
        tracker kept in `dynamic_decode` by a logical OR, to take already
        finished entries into account.
        If `False`, the latter is used when performing `dynamic_decode`, which
        is the default. Otherwise, the former is used: the finished value
        emitted by the decoder is taken directly as the finished status of all
        batch entries, which is the case when batch entries might be reordered,
        such as beams in BeamSearchDecoder.
Returns:
bool: A python bool `False`.
"""
return False
class BeamSearchDecoder(Decoder):
"""
Decoder with beam search decoding strategy. It wraps a cell to get probabilities,
and follows a beam search step to calculate scores and select candidate
token ids for each decoding step.
Please refer to `Beam search <https://en.wikipedia.org/wiki/Beam_search>`_
for more details.
**NOTE** When decoding with beam search, the `inputs` and `states` of cell
would be tiled to `beam_size` (unsqueeze and tile), resulting to shapes like
`[batch_size * beam_size, ...]` , which is built into `BeamSearchDecoder` and
done automatically. Thus any other tensor with shape `[batch_size, ...]` used
in `cell.call` needs to be tiled manually first, which can be completed by using
:code:`BeamSearchDecoder.tile_beam_merge_with_batch` . The most common case
for this is the encoder output in attention mechanism.
Returns:
BeamSearchDecoder: An instance of decoder which can be used in \
`paddle.nn.dynamic_decode` to implement decoding.
Examples:
.. code-block:: python
import numpy as np
import paddle
from paddle.nn import BeamSearchDecoder, dynamic_decode
from paddle.nn import GRUCell, Linear, Embedding
trg_embeder = Embedding(100, 32)
output_layer = Linear(32, 32)
decoder_cell = GRUCell(input_size=32, hidden_size=32)
decoder = BeamSearchDecoder(decoder_cell,
start_token=0,
end_token=1,
beam_size=4,
embedding_fn=trg_embeder,
output_fn=output_layer)
"""
def __init__(self, cell, start_token, end_token, beam_size, embedding_fn=None, output_fn=None):
"""
Constructor of BeamSearchDecoder.
Parameters:
cell(RNNCellBase): An instance of `RNNCellBase` or object with the same interface.
start_token(int): The start token id.
end_token(int): The end token id.
beam_size(int): The beam width used in beam search.
embedding_fn(optional): A callable to apply to selected candidate ids.
Mostly it is an embedding layer to transform ids to embeddings,
and the returned value acts as the `input` argument for `cell.call`.
If not provided, the id to embedding transformation must be built into
`cell.call`. Default None.
output_fn(optional): A callable to apply to the cell's output prior to
calculate scores and select candidate token ids. Default None.
"""
self.cell = cell
self.embedding_fn = embedding_fn
self.output_fn = output_fn
self.start_token = start_token
self.end_token = end_token
self.beam_size = beam_size
@staticmethod
def tile_beam_merge_with_batch(x, beam_size):
r"""
Tile the batch dimension of a tensor. Specifically, this function takes
a tensor t shaped `[batch_size, s0, s1, ...]` composed of minibatch
entries `t[0], ..., t[batch_size - 1]` and tiles it to have a shape
`[batch_size * beam_size, s0, s1, ...]` composed of minibatch entries
`t[0], t[0], ..., t[1], t[1], ...` where each minibatch entry is repeated
`beam_size` times.
Parameters:
x(Variable): A tensor with shape `[batch_size, ...]`. The data type
should be float32, float64, int32, int64 or bool.
beam_size(int): The beam width used in beam search.
Returns:
Variable: A tensor with shape `[batch_size * beam_size, ...]`, whose \
data type is same as `x`.
"""
check_type(x, 'x', (Variable), 'BeamSearchDecoder.tile_beam_merge_with_batch')
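        # Shape walk-through for a hypothetical x of shape [2, 5] with beam_size=3:
        # unsqueeze -> [2, 1, 5], tile -> [2, 3, 5], then the transpose/reshape pair
        # merges batch and beam into [6, 5], with rows ordered t[0],t[0],t[0],t[1],...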
x = nn.unsqueeze(x, [1]) # [batch_size, 1, ...]
expand_times = [1] * len(x.shape)
expand_times[1] = beam_size
x = paddle.tile(x, expand_times) # [batch_size, beam_size, ...]
x = nn.transpose(x, list(range(2, len(x.shape))) + [0, 1]) # [..., batch_size, beam_size]
# use 0 to copy to avoid wrong shape
x = nn.reshape(x, shape=[0] * (len(x.shape) - 2) + [-1]) # [..., batch_size * beam_size]
x = nn.transpose(x, [len(x.shape) - 1] + list(range(0, len(x.shape) - 1))) # [batch_size * beam_size, ...]
return x
def _split_batch_beams(self, x):
r"""
Reshape a tensor with shape `[batch_size * beam_size, ...]` to a new
tensor with shape `[batch_size, beam_size, ...]`.
Parameters:
x(Variable): A tensor with shape `[batch_size * beam_size, ...]`. The
data type should be float32, float64, int32, int64 or bool.
Returns:
Variable: A tensor with shape `[batch_size, beam_size, ...]`, whose \
data type is the same as `x`.
"""
check_type(x, 'x', (Variable), 'BeamSearchDecoder._split_batch_beams')
# TODO: avoid fake shape in compile-time like tile_beam_merge_with_batch
return nn.reshape(x, shape=[-1, self.beam_size] + list(x.shape[1:]))
def _merge_batch_beams(self, x):
r"""
Reshape a tensor with shape `[batch_size, beam_size, ...]` to a new
tensor with shape `[batch_size * beam_size, ...]`.
Parameters:
x(Variable): A tensor with shape `[batch_size, beam_size, ...]`. The
data type should be float32, float64, int32, int64 or bool.
Returns:
Variable: A tensor with shape `[batch_size * beam_size, ...]`, whose \
data type is the same as `x`.
"""
check_type(x, 'x', (Variable), 'BeamSearchDecoder._merge_batch_beams')
# TODO: avoid fake shape in compile-time like tile_beam_merge_with_batch
return nn.reshape(x, shape=[-1] + list(x.shape[2:]))
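# Shape round-trip of the two helpers above (illustrative, with beam_size=4):
#   _merge_batch_beams: [batch_size, 4, hidden] -> [batch_size * 4, hidden]
#   _split_batch_beams: [batch_size * 4, hidden] -> [batch_size, 4, hidden]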
def _expand_to_beam_size(self, x):
r"""
This function takes a tensor t shaped `[batch_size, s0, s1, ...]` composed
of minibatch entries `t[0], ..., t[batch_size - 1]` and tiles it to have a
shape `[batch_size, beam_size, s0, s1, ...]` composed of minibatch entries
`t[0], t[0], ..., t[1], t[1], ...` where each minibatch entry is repeated
`beam_size` times.
Parameters:
x(Variable): A tensor with shape `[batch_size, ...]`. The data type
should be float32, float64, int32, int64 or bool.
Returns:
Variable: A tensor with shape `[batch_size, beam_size, ...]`, whose \
data type is the same as `x`.
"""
check_type(x, 'x', (Variable), 'BeamSearchDecoder._expand_to_beam_size')
x = nn.unsqueeze(x, [1])
expand_times = [1] * len(x.shape)
expand_times[1] = self.beam_size
x = paddle.tile(x, expand_times)
return x
def _mask_probs(self, probs, finished):
r"""
Mask log probabilities. It forces finished beams to allocate all probability
mass to eos and unfinished beams to remain unchanged.
Parameters:
probs(Variable): A tensor with shape `[batch_size, beam_size, vocab_size]`,
representing the log probabilities. Its data type should be float32 or float64.
finished(Variable): A tensor with shape `[batch_size, beam_size]`,
representing the finished status for all beams. Its data type
should be bool.
Returns:
Variable: A tensor with the same shape and data type as `probs`, \
where unfinished beams stay unchanged and finished beams are \
replaced with a tensor with all probability mass on the EOS token.
"""
check_type(probs, 'probs', (Variable), 'BeamSearchDecoder._mask_probs')
check_type(finished, 'finished', (Variable), 'BeamSearchDecoder._mask_probs')
# TODO: use where_op
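# The arithmetic below is equivalent to where(finished, noend_mask, probs):
# finished rows become noend_mask_tensor (all -inf except 0 at end_token),
# while unfinished rows keep probs, since -probs * (0 - 1) = probs.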
finished = tensor.cast(finished, dtype=probs.dtype)
probs = nn.elementwise_mul(paddle.tile(nn.unsqueeze(finished, [2]), [1, 1, self.vocab_size]),
self.noend_mask_tensor,
axis=-1) - nn.elementwise_mul(probs, (finished - 1), axis=0)
return probs
def _gather(self, x, indices, batch_size):
r"""
Gather from the tensor `x` using `indices`.
Parameters:
x(Variable): A tensor with shape `[batch_size, beam_size, ...]`.
indices(Variable): A `int64` tensor with shape `[batch_size, beam_size]`,
representing the indices that we use to gather.
batch_size(Variable): A tensor with shape `[1]`. Its data type should
be int32 or int64.
Returns:
Variable: A tensor with the same shape and data type as `x`, \
representing the gathered tensor.
"""
check_type(x, 'x', (Variable), 'BeamSearchDecoder._gather')
check_type(indices, 'indices', (Variable), 'BeamSearchDecoder._gather')
check_type(batch_size, 'batch_size', (Variable), 'BeamSearchDecoder._gather')
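# Illustrative: with batch_size=2, beam_size=3 and indices=[[2, 0, 1], [1, 1, 0]],
# topk_coordinates pairs each batch index with its beam index, e.g.
# [[0, 2], [0, 0], [0, 1]] for the first sample, so gather_nd selects
# beams within each batch entry.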
# TODO: compatibility of int32 and int64
batch_size = tensor.cast(batch_size, indices.dtype) if batch_size.dtype != indices.dtype else batch_size
batch_size.stop_gradient = True # TODO: remove this
batch_pos = paddle.tile(nn.unsqueeze(tensor.range(0, batch_size, 1, dtype=indices.dtype), [1]),
[1, self.beam_size])
topk_coordinates = nn.stack([batch_pos, indices], axis=2)
topk_coordinates.stop_gradient = True
return nn.gather_nd(x, topk_coordinates)
class OutputWrapper(collections.namedtuple("OutputWrapper", ("scores", "predicted_ids", "parent_ids"))):
"""
The structure for the returned value `outputs` of `decoder.step`.
A namedtuple includes scores, predicted_ids, parent_ids as fields.
"""
pass
class StateWrapper(collections.namedtuple("StateWrapper", ("cell_states", "log_probs", "finished", "lengths"))):
"""
The structure for the argument `states` of `decoder.step`.
A namedtuple includes cell_states, log_probs, finished, lengths as fields.
"""
pass
def initialize(self, initial_cell_states, bos_ids=None):
r"""
Initialize the BeamSearchDecoder.
Parameters:
initial_cell_states(Variable): A (possibly nested structure of)
tensor variable[s]. An argument provided by the caller.
bos_ids(int, optional): If provided, overrides `start_token` as the
start token id for this pass of decoding. Default None.
Returns:
tuple: A tuple( :code:`(initial_inputs, initial_states, finished)` ). \
`initial_inputs` is a tensor t filled by `start_token` with shape \
`[batch_size, beam_size]` when `embedding_fn` is None, or the \
returned value of `embedding_fn(t)` when `embedding_fn` is provided. \
`initial_states` is a nested structure(namedtuple including cell_states, \
log_probs, finished, lengths as fields) of tensor variables, where \
`log_probs, finished, lengths` all have tensor values shaped \
`[batch_size, beam_size]` with data types `float32, bool, int64` respectively. \
cell_states has a value with the same structure as the input \
argument `initial_cell_states` but with tiled shape `[batch_size, beam_size, ...]`. \
`finished` is a `bool` tensor filled by False with shape `[batch_size, beam_size]`.
"""
self.kinf = 1e9
state = flatten(initial_cell_states)[0]
self.batch_size = nn.shape(state)[0]
if bos_ids is not None:
self.start_token = bos_ids
self.start_token_tensor = tensor.fill_constant(shape=[1], dtype="int64", value=self.start_token)
self.end_token_tensor = tensor.fill_constant(shape=[1], dtype="int64", value=self.end_token)
init_cell_states = map_structure(self._expand_to_beam_size, initial_cell_states)
init_inputs = paddle.full(shape=[self.batch_size, self.beam_size],
fill_value=self.start_token_tensor,
dtype=self.start_token_tensor.dtype)
log_probs = paddle.tile(tensor.assign(np.array([[0.] + [-self.kinf] * (self.beam_size - 1)], dtype="float32")),
[self.batch_size, 1])
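# Only beam 0 of each sample starts at log-prob 0; the rest start at -1e9,
# so the first topk effectively expands a single beam per sample.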
if paddle.get_default_dtype() == "float64":
log_probs = tensor.cast(log_probs, "float64")
# TODO: remove the restriction of force_cpu
init_finished = tensor.fill_constant_batch_size_like(input=state,
shape=[-1, self.beam_size],
dtype="bool",
value=False,
force_cpu=True)
init_lengths = tensor.zeros_like(init_inputs)
init_inputs = self.embedding_fn(init_inputs) if self.embedding_fn else init_inputs
return init_inputs, self.StateWrapper(init_cell_states, log_probs, init_finished, init_lengths), init_finished
def _beam_search_step(self, time, logits, next_cell_states, beam_state):
r"""
Calculate scores and select candidate token ids.
Parameters:
time(Variable): An `int64` tensor with shape `[1]` provided by the caller,
representing the current time step number of decoding.
logits(Variable): A tensor with shape `[batch_size, beam_size, vocab_size]`,
representing the logits at the current time step. Its data type is float32.
next_cell_states(Variable): A (possibly nested structure of) tensor variable[s].
It has the same structure, shape and data type as the `cell_states` of
`initial_states` returned by `initialize()`. It represents the next state
from the cell.
beam_state(Variable): A structure of tensor variables.
It is the same as the `initial_states` returned by `initialize()` for
the first decoding step and the `beam_search_state` returned by
`step()` for the others.
Returns:
tuple: A tuple( :code:`(beam_search_output, beam_search_state)` ). \
`beam_search_output` is a namedtuple(including scores, predicted_ids, \
parent_ids as fields) of tensor variables, where \
`scores, predicted_ids, parent_ids` all have tensor values shaped \
`[batch_size, beam_size]` with data types `float32, int64, int64` respectively. \
`beam_search_state` has the same structure, shape and data type \
as the input argument `beam_state`.
"""
self.vocab_size = logits.shape[-1]
self.vocab_size_tensor = tensor.fill_constant(shape=[1], dtype="int64", value=self.vocab_size)
noend_array = [-self.kinf] * self.vocab_size
noend_array[self.end_token] = 0
self.noend_mask_tensor = tensor.assign(np.array(noend_array, "float32"))
if paddle.get_default_dtype() == "float64":
self.noend_mask_tensor = tensor.cast(self.noend_mask_tensor, "float64")
step_log_probs = nn.log(nn.softmax(logits))
step_log_probs = self._mask_probs(step_log_probs, beam_state.finished)
log_probs = nn.elementwise_add(x=step_log_probs, y=beam_state.log_probs, axis=0)
# TODO: length penalty
scores = log_probs
scores = nn.reshape(scores, [-1, self.beam_size * self.vocab_size])
# TODO: add grad for topk then this beam search can be used to train
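# topk runs over the flattened beam_size * vocab_size axis, so each selected
# index decomposes as beam_index * vocab_size + token_index below.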
topk_scores, topk_indices = paddle.topk(x=scores, k=self.beam_size)
beam_indices = nn.elementwise_floordiv(topk_indices, self.vocab_size_tensor)
token_indices = nn.elementwise_mod(topk_indices, self.vocab_size_tensor)
next_log_probs = self._gather(nn.reshape(log_probs, [-1, self.beam_size * self.vocab_size]), topk_indices,
self.batch_size)
next_cell_states = map_structure(lambda x: self._gather(x, beam_indices, self.batch_size), next_cell_states)
next_finished = self._gather(beam_state.finished, beam_indices, self.batch_size)
next_lengths = self._gather(beam_state.lengths, beam_indices, self.batch_size)
next_lengths = next_lengths + tensor.cast(nn.logical_not(next_finished), beam_state.lengths.dtype)
next_finished = control_flow.logical_or(next_finished, control_flow.equal(token_indices, self.end_token_tensor))
beam_search_output = self.OutputWrapper(topk_scores, token_indices, beam_indices)
beam_search_state = self.StateWrapper(next_cell_states, next_log_probs, next_finished, next_lengths)
return beam_search_output, beam_search_state
def step(self, time, inputs, states, **kwargs):
r"""
Perform a beam search decoding step, which uses `cell` to get probabilities,
and follows a beam search step to calculate scores and select candidate
token ids.
Parameters:
time(Variable): An `int64` tensor with shape `[1]` provided by the caller,
representing the current time step number of decoding.
inputs(Variable): A tensor variable. It is the same as the `initial_inputs`
returned by `initialize()` for the first decoding step and the
`next_inputs` returned by `step()` for the others.
states(Variable): A structure of tensor variables.
It is the same as the `initial_states` returned by `initialize()` for
the first decoding step and the `beam_search_state` returned by
`step()` for the others.
**kwargs: Additional keyword arguments, provided by the caller.
Returns:
tuple: A tuple( :code:`(beam_search_output, beam_search_state, next_inputs, finished)` ). \
`beam_search_state` and `next_inputs` have the same structure, \
shape and data type as the input arguments `states` and `inputs` respectively. \
`beam_search_output` is a namedtuple(including scores, predicted_ids, \
parent_ids as fields) of tensor variables, where \
`scores, predicted_ids, parent_ids` all have tensor values shaped \
`[batch_size, beam_size]` with data types `float32, int64, int64` respectively. \
`finished` is a `bool` tensor with shape `[batch_size, beam_size]`.
"""
inputs = map_structure(self._merge_batch_beams, inputs)
cell_states = map_structure(self._merge_batch_beams, states.cell_states)
cell_outputs, next_cell_states = self.cell(inputs, cell_states, **kwargs)
cell_outputs = map_structure(self._split_batch_beams, cell_outputs)
next_cell_states = map_structure(self._split_batch_beams, next_cell_states)
if self.output_fn is not None:
cell_outputs = self.output_fn(cell_outputs)
beam_search_output, beam_search_state = self._beam_search_step(time=time,
logits=cell_outputs,
next_cell_states=next_cell_states,
beam_state=states)
finished = beam_search_state.finished
sample_ids = beam_search_output.predicted_ids
sample_ids.stop_gradient = True
next_inputs = self.embedding_fn(sample_ids) if self.embedding_fn else sample_ids
return (beam_search_output, beam_search_state, next_inputs, finished)
def finalize(self, outputs, final_states, sequence_lengths):
r"""
Use `gather_tree` to backtrace along the beam search tree and construct
the full predicted sequences.
Parameters:
outputs(Variable): A structure(namedtuple) of tensor variables.
The structure and data types are the same as `output_dtype`.
Each tensor stacks all time steps' outputs and thus has shape
`[time_step, batch_size, ...]`, which is done by the caller.
final_states(Variable): A structure(namedtuple) of tensor variables.
It is the `next_states` returned by `decoder.step` at last
decoding step, and thus has the same structure, shape and data type
as the states at any time step.
sequence_lengths(Variable): An `int64` tensor shaped `[batch_size, beam_size]`.
It contains sequence lengths for each beam determined during
decoding.
Returns:
tuple: A tuple( :code:`(predicted_ids, final_states)` ). \
`predicted_ids` is an `int64` tensor shaped \
`[time_step, batch_size, beam_size]`. `final_states` is the same \
as the input argument `final_states`.
"""
predicted_ids = nn.gather_tree(outputs.predicted_ids, outputs.parent_ids)
# TODO: use FinalBeamSearchDecoderOutput as output
return predicted_ids, final_states
@property
def tracks_own_finished(self):
"""
BeamSearchDecoder reorders its beams and their finished states. Thus it
conflicts with `dynamic_decode` function's tracking of finished states.
This property is set to `True` to avoid early stopping of decoding due to
mismanagement of the finished state.
Returns:
bool: A python bool `True`.
"""
return True
def _dynamic_decode_imperative(decoder,
inits=None,
max_step_num=None,
output_time_major=False,
impute_finished=False,
is_test=False,
return_length=False,
bos_ids=None,
**kwargs):
def _maybe_copy(state, new_state, step_mask):
# TODO: use where_op
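# Computes where(step_mask, state, new_state) without a select op:
# state * mask - new_state * (mask - 1) keeps the old state for finished
# entries (mask == 1) and takes the new state otherwise.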
state_dtype = state.dtype
if convert_dtype(state_dtype) in ["bool"]:
state = tensor.cast(state, dtype="float32")
new_state = tensor.cast(new_state, dtype="float32")
if step_mask.dtype != state.dtype:
step_mask = tensor.cast(step_mask, dtype=state.dtype)
# Otherwise the renamed bool gradients would be summed up, leading
# to a sum(bool) error.
step_mask.stop_gradient = True
new_state = nn.elementwise_mul(state, step_mask, axis=0) - nn.elementwise_mul(new_state, (step_mask - 1),
axis=0)
if convert_dtype(state_dtype) in ["bool"]:
new_state = tensor.cast(new_state, dtype=state_dtype)
return new_state
initial_inputs, initial_states, initial_finished = decoder.initialize(inits, bos_ids=bos_ids)
inputs, states, finished = (initial_inputs, initial_states, initial_finished)
cond = control_flow.logical_not((nn.reduce_all(initial_finished)))
sequence_lengths = tensor.cast(tensor.zeros_like(initial_finished), "int64")
outputs = None
step_idx = 0
step_idx_tensor = tensor.fill_constant(shape=[1], dtype="int64", value=step_idx)
while cond.numpy():
(step_outputs, next_states, next_inputs, next_finished) = decoder.step(step_idx_tensor, inputs, states,
**kwargs)
if not decoder.tracks_own_finished:
# BeamSearchDecoder tracks its own finished states, since
# beams are reordered and the finished status of each
# entry might change. Otherwise, perform a logical OR, which
# does not change entries that are already finished.
next_finished = control_flow.logical_or(next_finished, finished)
# To confirm states.finished/finished be consistent with
# next_finished.
tensor.assign(next_finished, finished)
next_sequence_lengths = nn.elementwise_add(
sequence_lengths, tensor.cast(control_flow.logical_not(finished), sequence_lengths.dtype))
if impute_finished: # rectify the states for the finished.
next_states = map_structure(lambda x, y: _maybe_copy(x, y, finished), states, next_states)
else:
if not hasattr(next_states, "lengths"):
    warnings.warn(
        "`next_states` has no `lengths` attribute, the returned `sequence_lengths` would be all zeros.")
next_sequence_lengths = getattr(next_states, "lengths", sequence_lengths)
outputs = map_structure(lambda x: ArrayWrapper(x), step_outputs) if step_idx == 0 else map_structure(
lambda x, x_array: x_array.append(x), step_outputs, outputs)
inputs, states, finished, sequence_lengths = (next_inputs, next_states, next_finished, next_sequence_lengths)
control_flow.increment(x=step_idx_tensor, value=1.0, in_place=True)
step_idx += 1
cond = control_flow.logical_not(nn.reduce_all(finished))
if max_step_num is not None and step_idx > max_step_num:
break
final_outputs = map_structure(lambda x: nn.stack(x.array, axis=0), outputs)
final_states = states
try:
final_outputs, final_states = decoder.finalize(final_outputs, final_states, sequence_lengths)
except NotImplementedError:
pass
if not output_time_major:
final_outputs = map_structure(lambda x: nn.transpose(x, [1, 0] + list(range(2, len(x.shape)))), final_outputs)
return (final_outputs, final_states, sequence_lengths) if return_length else (final_outputs, final_states)
def _dynamic_decode_declarative(decoder,
inits=None,
max_step_num=None,
output_time_major=False,
impute_finished=False,
is_test=False,
return_length=False,
**kwargs):
initial_inputs, initial_states, initial_finished = decoder.initialize(inits)
global_inputs, global_states, global_finished = (initial_inputs, initial_states, initial_finished)
global_finished.stop_gradient = True
step_idx = tensor.fill_constant(shape=[1], dtype="int64", value=0)
cond = control_flow.logical_not((nn.reduce_all(initial_finished)))
if max_step_num is not None:
max_step_num = tensor.fill_constant(shape=[1], dtype="int64", value=max_step_num)
while_op = control_flow.While(cond, is_test=is_test)
sequence_lengths = tensor.cast(tensor.zeros_like(initial_finished), "int64")
sequence_lengths.stop_gradient = True
if is_test:
# for test, reuse inputs and states variables to save memory
inputs = map_structure(lambda x: x, initial_inputs)
states = map_structure(lambda x: x, initial_states)
else:
# inputs and states of all steps must be saved for backward and training
inputs_arrays = map_structure(lambda x: control_flow.array_write(x, step_idx), initial_inputs)
states_arrays = map_structure(lambda x: control_flow.array_write(x, step_idx), initial_states)
def _maybe_copy(state, new_state, step_mask):
# TODO: use where_op
state_dtype = state.dtype
if convert_dtype(state_dtype) in ["bool"]:
state = tensor.cast(state, dtype="float32")
new_state = tensor.cast(new_state, dtype="float32")
if step_mask.dtype != state.dtype:
step_mask = tensor.cast(step_mask, dtype=state.dtype)
# Otherwise the renamed bool gradients would be summed up, leading
# to a sum(bool) error.
step_mask.stop_gradient = True
new_state = nn.elementwise_mul(state, step_mask, axis=0) - nn.elementwise_mul(new_state, (step_mask - 1),
axis=0)
if convert_dtype(state_dtype) in ["bool"]:
new_state = tensor.cast(new_state, dtype=state_dtype)
return new_state
def _transpose_batch_time(x):
return nn.transpose(x, [1, 0] + list(range(2, len(x.shape))))
def _create_array_out_of_while(dtype):
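# Temporarily switch to the parent block so that the TensorArray is
# created outside the while sub-block and persists across loop iterations.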
current_block_idx = default_main_program().current_block_idx
default_main_program().current_block_idx = default_main_program().current_block().parent_idx
tensor_array = control_flow.create_array(dtype)
default_main_program().current_block_idx = current_block_idx
return tensor_array
# While
with while_op.block():
if not is_test:
inputs = map_structure(lambda array: control_flow.array_read(array, step_idx), inputs_arrays)
states = map_structure(lambda array: control_flow.array_read(array, step_idx), states_arrays)
(outputs, next_states, next_inputs, next_finished) = decoder.step(step_idx, inputs, states, **kwargs)
if not decoder.tracks_own_finished:
# BeamSearchDecoder tracks its own finished states, since beams
# are reordered and the finished status of each entry might change.
# Otherwise, perform a logical OR, which does not change entries
# that are already finished.
next_finished = control_flow.logical_or(next_finished, global_finished)
next_sequence_lengths = nn.elementwise_add(
sequence_lengths, tensor.cast(control_flow.logical_not(global_finished), sequence_lengths.dtype))
if impute_finished: # rectify the states for the finished.
next_states = map_structure(
lambda x, y: _maybe_copy(x, y, global_finished),
states,
next_states,
)
else:
if not hasattr(next_states, "lengths"):
    warnings.warn(
        "`next_states` has no `lengths` attribute, the returned `sequence_lengths` would be all zeros.")
next_sequence_lengths = getattr(next_states, "lengths", sequence_lengths)
# create tensor array in global block after dtype[s] of outputs can be got
outputs_arrays = map_structure(lambda x: _create_array_out_of_while(x.dtype), outputs)
map_structure(lambda x, x_array: control_flow.array_write(x, i=step_idx, array=x_array), outputs,
outputs_arrays)
control_flow.increment(x=step_idx, value=1.0, in_place=True)
# update the global_finished first, since it might be also in states of
# decoder, which otherwise would write a stale finished status to array
tensor.assign(next_finished, global_finished)
tensor.assign(next_sequence_lengths, sequence_lengths)
if is_test:
map_structure(tensor.assign, next_inputs, global_inputs)
map_structure(tensor.assign, next_states, global_states)
else:
map_structure(lambda x, x_array: control_flow.array_write(x, i=step_idx, array=x_array), next_inputs,
inputs_arrays)
map_structure(lambda x, x_array: control_flow.array_write(x, i=step_idx, array=x_array), next_states,
states_arrays)
if max_step_num is not None:
control_flow.logical_and(control_flow.logical_not(nn.reduce_all(global_finished)),
control_flow.less_equal(step_idx, max_step_num), cond)
else:
control_flow.logical_not(nn.reduce_all(global_finished), cond)
final_outputs = map_structure(lambda array: tensor.tensor_array_to_tensor(array, axis=0, use_stack=True)[0],
outputs_arrays)
if is_test:
final_states = global_states
else:
final_states = map_structure(lambda array: control_flow.array_read(array, step_idx), states_arrays)
try:
final_outputs, final_states = decoder.finalize(final_outputs, final_states, sequence_lengths)
except NotImplementedError:
pass
if not output_time_major:
final_outputs = map_structure(_transpose_batch_time, final_outputs)
return (final_outputs, final_states, sequence_lengths) if return_length else (final_outputs, final_states)
def dynamic_decode(decoder,
inits=None,
max_step_num=None,
output_time_major=False,
impute_finished=False,
is_test=False,
return_length=False,
bos_ids=None,
**kwargs):
r"""
Dynamic decoding performs :code:`decoder.step()` repeatedly until the returned
Tensor indicating finished status contains all True values or the number of
decoding steps reaches :attr:`max_step_num`.
:code:`decoder.initialize()` would be called once before the decoding loop.
If the `decoder` has implemented `finalize` method, :code:`decoder.finalize()`
would be called once after the decoding loop.
Parameters:
decoder(Decoder): An instance of `Decoder`.
inits(object, optional): Argument passed to `decoder.initialize`.
Default `None`.
max_step_num(int, optional): The maximum number of steps. If not provided,
decode until the decoder is fully done, or in other words, the returned
Tensor by :code:`decoder.step()` indicating finished status contains
all True. Default `None`.
output_time_major(bool, optional): Indicate the data layout of Tensor included
in the final outputs(the first returned value of this method). If
attr:`False`, the data layout would be batch major with shape
`[batch_size, seq_len, ...]`. If attr:`True`, the data layout would
be time major with shape `[seq_len, batch_size, ...]`. Default: `False`.
impute_finished(bool, optional): If `True` and `decoder.tracks_own_finished`
is False, then states get copied through for batch entries which are
marked as finished, while unfinished entries use the new states returned
by :code:`decoder.step()`; this ensures that the final states have the
correct values. Otherwise, states are not copied through when finished.
If the returned `final_states` is needed, it should be set to True,
which causes some slowdown. Default `False`.
is_test(bool, optional): A flag indicating whether to use test mode. Test
mode is more memory-efficient. Default `False`.
return_length(bool, optional): A flag indicating whether to return an
extra Tensor variable in the output tuple, which stores the actual
lengths of all decoded sequences. Default `False`.
bos_ids(int, optional): If provided, passed to `decoder.initialize` to
override the decoder's start token id. Default `None`.
**kwargs: Additional keyword arguments. Arguments passed to `decoder.step`.
Returns:
tuple: A tuple( :code:`(final_outputs, final_states, sequence_lengths)` ) \
when `return_length` is True, otherwise a tuple( :code:`(final_outputs, final_states)` ). \
The final outputs and states, both are Tensor or nested structure of Tensor. \
`final_outputs` has the same structure and data types as the :code:`outputs` \
returned by :code:`decoder.step()` , and each Tensor in `final_outputs` \
is stacked from all decoding steps' outputs, which might be revised \
by :code:`decoder.finalize()` if the decoder has implemented `finalize`. \
`final_states` is the counterpart at last time step of initial states \
returned by :code:`decoder.initialize()` , thus has the same structure \
with it and has tensors with same shapes and data types. `sequence_lengths` \
is an `int64` tensor with the same shape as `finished` returned \
by :code:`decoder.initialize()` , and it stores the actual lengths of \
all decoded sequences.
Examples:
.. code-block:: python
import numpy as np
import paddle
from paddle.nn import BeamSearchDecoder, dynamic_decode
from paddle.nn import GRUCell, Linear, Embedding
trg_embeder = Embedding(100, 32)
output_layer = Linear(32, 32)
decoder_cell = GRUCell(input_size=32, hidden_size=32)
decoder = BeamSearchDecoder(decoder_cell,
start_token=0,
end_token=1,
beam_size=4,
embedding_fn=trg_embeder,
output_fn=output_layer)
encoder_output = paddle.ones((4, 8, 32), dtype=paddle.get_default_dtype())
outputs = dynamic_decode(decoder=decoder,
inits=decoder_cell.get_initial_states(encoder_output),
max_step_num=10)
"""
if in_dygraph_mode():
return _dynamic_decode_imperative(decoder, inits, max_step_num, output_time_major, impute_finished, is_test,
return_length, bos_ids, **kwargs)
else:
print(">>> hello_debug: not support")
#return _dynamic_decode_declarative(decoder, inits, max_step_num,
# output_time_major, impute_finished,
# is_test, return_length, **kwargs)
class DecodeHelper(object):
"""
DecodeHelper is the base class for any helper instance used in `BasicDecoder`.
It provides the interface to implement sampling and to produce inputs for the
next time step in dynamic decoding.
"""
def initialize(self):
r"""
DecodeHelper initialization produces inputs for the first decoding step
and gives the initial status telling whether each sequence in the batch
is finished. It is a part of the initialization of `BasicDecoder`.
Returns:
tuple: A tuple( :code:`(initial_inputs, initial_finished)` ). \
`initial_inputs` is a (possibly nested structure of) tensor \
variable[s], and the tensor's shape is `[batch_size, ...]`. \
`initial_finished` is a bool tensor with shape `[batch_size]`.
"""
pass
def sample(self, time, outputs, states):
"""
Perform sampling with some strategy according to `outputs`. It is a
part of `BasicDecoder.step`.
Parameters:
time(Variable): An `int64` tensor with shape `[1]` provided by the caller,
representing the current time step number of decoding.
outputs(Variable): A tensor variable. Usually its data type is float32
or float64, and its shape is `[batch_size, vocabulary_size]`,
representing the predicted logits of the current step. It is the same as
the `outputs` returned by `BasicDecoder.output_fn(BasicDecoder.cell.call())`.
states(Variable): A (possibly nested structure of) tensor variable[s].
It is the same as the `new_states` returned by `BasicDecoder.cell.call()`.
Returns:
Variable: An `int64` tensor representing the sampled ids.
"""
pass
def next_inputs(self, time, outputs, states, sample_ids):
r"""
Produce the inputs and states for the next time step and give the status
telling whether each minibatch entry is finished. It is called after
`sample` in `BasicDecoder.step`, as a part of `BasicDecoder.step`.
Parameters:
time(Variable): An `int64` tensor with shape `[1]` provided by the caller,
representing the current time step number of decoding.
outputs(Variable): A tensor variable. Usually its data type is float32
or float64, and its shape is `[batch_size, vocabulary_size]`,
representing the predicted logits of the current step. It is the same as
the `outputs` returned by `BasicDecoder.output_fn(BasicDecoder.cell.call())`.
states(Variable): A (possibly nested structure of) tensor variable[s].
It is the same as the `new_states` returned by `BasicDecoder.cell.call()`.
sample_ids(Variable): A (possibly nested structure of) tensor variable[s].
It is the same as the `sample_ids` returned by `sample()`.
Returns:
tuple: A tuple( :code:`(finished, next_inputs, next_states)` ). \
`next_inputs` and `next_states` both are a (possibly nested \
structure of) tensor variable[s], and the structure, shape and \
data type of `next_states` must be same as the input argument \
`states`. `finished` is a bool tensor with shape `[batch_size]`.
"""
pass
class TrainingHelper(DecodeHelper):
"""
TrainingHelper is a subclass of DecodeHelper. It is a decoding helper that
slices the full sequence inputs to produce the inputs for the corresponding
step, and it uses `argmax` to sample from the outputs of `cell.call()`.
Since it requires full sequence inputs, it is mostly used for teacher-forcing
MLE (maximum likelihood) training, and the sampled ids are not used.
Examples:
.. code-block:: python
import paddle.fluid as fluid
import paddle.fluid.layers as layers
trg_emb = fluid.data(name="trg_emb",
shape=[None, None, 128],
dtype="float32")
trg_seq_length = fluid.data(name="trg_seq_length",
shape=[None],
dtype="int64")
helper = layers.TrainingHelper(trg_emb, trg_seq_length)
decoder_cell = layers.GRUCell(hidden_size=128)
decoder = layers.BasicDecoder(decoder_cell, helper)
outputs = layers.dynamic_decode(
decoder,
inits=decoder_cell.get_initial_states(trg_emb),
is_test=False)
"""
def __init__(self, inputs, sequence_length, time_major=False):
"""
Constructor of TrainingHelper.
Parameters:
inputs(Variable): A (possibly nested structure of) tensor variable[s].
The shape of tensor should be `[batch_size, sequence_length, ...]`
for `time_major == False` or `[sequence_length, batch_size, ...]`
for `time_major == True`. It represents the inputs to be sliced
from at every decoding step.
sequence_length(Variable): A tensor with shape `[batch_size]`.
It stores real length of each instance in `inputs`, by which we
can label the finished status of each instance at every decoding
step.
time_major(bool, optional): Indicate the data layout of Tensor included
in `inputs`. If `False`, the data layout would be batch major with
shape `[batch_size, sequence_length, ...]`. If `True`, the data
layout would be time major with shape `[sequence_length, batch_size, ...]`.
Default: `False`.
"""
self.inputs = inputs
self.sequence_length = sequence_length
self.time_major = time_major
# extend inputs to avoid to slice out of range in `next_inputs`
# may be easier and have better performance than condition_op
self.inputs_ = map_structure(
lambda x: nn.pad(x,
paddings=([0, 1] + [0, 0] * (len(x.shape) - 1))
if time_major else ([0, 0, 0, 1] + [0, 0] * (len(x.shape) - 2))), self.inputs)
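# e.g. batch-major inputs [batch, max_len, ...] become [batch, max_len + 1, ...],
# so the slice at step max_len (where `finished` flips) stays in range.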
def initialize(self):
r"""
TrainingHelper initialization produces inputs for the first decoding
step by slicing at the first time step of the full sequence inputs, and it
gives the initial status telling whether each sequence in the batch is
finished. It is a part of the initialization of `BasicDecoder`.
Returns:
tuple: A tuple( :code:`(initial_inputs, initial_finished)` ). \
`initial_inputs` is a (possibly nested structure of) tensor \
variable[s], and the tensor's shape is `[batch_size, ...]`. \
`initial_finished` is a bool tensor with shape `[batch_size]`.
"""
init_finished = control_flow.equal(self.sequence_length,
tensor.fill_constant(shape=[1], dtype=self.sequence_length.dtype, value=0))
# TODO: support zero length
init_inputs = map_structure(lambda x: x[0] if self.time_major else x[:, 0], self.inputs)
return init_inputs, init_finished
def sample(self, time, outputs, states):
r"""
Perform sampling by using `argmax` according to the `outputs`. Mostly
the sampled ids are not used, since the inputs for the next decoding
step are obtained by slicing.
Parameters:
time(Variable): An `int64` tensor with shape `[1]` provided by the
caller, representing the current time step number of decoding.
outputs(Variable): A tensor variable. Usually its data type is float32
or float64, and its shape is `[batch_size, vocabulary_size]`,
representing the predicted logits of the current step. It is the same as
the `outputs` returned by `BasicDecoder.output_fn(BasicDecoder.cell.call())`.
states(Variable): A (possibly nested structure of) tensor variable[s].
It is the same as the `new_states` returned by `BasicDecoder.cell.call()`.
Returns:
Variable: An `int64` tensor with shape `[batch_size]`, representing \
the sampled ids.
"""
sample_ids = tensor.argmax(outputs, axis=-1)
return sample_ids
def next_inputs(self, time, outputs, states, sample_ids):
r"""
Generate inputs for the next decoding step by slicing at the corresponding
step of the full sequence inputs. Simultaneously, produce the states for
the next time step by directly using the input `states`, and emit the status
telling whether each minibatch entry reaches the corresponding length.
Parameters:
time(Variable): An `int64` tensor with shape `[1]` provided by the
caller, representing the current time step number of decoding.
outputs(Variable): A tensor variable. Usually its data type is float32
or float64, and its shape is `[batch_size, vocabulary_size]`,
representing the predicted logits of the current step. It is the same as
the `outputs` returned by `BasicDecoder.output_fn(BasicDecoder.cell.call())`.
states(Variable): A (possibly nested structure of) tensor variable[s].
It is the same as the `new_states` returned by `BasicDecoder.cell.call()`.
sample_ids(Variable): An `int64` tensor variable shaped `[batch_size]`.
It is the same as the `sample_ids` returned by `sample()`.
Returns:
tuple: A tuple( :code:`(finished, next_inputs, next_states)` ). \
`next_inputs` and `next_states` both are a (possibly nested \
structure of) tensor variable[s], and the tensor's shape is \
`[batch_size, ...]`. `next_states` is identical to the input \
argument `states`. `finished` is a `bool` Tensor with \
shape `[batch_size]`.
"""
# TODO: compatibility of int32 and int64
time = tensor.cast(time, "int32") if convert_dtype(time.dtype) not in ["int32"] else time
if self.sequence_length.dtype != time.dtype:
self.sequence_length = tensor.cast(self.sequence_length, time.dtype)
next_time = time + 1
finished = control_flow.less_equal(self.sequence_length, next_time)
def _slice(x): # TODO: use Variable.__getitem__
axes = [0 if self.time_major else 1]
return nn.squeeze(nn.slice(x, axes=axes, starts=[next_time], ends=[next_time + 1]), axes=axes)
next_inputs = map_structure(_slice, self.inputs_)
return finished, next_inputs, states
class GreedyEmbeddingHelper(DecodeHelper):
"""
GreedyEmbeddingHelper is a subclass of DecodeHelper. It is a decoding helper
that uses the argmax of the output (treated as logits) and passes the result
through an embedding layer to get inputs for the next decoding step.
Examples:
.. code-block:: python
import paddle.fluid as fluid
import paddle.fluid.layers as layers
trg_emb = fluid.data(name="trg_emb",
shape=[None, None, 128],
dtype="float32")
trg_embeder = lambda x: fluid.embedding(
x, size=[10000, 128], param_attr=fluid.ParamAttr(name="trg_embedding"))
output_layer = lambda x: layers.fc(x,
size=10000,
num_flatten_dims=len(x.shape) - 1,
param_attr=fluid.ParamAttr(name=
"output_w"),
bias_attr=False)
helper = layers.GreedyEmbeddingHelper(trg_embeder, start_tokens=0, end_token=1)
decoder_cell = layers.GRUCell(hidden_size=128)
decoder = layers.BasicDecoder(decoder_cell, helper, output_fn=output_layer)
outputs = layers.dynamic_decode(
decoder=decoder, inits=decoder_cell.get_initial_states(encoder_output))
"""
def __init__(self, embedding_fn, start_tokens, end_token):
r"""
Constructor of GreedyEmbeddingHelper.
Parameters:
embedding_fn(callable): A functor to apply on the argmax results.
Mostly it is an embedding layer to transform ids to embeddings.
**Note that fluid.embedding should be used here rather than
fluid.layers.embedding, since the shape of ids is [batch_size];
when using fluid.layers.embedding, the ids must be unsqueezed in embedding_fn.**
start_tokens(Variable): A `int64` tensor shaped `[batch_size]`,
representing the start tokens.
end_token(int): The end token id.
"""
self.embedding_fn = embedding_fn
self.start_tokens = start_tokens
self.end_token = tensor.fill_constant(shape=[1], dtype="int64", value=end_token)
def initialize(self):
r"""
GreedyEmbeddingHelper initialization produces inputs for the first decoding
step by using the `start_tokens` of the constructor, and gives the initial
status telling whether each sequence in the batch is finished.
It is a part of the initialization of `BasicDecoder`.
Returns:
tuple: A tuple( :code:`(initial_inputs, initial_finished)` ). \
`initial_inputs` is same as `start_tokens` of the constructor. \
`initial_finished` is a `bool` tensor filled by False and has \
the same shape as `start_tokens`.
"""
# TODO: remove the restriction of force_cpu
init_finished = tensor.fill_constant_batch_size_like(input=self.start_tokens,
shape=[-1],
dtype="bool",
value=False,
force_cpu=True)
init_inputs = self.embedding_fn(self.start_tokens)
return init_inputs, init_finished
def sample(self, time, outputs, states):
r"""
Perform sampling by using `argmax` according to the `outputs`.
Parameters:
time(Variable): An `int64` tensor with shape `[1]` provided by the
caller, representing the current time step number of decoding.
outputs(Variable): A tensor variable. Usually its data type is float32
or float64, and its shape is `[batch_size, vocabulary_size]`,
representing the predicted logits of the current step. It is the same as
the `outputs` returned by `BasicDecoder.output_fn(BasicDecoder.cell.call())`.
states(Variable): A (possibly nested structure of) tensor variable[s].
It is the same as the `new_states` returned by `BasicDecoder.cell.call()`.
Returns:
Variable: An `int64` tensor with shape `[batch_size]`, representing \
the sampled ids.
"""
sample_ids = tensor.argmax(outputs, axis=-1)
return sample_ids
def next_inputs(self, time, outputs, states, sample_ids):
r"""
Generate inputs for the next decoding step by applying `embedding_fn`
to `sample_ids`. Simultaneously, produce the states for the next time step
by directly using the input `states`, and emit the status telling whether
each minibatch entry has produced an `end_token` sample.
Parameters:
time(Variable): An `int64` tensor with shape `[1]` provided by the
caller, representing the current time step number of decoding.
outputs(Variable): A tensor variable. Usually its data type is float32
or float64, and its shape is `[batch_size, vocabulary_size]`,
representing the predicted logits of the current step. It is the same as
the `outputs` returned by `BasicDecoder.output_fn(BasicDecoder.cell.call())`.
states(Variable): A (possibly nested structure of) tensor variable[s].
It is the same as the `new_states` returned by `BasicDecoder.cell.call()`.
sample_ids(Variable): An `int64` tensor variable shaped `[batch_size]`.
It is the same as the `sample_ids` returned by `sample()`.
Returns:
tuple: A tuple( :code:`(finished, next_inputs, next_states)` ). \
`next_inputs` and `next_states` both are a (possibly nested \
structure of) tensor variable[s], and the tensor's shape is \
`[batch_size, ...]`. `next_states` is identical to the input \
argument `states`. `finished` is a `bool` Tensor with \
shape `[batch_size]`.
"""
finished = control_flow.equal(sample_ids, self.end_token)
next_inputs = self.embedding_fn(sample_ids)
return finished, next_inputs, states
class BasicDecoder(Decoder):
"""
BasicDecoder is a subclass of Decoder. It assembles an RNNCell and a DecodeHelper
instance as members, where the DecodeHelper helps to implement customized
decoding strategies. It performs one decoding step as the following steps:
1. Perform `cell_outputs, cell_states = cell.call(inputs, states)`
to get outputs and new states from cell.
2. Perform `sample_ids = helper.sample(time, cell_outputs, cell_states)`
to sample ids as decoded results of the current time step.
3. Perform `finished, next_inputs, next_states = helper.next_inputs(time,
cell_outputs, cell_states, sample_ids)` to generate inputs, states and
finished status for the next decoding step.
Examples:
.. code-block:: python
import paddle.fluid as fluid
import paddle.fluid.layers as layers
encoder_output = fluid.data(name="encoder_output",
                            shape=[None, None, 128],
                            dtype="float32")
trg_embeder = lambda x: fluid.embedding(
x, size=[10000, 128], param_attr=fluid.ParamAttr(name="trg_embedding"))
output_layer = lambda x: layers.fc(x,
size=10000,
num_flatten_dims=len(x.shape) - 1,
param_attr=fluid.ParamAttr(name=
"output_w"),
bias_attr=False)
helper = layers.SampleEmbeddingHelper(trg_embeder, start_tokens=0, end_token=1)
decoder_cell = layers.GRUCell(hidden_size=128)
decoder = layers.BasicDecoder(decoder_cell, helper, output_fn=output_layer)
outputs = layers.dynamic_decode(
decoder=decoder, inits=decoder_cell.get_initial_states(encoder_output))
"""
def __init__(self, cell, helper, output_fn=None):
"""
Constructor of BasicDecoder.
Parameters:
cell(RNNCell): An instance of `RNNCell` or object with the same interface.
helper(DecodeHelper): An instance of `DecodeHelper`.
output_fn(optional): A callable to apply to the cell's output prior to
sampling. Default None.
"""
self.cell = cell
self.helper = helper
self.output_fn = output_fn
def initialize(self, initial_cell_states):
r"""
BasicDecoder initialization includes helper initialization and cell
initialization, and cell initialization uses `initial_cell_states` as
the result directly.
Parameters:
initial_cell_states(Variable): A (possibly nested structure of)
tensor variable[s]. An argument provided by the caller `dynamic_decode`.
Returns:
tuple: A tuple( :code:`(initial_inputs, initial_cell_states, finished)` ). \
`initial_inputs` and `initial_cell_states` both are a (possibly nested \
structure of) tensor variable[s], and `finished` is a tensor with \
bool data type. `initial_inputs` and `finished` are the results \
of `helper.initialize()`, and `initial_cell_states` is the same as \
the input argument counterpart.
"""
(initial_inputs, initial_finished) = self.helper.initialize()
return initial_inputs, initial_cell_states, initial_finished
class OutputWrapper(collections.namedtuple("OutputWrapper", ("cell_outputs", "sample_ids"))):
"""
The structure for the returned value `outputs` of `decoder.step`.
A namedtuple includes cell_outputs, sample_ids as fields.
"""
pass
def step(self, time, inputs, states, **kwargs):
r"""
Perform one decoding step as the following steps:
1. Perform `cell_outputs, cell_states = cell.call(inputs, states)`
to get outputs and new states from cell.
2. Perform `sample_ids = helper.sample(time, cell_outputs, cell_states)`
to sample ids as decoded results of the current time step.
3. Perform `finished, next_inputs, next_states = helper.next_inputs(time,
cell_outputs, cell_states, sample_ids)` to generate inputs, states and
finished status for the next decoding step.
Parameters:
time(Variable): An `int64` tensor with shape `[1]` provided by the caller,
representing the current time step number of decoding.
inputs(Variable): A tensor variable. It is the same as the `initial_inputs`
returned by `initialize()` for the first decoding step and the
`next_inputs` returned by `step()` for the others.
states(Variable): A structure of tensor variables.
It is the same as the `initial_cell_states` returned by `initialize()`
for the first decoding step and the `next_states` returned by
`step()` for the others.
**kwargs: Additional keyword arguments, provided by the caller
`dynamic_decode`.
Returns:
tuple: A tuple( :code:`(outputs, next_states, next_inputs, finished)` ). \
`outputs` is a namedtuple(including cell_outputs, sample_ids, \
as fields) of tensor variables, where `cell_outputs` is the result \
of `cell.call()` and `sample_ids` is the result of `helper.sample()`. \
`next_states` and `next_inputs` have the same structure, shape \
and data type as the input arguments `states` and `inputs` respectively. \
`finished` is a `bool` tensor with shape `[batch_size]`.
"""
cell_outputs, cell_states = self.cell(inputs, states, **kwargs)
if self.output_fn is not None:
cell_outputs = self.output_fn(cell_outputs)
sample_ids = self.helper.sample(time=time, outputs=cell_outputs, states=cell_states)
sample_ids.stop_gradient = True
(finished, next_inputs, next_states) = self.helper.next_inputs(time=time,
outputs=cell_outputs,
states=cell_states,
sample_ids=sample_ids)
outputs = self.OutputWrapper(cell_outputs, sample_ids)
return (outputs, next_states, next_inputs, finished)
def beam_search(pre_ids,
pre_scores,
ids,
scores,
beam_size,
end_id,
level=0,
is_accumulated=True,
name=None,
return_parent_idx=False):
r"""
Beam search is a classical algorithm for selecting candidate words in a
machine translation task.
Refer to `Beam search <https://en.wikipedia.org/wiki/Beam_search>`_
for more details.
**This operator only supports LoDTensor.** It is used after finishing
scores calculation to perform beam search for one time step. Specifically,
after ``ids`` and ``scores`` have been produced, it selects the top-K
( `k` is ``beam_size`` ) candidate word ids of current step from ``ids``
according to the corresponding ``scores``. Additionally, ``pre_ids`` and
``pre_scores`` are the outputs of `beam_search` at the previous step; they
are needed to handle ended candidate translations.
Note that if ``is_accumulated`` is True, the ``scores`` passed in should
be accumulated scores. Otherwise, the ``scores`` are considered as the
probabilities of a single step, and will be transformed to the log scale and
added up with ``pre_scores`` to form the final scores in this operator.
Length penalty should be applied with extra operators before calculating
the accumulated scores if needed.
Please see the following demo for a full beam search usage example:
fluid/tests/book/test_machine_translation.py
Args:
pre_ids(Variable): A LoDTensor variable (lod level is 2), representing
the selected ids of previous step. It is the output of beam_search
at previous step. Its shape is `[batch_size, 1]` and its lod is
`[[0, 1, ... , batch_size], [0, 1, ..., batch_size]]` at the
first step. The data type should be int64.
pre_scores(Variable): A LoDTensor variable with the same shape and lod
as ``pre_ids`` , representing the accumulated scores corresponding
to the selected ids of previous step. It is the output of
beam_search at previous step. The data type should be float32 or float64.
ids(Variable|None): A LoDTensor variable containing the candidate ids.
It has the same lod as ``pre_ids`` and its shape should be
`[batch_size * beam_size, K]`, where `K` is supposed to be greater than
``beam_size`` and the first dimension size (which decreases as samples
reach the end) should be the same as that of ``pre_ids`` . The data type
should be int64. It can be None, in which case the indices in ``scores``
are used as the ids.
scores(Variable): A LoDTensor variable containing the accumulated
scores corresponding to ``ids`` . Both its shape and lod are the same as
those of ``ids`` . The data type should be float32 or float64.
beam_size(int): The beam width used in beam search.
end_id(int): The id of end token.
level(int): **It can be ignored and must not be changed currently.**
The 2 level lod used in this operator has the following
meaning: The first level describes how many beams each sample has,
which would change to 0 when beams of the sample all end (batch reduce);
The second level describes how many times each beam is selected.
Default 0, which shouldn't be changed currently.
is_accumulated(bool): Whether the input ``scores`` are accumulated scores.
Default True.
name(str, optional): For detailed information, please refer
to :ref:`api_guide_Name`. Usually the name does not need to be
set and is None by default.
return_parent_idx(bool, optional): Whether to return an extra Tensor variable
in output, which stores the selected ids' parent index in
``pre_ids`` and can be used to update RNN's states by gather operator.
Default False.
Returns:
tuple: The tuple contains two or three LoDTensor variables. The two LoDTensors, \
representing the selected ids and the corresponding accumulated scores of \
the current step, have the same shape `[batch_size, beam_size]` and a 2-level \
lod, and have data types int64 and float32. If ``return_parent_idx`` is True, \
an extra Tensor variable preserving the selected ids' parent index \
is included, whose shape is `[batch_size * beam_size]` and data type \
is int64.
Examples:
.. code-block:: python
import paddle.fluid as fluid
import paddle
paddle.enable_static()
# Suppose `probs` contains predicted results from the computation
# cell and `pre_ids` and `pre_scores` are the outputs of beam_search
# at previous step.
beam_size = 4
end_id = 1
pre_ids = fluid.data(
name='pre_id', shape=[None, 1], lod_level=2, dtype='int64')
pre_scores = fluid.data(
name='pre_scores', shape=[None, 1], lod_level=2, dtype='float32')
probs = fluid.data(
name='probs', shape=[None, 10000], dtype='float32')
topk_scores, topk_indices = fluid.layers.topk(probs, k=beam_size)
accu_scores = fluid.layers.elementwise_add(
x=fluid.layers.log(x=topk_scores),
y=fluid.layers.reshape(pre_scores, shape=[-1]),
axis=0)
selected_ids, selected_scores = fluid.layers.beam_search(
pre_ids=pre_ids,
pre_scores=pre_scores,
ids=topk_indices,
scores=accu_scores,
beam_size=beam_size,
end_id=end_id)
"""
check_variable_and_dtype(pre_ids, 'pre_ids', ['int64'], 'beam_search')
check_variable_and_dtype(pre_scores, 'pre_scores', ['float32', 'float64'], 'beam_search')
check_type(ids, 'ids', (Variable, type(None)), 'beam_search')
check_variable_and_dtype(scores, 'scores', ['float32', 'float64'], 'beam_search')
helper = LayerHelper('beam_search', **locals())
score_type = pre_scores.dtype
id_type = pre_ids.dtype
inputs = {"pre_ids": pre_ids, "pre_scores": pre_scores, "scores": scores}
if ids is not None:
inputs["ids"] = ids
selected_scores = helper.create_variable_for_type_inference(dtype=score_type)
selected_ids = helper.create_variable_for_type_inference(dtype=id_type)
# parent_idx is a tensor used to gather cell states at the next time
# step. Though lod in selected_ids can also be used to gather by
# sequence_expand, it is not efficient.
# gather_op's index input only supports int32 dtype currently
parent_idx = helper.create_variable_for_type_inference(dtype="int32")
helper.append_op(
type='beam_search',
inputs=inputs,
outputs={
'selected_ids': selected_ids,
'selected_scores': selected_scores,
'parent_idx': parent_idx
},
attrs={
# TODO(ChunweiYan) to assure other value support
'level': level,
'beam_size': beam_size,
'end_id': end_id,
'is_accumulated': is_accumulated,
})
if return_parent_idx:
return selected_ids, selected_scores, parent_idx
else:
return selected_ids, selected_scores
def beam_search_decode(ids, scores, beam_size, end_id, name=None):
r"""
This operator is used after beam search has completed. It constructs the
full predicted sequences for each sample by walking back along the search
paths stored in the lod of ``ids`` . The result sequences are stored in a
LoDTensor, which can be parsed in the following way:
.. code-block:: text
If lod = [[0, 3, 6], [0, 12, 24, 40, 54, 67, 82]]
The first level of lod stands for: There are 2 samples, each having 3
(beam width) predicted sequences.
The second level of lod stands for: The lengths of the first sample's
3 predicted sequences are 12, 12, 16; The lengths of the second sample's
3 predicted sequences are 14, 13, 15.
Please see the following demo for a full beam search usage example:
fluid/tests/book/test_machine_translation.py
Args:
ids(Variable): The LoDTensorArray variable containing the selected ids
of all steps. Each LoDTensor in it has int64 data type and 2 level
lod which can be used to get the search paths.
scores(Variable): The LoDTensorArray variable containing the accumulated
scores corresponding to selected ids of all steps. It has the same size
as ``ids`` . Each LoDTensor in it has the same shape and lod as the
counterpart in ``ids`` , and has a float32 data type.
beam_size(int): The beam width used in beam search.
end_id(int): The id of end token.
name(str, optional): For detailed information, please refer
to :ref:`api_guide_Name`. Usually the name does not need to be
set and is None by default.
Returns:
tuple: The tuple contains two LoDTensor variables. The two LoDTensors, \
containing the full sequences of ids and the corresponding accumulated \
scores, have the same shape flattened to 1D and have the same 2 level \
lod. The lod can be used to get how many predicted sequences each sample \
has and how many ids each predicted sequence has.
Examples:
.. code-block:: python
import paddle.fluid as fluid
import paddle
paddle.enable_static()
# Suppose `ids` and `scores` are LoDTensorArray variables holding
# the selected ids and scores of all steps
ids = fluid.layers.create_array(dtype='int64')
scores = fluid.layers.create_array(dtype='float32')
finished_ids, finished_scores = fluid.layers.beam_search_decode(
ids, scores, beam_size=5, end_id=0)
"""
check_variable_and_dtype(ids, 'ids', ['int64'], 'beam_search_decode')
check_variable_and_dtype(scores, 'scores', ['float32'], 'beam_search_decode')
helper = LayerHelper('beam_search_decode', **locals())
sentence_ids = helper.create_variable_for_type_inference(dtype=ids.dtype)
sentence_scores = helper.create_variable_for_type_inference(dtype=scores.dtype)
helper.append_op(type="beam_search_decode",
inputs={
"Ids": ids,
"Scores": scores
},
outputs={
"SentenceIds": sentence_ids,
"SentenceScores": sentence_scores
},
attrs={
"beam_size": beam_size,
"end_id": end_id
})
return sentence_ids, sentence_scores
# copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.nn import Layer
from paddle.nn import Linear
from paddle.nn.initializer import Constant
from paddle.nn.initializer import Normal
from paddle.nn.initializer import TruncatedNormal
# from .base_transformer import QuickGELU
__all__ = [
"VisionTransformer", "ViT_small_patch16_224", "ViT_base_patch16_224", "ViT_base_patch16_384",
"ViT_base_patch32_224", "ViT_base_patch32_384", "ViT_large_patch16_224", "ViT_large_patch16_384",
"ViT_large_patch32_384", "ViT_huge_patch16_224", "ViT_huge_patch32_384", "ViT_large_patch14_224"
]
trunc_normal_ = TruncatedNormal(std=.02)
zeros_ = Constant(value=0.)
ones_ = Constant(value=1.)
class QuickGELU(Layer):
""" GELU """
def forward(self, x):
return x * F.sigmoid(1.702 * x)
def to_2tuple(x):
return tuple([x] * 2)
def drop_path(x, drop_prob=0., training=False):
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
The original name is misleading, as 'Drop Connect' is a different form of dropout from a separate paper.
See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ...
"""
if drop_prob == 0. or not training:
return x
keep_prob = paddle.to_tensor(1 - drop_prob)
shape = (paddle.shape(x)[0], ) + (1, ) * (x.ndim - 1)
random_tensor = keep_prob + paddle.rand(shape, dtype=x.dtype)
random_tensor = paddle.floor(random_tensor) # binarize
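# keep_prob + U[0, 1) >= 1 with probability keep_prob, so floor() yields a
# Bernoulli(keep_prob) mask; dividing by keep_prob preserves the expectation.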
output = x.divide(keep_prob) * random_tensor
return output
class DropPath(nn.Layer):
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
"""
def __init__(self, drop_prob=None):
super(DropPath, self).__init__()
self.drop_prob = drop_prob
def forward(self, x):
return drop_path(x, self.drop_prob, self.training)
class Identity(nn.Layer):
def __init__(self):
super(Identity, self).__init__()
def forward(self, input):
return input
class Mlp(nn.Layer):
def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features
self.fc1 = nn.Linear(in_features, hidden_features)
self.act = act_layer()
self.fc2 = nn.Linear(hidden_features, out_features)
self.drop = nn.Dropout(drop)
def forward(self, x):
x = self.fc1(x)
x = self.act(x)
x = self.drop(x)
x = self.fc2(x)
x = self.drop(x)
return x
class Attention(nn.Layer):
def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
super().__init__()
self.num_heads = num_heads
head_dim = dim // num_heads
self.scale = qk_scale or head_dim**-0.5
self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x):
# B= paddle.shape(x)[0]
N, C = x.shape[1:]
qkv = self.qkv(x).reshape((-1, N, 3, self.num_heads, C // self.num_heads)).transpose((2, 0, 3, 1, 4))
q, k, v = qkv[0], qkv[1], qkv[2]
attn = (q.matmul(k.transpose((0, 1, 3, 2)))) * self.scale
attn = nn.functional.softmax(attn, axis=-1)
attn = self.attn_drop(attn)
x = (attn.matmul(v)).transpose((0, 2, 1, 3)).reshape((-1, N, C))
x = self.proj(x)
x = self.proj_drop(x)
return x
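# Shape walkthrough for Attention.forward (illustrative; assumes dim=768, num_heads=12
# and a 197-token input, i.e. 196 patches + 1 class token):
#   x:    [B, 197, 768]
#   qkv:  [B, 197, 2304] -> reshape to [B, 197, 3, 12, 64] -> transpose to [3, B, 12, 197, 64]
#   attn: q @ k^T * scale -> [B, 12, 197, 197], softmax over the last axis
#   out:  attn @ v -> [B, 12, 197, 64] -> transpose + reshape back to [B, 197, 768]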
class Block(nn.Layer):
def __init__(self,
dim,
num_heads,
mlp_ratio=4.,
qkv_bias=False,
qk_scale=None,
drop=0.,
attn_drop=0.,
drop_path=0.,
act_layer=QuickGELU,
norm_layer='nn.LayerNorm',
epsilon=1e-5):
super().__init__()
self.norm1 = eval(norm_layer)(dim, epsilon=epsilon)
self.attn = Attention(dim,
num_heads=num_heads,
qkv_bias=qkv_bias,
qk_scale=qk_scale,
attn_drop=attn_drop,
proj_drop=drop)
# NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
self.drop_path = DropPath(drop_path) if drop_path > 0. else Identity()
self.norm2 = eval(norm_layer)(dim, epsilon=epsilon)
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
def forward(self, x):
x = x + self.drop_path(self.attn(self.norm1(x)))
x = x + self.drop_path(self.mlp(self.norm2(x)))
return x
class PatchEmbed(nn.Layer):
""" Image to Patch Embedding
"""
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
super().__init__()
img_size = to_2tuple(img_size)
patch_size = to_2tuple(patch_size)
num_patches = (img_size[1] // patch_size[1]) * \
(img_size[0] // patch_size[0])
self.img_size = img_size
self.patch_size = patch_size
self.num_patches = num_patches
self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size, bias_attr=False)
def forward(self, x):
B, C, H, W = x.shape
assert H == self.img_size[0] and W == self.img_size[1], \
f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
x = self.proj(x).flatten(2).transpose((0, 2, 1))
return x
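# Worked example (illustrative): with img_size=224 and patch_size=16 the strided conv
# produces a 14x14 grid, so num_patches = (224 // 16) ** 2 = 196 and forward() returns
# a tensor of shape [B, 196, embed_dim].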
class VisionTransformer(nn.Layer):
""" Vision Transformer with support for patch input
"""
def __init__(self,
img_size=224,
patch_size=16,
in_chans=3,
class_dim=0,
embed_dim=768,
depth=12,
num_heads=12,
mlp_ratio=4,
qkv_bias=False,
qk_scale=None,
drop_rate=0.,
attn_drop_rate=0.,
drop_path_rate=0.,
norm_layer='nn.LayerNorm',
epsilon=1e-5,
**args):
super().__init__()
self.class_dim = class_dim
self.num_features = self.embed_dim = embed_dim
self.patch_embed = PatchEmbed(img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
num_patches = self.patch_embed.num_patches
scale = embed_dim**-0.5
self.class_embedding = self.create_parameter(shape=(1, 1, embed_dim), default_initializer=Normal(std=scale))
self.positional_embedding = self.create_parameter(shape=(1, num_patches + 1, embed_dim),
default_initializer=Normal(std=scale))
self.add_parameter("positional_embedding", self.positional_embedding)
self.add_parameter("class_embedding", self.class_embedding)
self.pos_drop = nn.Dropout(p=drop_rate)
dpr = np.linspace(0, drop_path_rate, depth)
self.norm_pre = eval(norm_layer)(embed_dim, epsilon=epsilon)
self.blocks = nn.LayerList([
Block(dim=embed_dim,
num_heads=num_heads,
mlp_ratio=mlp_ratio,
qkv_bias=qkv_bias,
qk_scale=qk_scale,
drop=drop_rate,
attn_drop=attn_drop_rate,
drop_path=dpr[i],
norm_layer=norm_layer,
epsilon=epsilon) for i in range(depth)
])
self.norm_post = eval(norm_layer)(embed_dim, epsilon=epsilon)
## Classifier head
#self.head = nn.Linear(embed_dim,
# class_dim) if class_dim > 0 else Identity()
trunc_normal_(self.positional_embedding)
trunc_normal_(self.class_embedding)
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight)
if isinstance(m, nn.Linear) and m.bias is not None:
zeros_(m.bias)
elif isinstance(m, nn.LayerNorm):
zeros_(m.bias)
ones_(m.weight)
def forward_features(self, x):
# B = x.shape[0]
B = paddle.shape(x)[0]
x = self.patch_embed(x)
class_embedding = self.class_embedding.expand((B, -1, -1))
x = paddle.concat((class_embedding, x), axis=1)
x = x + self.positional_embedding
x = self.pos_drop(x)
x = self.norm_pre(x)
for blk in self.blocks:
x = blk(x)
#x = self.norm_post(x[:, 0, :])
x = self.norm_post(x)
# x = self.classfy(x)
return x
def forward(self, x):
x = self.forward_features(x)
return x
def ViT_small_patch16_224(**kwargs):
model = VisionTransformer(patch_size=16,
embed_dim=768,
depth=8,
num_heads=8,
mlp_ratio=3,
qk_scale=768**-0.5,
**kwargs)
return model
def ViT_base_patch16_224(**kwargs):
model = VisionTransformer(patch_size=16,
embed_dim=768,
depth=12,
num_heads=12,
mlp_ratio=4,
qkv_bias=True,
epsilon=1e-6,
**kwargs)
return model
def ViT_base_patch16_384(**kwargs):
model = VisionTransformer(img_size=384,
patch_size=16,
embed_dim=768,
depth=12,
num_heads=12,
mlp_ratio=4,
qkv_bias=True,
epsilon=1e-6,
**kwargs)
return model
def ViT_base_patch32_384(**kwargs):
model = VisionTransformer(img_size=384,
patch_size=32,
embed_dim=768,
depth=12,
num_heads=12,
mlp_ratio=4,
qkv_bias=True,
epsilon=1e-6,
**kwargs)
return model
def ViT_base_patch32_224(**kwargs):
model = VisionTransformer(img_size=224,
patch_size=32,
embed_dim=768,
depth=12,
num_heads=12,
mlp_ratio=4,
qkv_bias=True,
epsilon=1e-6,
**kwargs)
return model
def ViT_large_patch16_224(**kwargs):
model = VisionTransformer(patch_size=16,
embed_dim=1024,
depth=24,
num_heads=16,
mlp_ratio=4,
qkv_bias=True,
epsilon=1e-6,
**kwargs)
return model
def ViT_large_patch16_384(**kwargs):
model = VisionTransformer(img_size=384,
patch_size=16,
embed_dim=1024,
depth=24,
num_heads=16,
mlp_ratio=4,
qkv_bias=True,
epsilon=1e-6,
**kwargs)
return model
def ViT_large_patch14_224(**kwargs):
model = VisionTransformer(patch_size=14,
embed_dim=1024,
depth=24,
num_heads=16,
mlp_ratio=4,
qkv_bias=True,
epsilon=1e-6,
**kwargs)
return model
def ViT_large_patch32_384(**kwargs):
model = VisionTransformer(img_size=384,
patch_size=32,
embed_dim=1024,
depth=24,
num_heads=16,
mlp_ratio=4,
qkv_bias=True,
epsilon=1e-6,
**kwargs)
return model
def ViT_huge_patch16_224(**kwargs):
model = VisionTransformer(patch_size=16, embed_dim=1280, depth=32, num_heads=16, mlp_ratio=4, **kwargs)
return model
def ViT_huge_patch32_384(**kwargs):
model = VisionTransformer(img_size=384,
patch_size=32,
embed_dim=1280,
depth=32,
num_heads=16,
mlp_ratio=4,
**kwargs)
return model
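# Usage sketch (illustrative, not part of the module): build a ViT backbone and run a
# dummy forward pass. Since the classifier head is disabled (class_dim=0 by default),
# the model returns the full normalized token sequence, class token first.
#
# import paddle
# model = ViT_base_patch16_224()
# model.eval()
# img = paddle.randn([1, 3, 224, 224])
# feats = model(img)   # shape [1, 197, 768] = [B, 1 class token + 196 patches, embed_dim]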
# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Droppath, reimplemented from https://github.com/yueatsprograms/Stochastic_Depth
"""
import paddle
import paddle.nn as nn
class DropPath(nn.Layer):
"""DropPath class"""
def __init__(self, drop_prob=None):
super(DropPath, self).__init__()
self.drop_prob = drop_prob
def drop_path(self, inputs):
"""drop path op
Args:
inputs: tensor with arbitrary shape (drop_prob and training mode come from the instance)
Returns:
output: output tensor after drop path
"""
# if prob is 0 or eval mode, return original input
if self.drop_prob == 0. or not self.training:
return inputs
keep_prob = 1 - self.drop_prob
keep_prob = paddle.to_tensor(keep_prob, dtype='float32')
shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1)
random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype)
random_tensor = random_tensor.floor() # mask
output = inputs.divide(keep_prob) * random_tensor #divide is to keep same output expectation
return output
def forward(self, inputs):
return self.drop_path(inputs)
#def main():
# tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32')
# dp = DropPath(0.5)
# out = dp(tmp)
# print(out)
#
#if __name__ == "__main__":
# main()
import collections
import copy
import math
import os
import re
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle import ParamAttr
from paddle.nn import AdaptiveAvgPool2D
from paddle.nn import AvgPool2D
from paddle.nn import BatchNorm
from paddle.nn import Conv2D
from paddle.nn import Dropout
from paddle.nn import Linear
from paddle.nn import MaxPool2D
MODEL_URLS = {
"EfficientNetB0_small":
"https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/EfficientNetB0_small_pretrained.pdparams",
"EfficientNetB0": "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/EfficientNetB0_pretrained.pdparams",
"EfficientNetB1": "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/EfficientNetB1_pretrained.pdparams",
"EfficientNetB2": "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/EfficientNetB2_pretrained.pdparams",
"EfficientNetB3": "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/EfficientNetB3_pretrained.pdparams",
"EfficientNetB4": "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/EfficientNetB4_pretrained.pdparams",
"EfficientNetB5": "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/EfficientNetB5_pretrained.pdparams",
"EfficientNetB6": "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/EfficientNetB6_pretrained.pdparams",
"EfficientNetB7": "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/EfficientNetB7_pretrained.pdparams",
}
__all__ = list(MODEL_URLS.keys())
GlobalParams = collections.namedtuple('GlobalParams', [
'batch_norm_momentum',
'batch_norm_epsilon',
'dropout_rate',
'num_classes',
'width_coefficient',
'depth_coefficient',
'depth_divisor',
'min_depth',
'drop_connect_rate',
])
BlockArgs = collections.namedtuple(
'BlockArgs',
['kernel_size', 'num_repeat', 'input_filters', 'output_filters', 'expand_ratio', 'id_skip', 'stride', 'se_ratio'])
GlobalParams.__new__.__defaults__ = (None, ) * len(GlobalParams._fields)
BlockArgs.__new__.__defaults__ = (None, ) * len(BlockArgs._fields)
def load_dygraph_pretrain(model, path=None):
if not (os.path.isdir(path) or os.path.exists(path + '.pdparams')):
raise ValueError("Model pretrain path {} does not "
"exists.".format(path))
param_state_dict = paddle.load(path + ".pdparams")
model.set_dict(param_state_dict)
return
def efficientnet_params(model_name):
""" Map EfficientNet model name to parameter coefficients. """
params_dict = {
# Coefficients: width,depth,resolution,dropout
'efficientnet-b0': (1.0, 1.0, 224, 0.2),
'efficientnet-b1': (1.0, 1.1, 240, 0.2),  # restored: standard b1 coefficients; EfficientNetB1 below would otherwise raise KeyError
'efficientnet-b2': (1.1, 1.2, 260, 0.3),
'efficientnet-b3': (1.2, 1.4, 300, 0.3),
'efficientnet-b4': (1.4, 1.8, 380, 0.4),
'efficientnet-b5': (1.6, 2.2, 456, 0.4),
'efficientnet-b6': (1.8, 2.6, 528, 0.5),
'efficientnet-b7': (2.0, 3.1, 600, 0.5),
}
return params_dict[model_name]
def efficientnet(width_coefficient=None, depth_coefficient=None, dropout_rate=0.2, drop_connect_rate=0.2):
""" Get block arguments according to parameter and coefficients. """
blocks_args = [
'r1_k3_s11_e1_i32_o16_se0.25',
'r2_k3_s22_e6_i16_o24_se0.25',
'r2_k5_s22_e6_i24_o40_se0.25',
'r3_k3_s22_e6_i40_o80_se0.25',
'r3_k5_s11_e6_i80_o112_se0.25',
'r4_k5_s22_e6_i112_o192_se0.25',
'r1_k3_s11_e6_i192_o320_se0.25',
]
blocks_args = BlockDecoder.decode(blocks_args)
global_params = GlobalParams(batch_norm_momentum=0.99,
batch_norm_epsilon=1e-3,
dropout_rate=dropout_rate,
drop_connect_rate=drop_connect_rate,
num_classes=1000,
width_coefficient=width_coefficient,
depth_coefficient=depth_coefficient,
depth_divisor=8,
min_depth=None)
return blocks_args, global_params
def get_model_params(model_name, override_params):
""" Get the block args and global params for a given model """
if model_name.startswith('efficientnet'):
w, d, _, p = efficientnet_params(model_name)
blocks_args, global_params = efficientnet(width_coefficient=w, depth_coefficient=d, dropout_rate=p)
else:
raise NotImplementedError('model name is not pre-defined: %s' % model_name)
if override_params:
global_params = global_params._replace(**override_params)
return blocks_args, global_params
def round_filters(filters, global_params):
""" Calculate and round number of filters based on depth multiplier. """
multiplier = global_params.width_coefficient
if not multiplier:
return filters
divisor = global_params.depth_divisor
min_depth = global_params.min_depth
filters *= multiplier
min_depth = min_depth or divisor
new_filters = max(min_depth, int(filters + divisor / 2) // divisor * divisor)
if new_filters < 0.9 * filters: # prevent rounding by more than 10%
new_filters += divisor
return int(new_filters)
def round_repeats(repeats, global_params):
""" Round number of filters based on depth multiplier. """
multiplier = global_params.depth_coefficient
if not multiplier:
return repeats
return int(math.ceil(multiplier * repeats))
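# Worked example (illustrative): with the b7 coefficients above (width 2.0, depth 3.1,
# depth_divisor 8), round_filters(32) = 32 * 2.0 = 64 (already a multiple of 8) and
# round_repeats(2) = ceil(2 * 3.1) = 7; this is how each b0 stage is scaled up for the
# larger variants.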
class BlockDecoder(object):
"""
Block Decoder, straight from the official TensorFlow repository.
"""
@staticmethod
def _decode_block_string(block_string):
""" Gets a block through a string notation of arguments. """
assert isinstance(block_string, str)
ops = block_string.split('_')
options = {}
for op in ops:
splits = re.split(r'(\d.*)', op)
if len(splits) >= 2:
key, value = splits[:2]
options[key] = value
# Check stride
cond_1 = ('s' in options and len(options['s']) == 1)
cond_2 = ((len(options['s']) == 2) and (options['s'][0] == options['s'][1]))
assert (cond_1 or cond_2)
return BlockArgs(kernel_size=int(options['k']),
num_repeat=int(options['r']),
input_filters=int(options['i']),
output_filters=int(options['o']),
expand_ratio=int(options['e']),
id_skip=('noskip' not in block_string),
se_ratio=float(options['se']) if 'se' in options else None,
stride=[int(options['s'][0])])
@staticmethod
def _encode_block_string(block):
"""Encodes a block to a string."""
args = [
'r%d' % block.num_repeat,
'k%d' % block.kernel_size,
's%d%d' % (block.stride[0], block.stride[0]),  # BlockArgs stores a single-element 'stride' list; 'strides' does not exist
'e%s' % block.expand_ratio,
'i%d' % block.input_filters,
'o%d' % block.output_filters
]
if 0 < block.se_ratio <= 1:
args.append('se%s' % block.se_ratio)
if block.id_skip is False:
args.append('noskip')
return '_'.join(args)
@staticmethod
def decode(string_list):
"""
Decode a list of string notations to specify blocks in the network.
string_list: list of strings, each string is a notation of block
return
list of BlockArgs namedtuples of block args
"""
assert isinstance(string_list, list)
blocks_args = []
for block_string in string_list:
blocks_args.append(BlockDecoder._decode_block_string(block_string))
return blocks_args
@staticmethod
def encode(blocks_args):
"""
Encodes a list of BlockArgs to a list of strings.
:param blocks_args: a list of BlockArgs namedtuples of block args
:return: a list of strings, each string is a notation of block
"""
block_strings = []
for block in blocks_args:
block_strings.append(BlockDecoder._encode_block_string(block))
return block_strings
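# Example (illustrative): decoding the second base block string above,
# BlockDecoder._decode_block_string('r2_k3_s22_e6_i16_o24_se0.25') yields
# BlockArgs(kernel_size=3, num_repeat=2, input_filters=16, output_filters=24,
#           expand_ratio=6, id_skip=True, stride=[2], se_ratio=0.25)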
def initial_type(name, use_bias=False):
param_attr = ParamAttr(name=name + "_weights")
if use_bias:
bias_attr = ParamAttr(name=name + "_offset")
else:
bias_attr = False
return param_attr, bias_attr
def init_batch_norm_layer(name="batch_norm"):
param_attr = ParamAttr(name=name + "_scale")
bias_attr = ParamAttr(name=name + "_offset")
return param_attr, bias_attr
def init_fc_layer(name="fc"):
param_attr = ParamAttr(name=name + "_weights")
bias_attr = ParamAttr(name=name + "_offset")
return param_attr, bias_attr
def cal_padding(img_size, stride, filter_size, dilation=1):
"""Calculate padding size."""
# out_size = max(filter_size - stride, 0)
if img_size % stride == 0:
out_size = max(filter_size - stride, 0)
else:
out_size = max(filter_size - (img_size % stride), 0)
return out_size // 2, out_size - out_size // 2
# inp_shape = {
# "b0_small": [224, 112, 112, 56, 28, 14, 14, 7],
# "b0": [224, 112, 112, 56, 28, 14, 14, 7],
# "b1": [240, 120, 120, 60, 30, 15, 15, 8],
# "b2": [260, 130, 130, 65, 33, 17, 17, 9],
# "b3": [300, 150, 150, 75, 38, 19, 19, 10],
# "b4": [380, 190, 190, 95, 48, 24, 24, 12],
# "b5": [456, 228, 228, 114, 57, 29, 29, 15],
# "b6": [528, 264, 264, 132, 66, 33, 33, 17],
# "b7": [600, 300, 300, 150, 75, 38, 38, 19]
# }
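# Note: identical to the commented reference table above except for "b5", whose input
# resolution is overridden to 256 here (presumably the resolution at which this module
# runs the b5 image tower).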
inp_shape = {
"b0_small": [224, 112, 112, 56, 28, 14, 14, 7],
"b0": [224, 112, 112, 56, 28, 14, 14, 7],
"b1": [240, 120, 120, 60, 30, 15, 15, 8],
"b2": [260, 130, 130, 65, 33, 17, 17, 9],
"b3": [300, 150, 150, 75, 38, 19, 19, 10],
"b4": [380, 190, 190, 95, 48, 24, 24, 12],
"b5": [256, 128, 128, 64, 32, 16, 16, 8],
"b6": [528, 264, 264, 132, 66, 33, 33, 17],
"b7": [600, 300, 300, 150, 75, 38, 38, 19]
}
def _drop_connect(inputs, prob, is_test):
if is_test:
output = inputs
else:
keep_prob = 1.0 - prob
inputs_shape = paddle.shape(inputs)
random_tensor = keep_prob + paddle.rand(shape=[inputs_shape[0], 1, 1, 1])
binary_tensor = paddle.floor(random_tensor)
output = paddle.multiply(inputs, binary_tensor) / keep_prob
return output
class Conv2ds(nn.Layer):
def __init__(self,
input_channels,
output_channels,
filter_size,
stride=1,
padding=0,
groups=None,
name="conv2d",
act=None,
use_bias=False,
padding_type=None,
model_name=None,
cur_stage=None):
super(Conv2ds, self).__init__()
assert act in [None, "swish", "sigmoid"]
self.act = act
param_attr, bias_attr = initial_type(name=name, use_bias=use_bias)
self.padding_type = padding_type
self.stride = stride
self.filter_size = filter_size
def get_padding(filter_size, stride=1, dilation=1):
padding = ((stride - 1) + dilation * (filter_size - 1)) // 2
return padding
inps = 1 if model_name is None and cur_stage is None else inp_shape[model_name][cur_stage]
self.need_crop = False
if padding_type == "SAME":
top_padding, bottom_padding = cal_padding(inps, stride, filter_size)
left_padding, right_padding = cal_padding(inps, stride, filter_size)
height_padding = bottom_padding
width_padding = right_padding
if top_padding != bottom_padding or left_padding != right_padding:
height_padding = top_padding + stride
width_padding = left_padding + stride
self.need_crop = True
padding = [height_padding, width_padding]
elif padding_type == "VALID":
height_padding = 0
width_padding = 0
padding = [height_padding, width_padding]
elif padding_type == "DYNAMIC":
padding = get_padding(filter_size, stride)
else:
padding = padding_type
groups = 1 if groups is None else groups
self._conv = Conv2D(
input_channels,
output_channels,
filter_size,
groups=groups,
stride=stride,
# act=act,
padding=padding,
weight_attr=param_attr,
bias_attr=bias_attr)
def forward(self, inputs):
x = self._conv(inputs)
if self.act == "swish":
x = F.swish(x)
elif self.act == "sigmoid":
x = F.sigmoid(x)
if self.need_crop:
x = x[:, :, 1:, 1:]
return x
class ConvBNLayer(nn.Layer):
def __init__(self,
input_channels,
filter_size,
output_channels,
stride=1,
num_groups=1,
padding_type="SAME",
conv_act=None,
bn_act="swish",
use_bn=True,
use_bias=False,
name=None,
conv_name=None,
bn_name=None,
model_name=None,
cur_stage=None):
super(ConvBNLayer, self).__init__()
self._conv = Conv2ds(input_channels=input_channels,
output_channels=output_channels,
filter_size=filter_size,
stride=stride,
groups=num_groups,
act=conv_act,
padding_type=padding_type,
name=conv_name,
use_bias=use_bias,
model_name=model_name,
cur_stage=cur_stage)
self.use_bn = use_bn
if use_bn is True:
bn_name = name + bn_name
param_attr, bias_attr = init_batch_norm_layer(bn_name)
self._bn = BatchNorm(num_channels=output_channels,
act=bn_act,
momentum=0.99,
epsilon=0.001,
moving_mean_name=bn_name + "_mean",
moving_variance_name=bn_name + "_variance",
param_attr=param_attr,
bias_attr=bias_attr)
def forward(self, inputs):
if self.use_bn:
x = self._conv(inputs)
x = self._bn(x)
return x
else:
return self._conv(inputs)
class ExpandConvNorm(nn.Layer):
def __init__(self, input_channels, block_args, padding_type, name=None, model_name=None, cur_stage=None):
super(ExpandConvNorm, self).__init__()
self.oup = block_args.input_filters * block_args.expand_ratio
self.expand_ratio = block_args.expand_ratio
if self.expand_ratio != 1:
self._conv = ConvBNLayer(input_channels,
1,
self.oup,
bn_act=None,
padding_type=padding_type,
name=name,
conv_name=name + "_expand_conv",
bn_name="_bn0",
model_name=model_name,
cur_stage=cur_stage)
def forward(self, inputs):
if self.expand_ratio != 1:
return self._conv(inputs)
else:
return inputs
class DepthwiseConvNorm(nn.Layer):
def __init__(self, input_channels, block_args, padding_type, name=None, model_name=None, cur_stage=None):
super(DepthwiseConvNorm, self).__init__()
self.k = block_args.kernel_size
self.s = block_args.stride
if isinstance(self.s, list) or isinstance(self.s, tuple):
self.s = self.s[0]
oup = block_args.input_filters * block_args.expand_ratio
self._conv = ConvBNLayer(input_channels,
self.k,
oup,
self.s,
num_groups=input_channels,
bn_act=None,
padding_type=padding_type,
name=name,
conv_name=name + "_depthwise_conv",
bn_name="_bn1",
model_name=model_name,
cur_stage=cur_stage)
def forward(self, inputs):
return self._conv(inputs)
class ProjectConvNorm(nn.Layer):
def __init__(self, input_channels, block_args, padding_type, name=None, model_name=None, cur_stage=None):
super(ProjectConvNorm, self).__init__()
final_oup = block_args.output_filters
self._conv = ConvBNLayer(input_channels,
1,
final_oup,
bn_act=None,
padding_type=padding_type,
name=name,
conv_name=name + "_project_conv",
bn_name="_bn2",
model_name=model_name,
cur_stage=cur_stage)
def forward(self, inputs):
return self._conv(inputs)
class SEBlock(nn.Layer):
def __init__(self,
input_channels,
num_squeezed_channels,
oup,
padding_type,
name=None,
model_name=None,
cur_stage=None):
super(SEBlock, self).__init__()
self._pool = AdaptiveAvgPool2D(1)
self._conv1 = Conv2ds(input_channels,
num_squeezed_channels,
1,
use_bias=True,
padding_type=padding_type,
act="swish",
name=name + "_se_reduce")
self._conv2 = Conv2ds(
num_squeezed_channels,
oup,
1,
# act="sigmoid",
act=None,
use_bias=True,
padding_type=padding_type,
name=name + "_se_expand")
def forward(self, inputs):
x = self._pool(inputs)
x = self._conv1(x)
x = self._conv2(x)
out = paddle.multiply(inputs, F.sigmoid(x))
return out
class MbConvBlock(nn.Layer):
def __init__(self,
input_channels,
block_args,
padding_type,
use_se,
name=None,
drop_connect_rate=None,
model_name=None,
cur_stage=None):
super(MbConvBlock, self).__init__()
oup = block_args.input_filters * block_args.expand_ratio
self.block_args = block_args
self.has_se = use_se and (block_args.se_ratio is not None) and (0 < block_args.se_ratio <= 1)
self.id_skip = block_args.id_skip
self.expand_ratio = block_args.expand_ratio
self.drop_connect_rate = drop_connect_rate
if self.expand_ratio != 1:
self._ecn = ExpandConvNorm(input_channels,
block_args,
padding_type=padding_type,
name=name,
model_name=model_name,
cur_stage=cur_stage)
self._dcn = DepthwiseConvNorm(input_channels * block_args.expand_ratio,
block_args,
padding_type=padding_type,
name=name,
model_name=model_name,
cur_stage=cur_stage)
if self.has_se:
num_squeezed_channels = max(1, int(block_args.input_filters * block_args.se_ratio))
self._se = SEBlock(input_channels * block_args.expand_ratio,
num_squeezed_channels,
oup,
padding_type=padding_type,
name=name,
model_name=model_name,
cur_stage=cur_stage)
self._pcn = ProjectConvNorm(input_channels * block_args.expand_ratio,
block_args,
padding_type=padding_type,
name=name,
model_name=model_name,
cur_stage=cur_stage)
def forward(self, inputs):
x = inputs
if self.expand_ratio != 1:
x = self._ecn(x)
x = F.swish(x)
x = self._dcn(x)
x = F.swish(x)
if self.has_se:
x = self._se(x)
x = self._pcn(x)
if self.id_skip and \
self.block_args.stride == 1 and \
self.block_args.input_filters == self.block_args.output_filters:
if self.drop_connect_rate:
x = _drop_connect(x, self.drop_connect_rate, not self.training)
x = paddle.add(x, inputs)
return x
class ConvStemNorm(nn.Layer):
def __init__(self, input_channels, padding_type, _global_params, name=None, model_name=None, cur_stage=None):
super(ConvStemNorm, self).__init__()
output_channels = round_filters(32, _global_params)
self._conv = ConvBNLayer(input_channels,
filter_size=3,
output_channels=output_channels,
stride=2,
bn_act=None,
padding_type=padding_type,
name="",
conv_name="_conv_stem",
bn_name="_bn0",
model_name=model_name,
cur_stage=cur_stage)
def forward(self, inputs):
return self._conv(inputs)
class ExtractFeatures(nn.Layer):
def __init__(self, input_channels, _block_args, _global_params, padding_type, use_se, model_name=None):
super(ExtractFeatures, self).__init__()
self._global_params = _global_params
self._conv_stem = ConvStemNorm(input_channels,
padding_type=padding_type,
_global_params=_global_params,
model_name=model_name,
cur_stage=0)
self.block_args_copy = copy.deepcopy(_block_args)
idx = 0
block_size = 0
for block_arg in self.block_args_copy:
block_arg = block_arg._replace(input_filters=round_filters(block_arg.input_filters, _global_params),
output_filters=round_filters(block_arg.output_filters, _global_params),
num_repeat=round_repeats(block_arg.num_repeat, _global_params))
block_size += 1
for _ in range(block_arg.num_repeat - 1):
block_size += 1
self.conv_seq = []
cur_stage = 1
for block_args in _block_args:
block_args = block_args._replace(input_filters=round_filters(block_args.input_filters, _global_params),
output_filters=round_filters(block_args.output_filters, _global_params),
num_repeat=round_repeats(block_args.num_repeat, _global_params))
drop_connect_rate = self._global_params.drop_connect_rate
if drop_connect_rate:
drop_connect_rate *= float(idx) / block_size
_mc_block = self.add_sublayer(
"_blocks." + str(idx) + ".",
MbConvBlock(block_args.input_filters,
block_args=block_args,
padding_type=padding_type,
use_se=use_se,
name="_blocks." + str(idx) + ".",
drop_connect_rate=drop_connect_rate,
model_name=model_name,
cur_stage=cur_stage))
self.conv_seq.append(_mc_block)
idx += 1
if block_args.num_repeat > 1:
block_args = block_args._replace(input_filters=block_args.output_filters, stride=1)
for _ in range(block_args.num_repeat - 1):
drop_connect_rate = self._global_params.drop_connect_rate
if drop_connect_rate:
drop_connect_rate *= float(idx) / block_size
_mc_block = self.add_sublayer(
"block." + str(idx) + ".",
MbConvBlock(block_args.input_filters,
block_args,
padding_type=padding_type,
use_se=use_se,
name="_blocks." + str(idx) + ".",
drop_connect_rate=drop_connect_rate,
model_name=model_name,
cur_stage=cur_stage))
self.conv_seq.append(_mc_block)
idx += 1
cur_stage += 1
def forward(self, inputs):
x = self._conv_stem(inputs)
x = F.swish(x)
for _mc_block in self.conv_seq:
x = _mc_block(x)
return x
class EfficientNet(nn.Layer):
def __init__(self, name="b0", padding_type="SAME", override_params=None, use_se=True, class_num=768):
super(EfficientNet, self).__init__()
model_name = 'efficientnet-' + name
self.name = name
self._block_args, self._global_params = get_model_params(model_name, override_params)
self.padding_type = padding_type
self.use_se = use_se
self._ef = ExtractFeatures(3,
self._block_args,
self._global_params,
self.padding_type,
self.use_se,
model_name=self.name)
output_channels = round_filters(1280, self._global_params)
if name == "b0_small" or name == "b0" or name == "b1":
oup = 320
elif name == "b2":
oup = 352
elif name == "b3":
oup = 384
elif name == "b4":
oup = 448
elif name == "b5":
oup = 512
elif name == "b6":
oup = 576
elif name == "b7":
oup = 640
self._conv = ConvBNLayer(oup,
1,
output_channels,
bn_act="swish",
padding_type=self.padding_type,
name="",
conv_name="_conv_head",
bn_name="_bn1",
model_name=self.name,
cur_stage=7)
self._pool = AdaptiveAvgPool2D(1)
if self._global_params.dropout_rate:
self._drop = Dropout(p=self._global_params.dropout_rate, mode="upscale_in_train")
# param_attr, bias_attr = init_fc_layer("_fc")
self._fc = Linear(output_channels,
class_num,
weight_attr=paddle.ParamAttr(name='image_trans_w'),
bias_attr=paddle.ParamAttr(name='image_trans_b'))
def forward(self, inputs):
x = self._ef(inputs)
x = self._conv(x)
x = self._pool(x)
if self._global_params.dropout_rate:
x = self._drop(x)
x = paddle.squeeze(x, axis=[2, 3])
x = self._fc(x)
x = F.tanh(x)
return x
def _load_pretrained(pretrained, model, model_url, use_ssld=False):
if pretrained is False:
pass
elif isinstance(pretrained, str):
load_dygraph_pretrain(model, pretrained)
else:
raise RuntimeError("pretrained type is not available. Please use `string` type.")
def EfficientNetB0_small(padding_type='DYNAMIC',
override_params=None,
use_se=False,
pretrained=False,
use_ssld=False,
**kwargs):
model = EfficientNet(name='b0', padding_type=padding_type, override_params=override_params, use_se=use_se, **kwargs)
_load_pretrained(pretrained, model, MODEL_URLS["EfficientNetB0_small"])
return model
def EfficientNetB0(padding_type='SAME', override_params=None, use_se=True, pretrained=False, use_ssld=False, **kwargs):
model = EfficientNet(name='b0', padding_type=padding_type, override_params=override_params, use_se=use_se, **kwargs)
_load_pretrained(pretrained, model, MODEL_URLS["EfficientNetB0"])
return model
def EfficientNetB1(padding_type='SAME', override_params=None, use_se=True, pretrained=False, use_ssld=False, **kwargs):
model = EfficientNet(name='b1', padding_type=padding_type, override_params=override_params, use_se=use_se, **kwargs)
_load_pretrained(pretrained, model, MODEL_URLS["EfficientNetB1"])
return model
def EfficientNetB2(padding_type='SAME', override_params=None, use_se=True, pretrained=False, use_ssld=False, **kwargs):
model = EfficientNet(name='b2', padding_type=padding_type, override_params=override_params, use_se=use_se, **kwargs)
_load_pretrained(pretrained, model, MODEL_URLS["EfficientNetB2"])
return model
def EfficientNetB3(padding_type='SAME', override_params=None, use_se=True, pretrained=False, use_ssld=False, **kwargs):
model = EfficientNet(name='b3', padding_type=padding_type, override_params=override_params, use_se=use_se, **kwargs)
_load_pretrained(pretrained, model, MODEL_URLS["EfficientNetB3"])
return model
def EfficientNetB4(padding_type='SAME', override_params=None, use_se=True, pretrained=False, use_ssld=False, **kwargs):
model = EfficientNet(name='b4', padding_type=padding_type, override_params=override_params, use_se=use_se, **kwargs)
_load_pretrained(pretrained, model, MODEL_URLS["EfficientNetB4"])
return model
def EfficientNetB5(padding_type='SAME', override_params=None, use_se=True, pretrained=False, use_ssld=False, **kwargs):
model = EfficientNet(name='b5', padding_type=padding_type, override_params=override_params, use_se=use_se, **kwargs)
_load_pretrained(pretrained, model, MODEL_URLS["EfficientNetB5"])
return model
def EfficientNetB6(padding_type='SAME', override_params=None, use_se=True, pretrained=False, use_ssld=False, **kwargs):
model = EfficientNet(name='b6', padding_type=padding_type, override_params=override_params, use_se=use_se, **kwargs)
_load_pretrained(pretrained, model, MODEL_URLS["EfficientNetB6"])
return model
def EfficientNetB7(padding_type='SAME', override_params=None, use_se=True, pretrained=False, use_ssld=False, **kwargs):
model = EfficientNet(name='b7', padding_type=padding_type, override_params=override_params, use_se=use_se, **kwargs)
_load_pretrained(pretrained, model, MODEL_URLS["EfficientNetB7"])
return model
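# Usage sketch (illustrative, not part of the module): the image tower maps an image to
# a class_num-dimensional embedding (768 by default, matching the ERNIE text embedding
# size), squashed through tanh. Given the inp_shape table above, the b5 variant here is
# configured for 256x256 inputs.
#
# import paddle
# model = EfficientNetB5()
# model.eval()
# emb = model(paddle.randn([1, 3, 256, 256]))   # shape [1, 768], values in (-1, 1)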
# -*- coding: utf-8 -*-
"""
ERNIE network architecture
"""
import logging
import re
import time
import paddle
from paddle import nn
from paddle.nn import functional as F
ACT_DICT = {
'relu': nn.ReLU,
'gelu': nn.GELU,
}
class ErnieModel(nn.Layer):
""" ernie model """
def __init__(self, cfg, name=''):
"""
Fundamental pretrained Ernie model
"""
nn.Layer.__init__(self)
self.cfg = cfg
d_model = cfg['hidden_size']
d_emb = cfg.get('emb_size', cfg['hidden_size'])
d_vocab = cfg['vocab_size']
d_pos = cfg['max_position_embeddings']
# d_sent = cfg.get("sent_type_vocab_size", 4) or cfg.get('type_vocab_size', 4)
if cfg.get('sent_type_vocab_size'):
d_sent = cfg['sent_type_vocab_size']
else:
d_sent = cfg.get('type_vocab_size', 2)
self.n_head = cfg['num_attention_heads']
self.return_additional_info = cfg.get('return_additional_info', False)
self.initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
self.ln = _build_ln(d_model, name=append_name(name, 'pre_encoder'))
self.word_emb = nn.Embedding(d_vocab,
d_emb,
weight_attr=paddle.ParamAttr(name=append_name(name, 'word_embedding'),
initializer=self.initializer))
self.pos_emb = nn.Embedding(d_pos,
d_emb,
weight_attr=paddle.ParamAttr(name=append_name(name, 'pos_embedding'),
initializer=self.initializer))
# self.sent_emb = nn.Embedding(
# d_sent,
# d_emb,
# weight_attr=paddle.ParamAttr(name=append_name(name, 'sent_embedding'), initializer=self.initializer))
self._use_sent_id = cfg.get('use_sent_id', True)
self._use_sent_id = False  # hard-coded override: sentence-type embeddings are disabled in this module
if self._use_sent_id:
self.sent_emb = nn.Embedding(d_sent,
d_emb,
weight_attr=paddle.ParamAttr(name=append_name(name, 'sent_embedding'),
initializer=self.initializer))
self._use_task_id = cfg.get('use_task_id', False)
self._use_task_id = False  # hard-coded override: task-type embeddings are disabled in this module
if self._use_task_id:
self._task_types = cfg.get('task_type_vocab_size', 3)
logging.info('using task_id, #task_types:{}'.format(self._task_types))
self.task_emb = nn.Embedding(self._task_types,
d_emb,
weight_attr=paddle.ParamAttr(name=append_name(name, 'task_embedding'),
initializer=self.initializer))
prob = cfg['hidden_dropout_prob']
self.dropout = nn.Dropout(p=prob)
self.encoder_stack = ErnieEncoderStack(cfg, append_name(name, 'encoder'))
if cfg.get('has_pooler', True):
self.pooler = _build_linear(cfg['hidden_size'], cfg['hidden_size'], append_name(name, 'pooled_fc'),
self.initializer)
else:
self.pooler = None
self.key_tag = None
self._checkpoints = []
self.train()
def get_checkpoints(self):
"""return checkpoints for recomputing"""
# recompute checkpoints
return self._checkpoints
# FIXME:remove this
def eval(self):
""" eval """
if paddle.in_dynamic_mode():
super(ErnieModel, self).eval()
self.training = False
for l in self.sublayers():
l.training = False
return self
def train(self):
""" train """
if paddle.in_dynamic_mode():
super(ErnieModel, self).train()
self.training = True
for l in self.sublayers():
l.training = True
return self
def forward(self,
src_ids,
sent_ids=None,
pos_ids=None,
input_mask=None,
task_ids=None,
attn_bias=None,
past_cache=None,
use_causal_mask=False):
"""
Args:
src_ids (`Variable` of shape `[batch_size, seq_len]`):
Indices of input sequence tokens in the vocabulary.
sent_ids (optional, `Variable` of shape `[batch_size, seq_len]`):
aka token_type_ids, Segment token indices to indicate first and second portions of the inputs.
if None, assume all tokens come from `segment_a`
pos_ids(optional, `Variable` of shape `[batch_size, seq_len]`):
Indices of positions of each input sequence tokens in the position embeddings.
input_mask(optional `Variable` of shape `[batch_size, seq_len]`):
Mask to avoid performing attention on the padding token indices of the encoder input.
task_ids(optional `Variable` of shape `[batch_size, seq_len]`):
task type for pre_train task type
attn_bias(optional, `Variable` of shape `[batch_size, seq_len, seq_len]` or False):
3D version of `input_mask`; if set, overrides `input_mask`. If set to `False`, no attention mask is applied.
past_cache(optional, tuple of two lists: cached key and cached value,
each is a list of `Variable`s of shape `[batch_size, seq_len, hidden_size]`):
cached key/value tensors that will be concatenated with the newly generated key/value when performing self attention.
if set, `attn_bias` should not be None.
Returns:
pooled (`Variable` of shape `[batch_size, hidden_size]`):
output logits of pooler classifier
encoded(`Variable` of shape `[batch_size, seq_len, hidden_size]`):
output logits of transformer stack
info (Dictionary):
additional intermediate info, includes: all hidden states, k/v caches.
"""
assert len(src_ids.shape) == 2, 'expect src_ids.shape = [batch, sequence], got %s' % (repr(src_ids.shape))
assert attn_bias is not None if past_cache else True, 'if `past_cache` is specified, `attn_bias` must not be None'
d_seqlen = paddle.shape(src_ids)[1]
if pos_ids is None:
pos_ids = paddle.arange(0, d_seqlen, 1, dtype='int32').reshape([1, -1]).cast('int64')
if attn_bias is None:
if input_mask is None:
input_mask = paddle.cast(src_ids != 0, 'float32')
assert len(input_mask.shape) == 2
input_mask = input_mask.unsqueeze(-1)
attn_bias = input_mask.matmul(input_mask, transpose_y=True)
if use_causal_mask:
sequence = paddle.reshape(paddle.arange(0, d_seqlen, 1, dtype='float32') + 1., [1, 1, -1, 1])
causal_mask = (sequence.matmul(1. / sequence, transpose_y=True) >= 1.).cast('float32')
attn_bias *= causal_mask
else:
assert len(attn_bias.shape) == 3, 'expect attn_bias to be rank 3, got %r' % attn_bias.shape
attn_bias = (1. - attn_bias) * -10000.0
attn_bias = attn_bias.unsqueeze(1).tile([1, self.n_head, 1, 1]) # avoid broadcast =_=
if sent_ids is None:
sent_ids = paddle.zeros_like(src_ids)
src_embedded = self.word_emb(src_ids)
pos_embedded = self.pos_emb(pos_ids)
# sent_embedded = self.sent_emb(sent_ids)
# embedded = src_embedded + pos_embedded + sent_embedded
embedded = src_embedded + pos_embedded
if self._use_sent_id:
sent_embedded = self.sent_emb(sent_ids)
embedded = embedded + sent_embedded
if self._use_task_id:
task_embeded = self.task_emb(task_ids)
embedded = embedded + task_embeded
self._checkpoints.append(embedded.name)
embedded = self.dropout(self.ln(embedded))
(encoded, hidden_list, cache_list, checkpoint_name) = self.encoder_stack(embedded, attn_bias,
past_cache=past_cache, \
key_tag=self.key_tag)
self._checkpoints.extend(checkpoint_name)
if self.pooler is not None:
pooled = F.tanh(self.pooler(encoded[:, 0, :]))
else:
pooled = None
additional_info = {
'hiddens': hidden_list,
'caches': cache_list,
}
if self.return_additional_info:
return pooled, encoded, additional_info
return pooled, encoded
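# Usage sketch (illustrative; the cfg values below are hypothetical, the real ones come
# from the module's ernie_config.json):
#
# cfg = {'hidden_size': 768, 'vocab_size': 18000, 'max_position_embeddings': 512,
#        'num_attention_heads': 12, 'num_hidden_layers': 12, 'hidden_act': 'gelu',
#        'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1,
#        'initializer_range': 0.02}
# model = ErnieModel(cfg)
# src_ids = paddle.randint(1, 18000, [2, 16])   # [batch, seq_len]; id 0 is padding
# pooled, encoded = model(src_ids)              # shapes [2, 768] and [2, 16, 768]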
class ErnieEncoderStack(nn.Layer):
""" ernie encoder stack """
def __init__(self, cfg, name=None):
super(ErnieEncoderStack, self).__init__()
n_layers = cfg['num_hidden_layers']
self.block = nn.LayerList([ErnieBlock(cfg, append_name(name, 'layer_%d' % i)) for i in range(n_layers)])
def forward(self, inputs, attn_bias=None, past_cache=None, key_tag=None):
""" forward function """
if past_cache is not None:
assert isinstance(
past_cache,
tuple), 'unknown type of `past_cache`, expect tuple or list. got %s' % repr(type(past_cache))
past_cache = list(zip(*past_cache))
else:
past_cache = [None] * len(self.block)
cache_list_k, cache_list_v, hidden_list = [], [], [inputs]
checkpoint_name = []
for b, p in zip(self.block, past_cache):
inputs, cache = b(inputs, attn_bias=attn_bias, past_cache=p, key_tag=key_tag)
cache_k, cache_v = cache
cache_list_k.append(cache_k)
cache_list_v.append(cache_v)
hidden_list.append(inputs)
checkpoint_name.append(inputs.name)
return [inputs, hidden_list, (cache_list_k, cache_list_v), checkpoint_name]
class ErnieBlock(nn.Layer):
""" ernie block class """
def __init__(self, cfg, name=None):
super(ErnieBlock, self).__init__()
d_model = cfg['hidden_size']
self.attn = AttentionLayer(cfg, name=append_name(name, 'multi_head_att'))
self.ln1 = _build_ln(d_model, name=append_name(name, 'post_att'))
self.ffn = PositionWiseFeedForwardLayer(cfg, name=append_name(name, 'ffn'))
self.ln2 = _build_ln(d_model, name=append_name(name, 'post_ffn'))
prob = cfg.get('intermediate_dropout_prob', cfg['hidden_dropout_prob'])
self.dropout = nn.Dropout(p=prob)
def forward(self, inputs, attn_bias=None, past_cache=None, key_tag=None):
""" forward """
attn_out, cache = self.attn(inputs, inputs, inputs, attn_bias, past_cache=past_cache,
key_tag=key_tag) # self attention
attn_out = self.dropout(attn_out)
hidden = attn_out + inputs
hidden = self.ln1(hidden) # dropout/ add/ norm
ffn_out = self.ffn(hidden)
ffn_out = self.dropout(ffn_out)
hidden = ffn_out + hidden
hidden = self.ln2(hidden)
return hidden, cache
class AttentionLayer(nn.Layer):
""" attention layer """
def __init__(self, cfg, name=None):
super(AttentionLayer, self).__init__()
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
d_model = cfg['hidden_size']
n_head = cfg['num_attention_heads']
# assert d_model % n_head == 0
d_model_q = cfg.get('query_hidden_size_per_head', d_model // n_head) * n_head
d_model_v = cfg.get('value_hidden_size_per_head', d_model // n_head) * n_head
self.n_head = n_head
self.d_key = d_model_q // n_head
self.q = _build_linear(d_model, d_model_q, append_name(name, 'query_fc'), initializer)
self.k = _build_linear(d_model, d_model_q, append_name(name, 'key_fc'), initializer)
self.v = _build_linear(d_model, d_model_v, append_name(name, 'value_fc'), initializer)
self.o = _build_linear(d_model_v, d_model, append_name(name, 'output_fc'), initializer)
self.layer_num = int(re.findall(r"\d+", name)[0])
# self.dropout = nn.Dropout(p=cfg['attention_probs_dropout_prob'])
self.dropout_prob = cfg['attention_probs_dropout_prob']
self.dropout = nn.Dropout(p=self.dropout_prob)
def forward(self, queries, keys, values, attn_bias, past_cache, key_tag=None):
""" layer forward function """
assert len(queries.shape) == len(keys.shape) == len(values.shape) == 3
# bsz, q_len, q_dim = queries.shape
# bsz, k_len, k_dim = keys.shape
# bsz, v_len, v_dim = values.shape
# assert k_len == v_len
q = self.q(queries)
k = self.k(keys)
v = self.v(values)
cache = (k, v)
if past_cache is not None:
cached_k, cached_v = past_cache
k = paddle.concat([cached_k, k], 1)
v = paddle.concat([cached_v, v], 1)
# [batch, head, seq, dim]
q = q.reshape([0, 0, self.n_head, q.shape[-1] // self.n_head]).transpose([0, 2, 1, 3])
# [batch, head, seq, dim]
k = k.reshape([0, 0, self.n_head, k.shape[-1] // self.n_head]).transpose([0, 2, 1, 3])
# [batch, head, seq, dim]
v = v.reshape([0, 0, self.n_head, v.shape[-1] // self.n_head]).transpose([0, 2, 1, 3])
q = q.scale(self.d_key**-0.5)
score = q.matmul(k, transpose_y=True)
if attn_bias is not None:
score += attn_bias
score = F.softmax(score)
score = self.dropout(score)
out = score.matmul(v)
out = out.transpose([0, 2, 1, 3])
out = out.reshape([0, 0, out.shape[2] * out.shape[3]])
out = self.o(out)
return out, cache
class PositionWiseFeedForwardLayer(nn.Layer):
""" post wise feed forward layer """
def __init__(self, cfg, name=None):
super(PositionWiseFeedForwardLayer, self).__init__()
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
d_model = cfg['hidden_size']
d_ffn = cfg.get('intermediate_size', 4 * d_model)
self.act = ACT_DICT[cfg['hidden_act']]()
self.i = _build_linear(d_model, d_ffn, append_name(name, 'fc_0'), initializer)
self.o = _build_linear(d_ffn, d_model, append_name(name, 'fc_1'), initializer)
prob = cfg.get('intermediate_dropout_prob', 0.)
self.dropout = nn.Dropout(p=prob)
def forward(self, inputs):
""" forward """
hidden = self.act(self.i(inputs))
hidden = self.dropout(hidden)
out = self.o(hidden)
return out
def _build_linear(n_in, n_out, name, init):
"""
"""
return nn.Linear(n_in,
n_out,
weight_attr=paddle.ParamAttr(name='%s.w_0' % name if name is not None else None, initializer=init),
bias_attr='%s.b_0' % name if name is not None else None)
def _build_ln(n_in, name):
"""
"""
return nn.LayerNorm(normalized_shape=n_in,
weight_attr=paddle.ParamAttr(name='%s_layer_norm_scale' % name if name is not None else None,
initializer=nn.initializer.Constant(1.)),
bias_attr=paddle.ParamAttr(name='%s_layer_norm_bias' % name if name is not None else None,
initializer=nn.initializer.Constant(0.)))
def append_name(name, postfix):
""" append name with postfix """
if name is None:
ret = None
elif name == '':
ret = postfix
else:
ret = '%s_%s' % (name, postfix)
return ret
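# Example (illustrative): append_name('ernie', 'encoder') -> 'ernie_encoder',
# append_name('', 'pooled_fc') -> 'pooled_fc', append_name(None, 'x') -> None.
# Combined with _build_linear/_build_ln above, this produces parameter names such as
# 'encoder_layer_0_multi_head_att_query_fc.w_0', which must match the names stored in
# the pretrained checkpoint.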
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import json
import logging
import math
import six
if six.PY2:
from pathlib2 import Path
else:
from pathlib import Path
import numpy as np
import paddle as P
from paddle import nn
from paddle.nn import functional as F
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.transformers.file_utils import _fetch_from_remote, add_docstring
log = logging.getLogger(__name__)
ACT_DICT = {
'relu': nn.ReLU,
'gelu': nn.GELU,
}
def _get_rel_pos_bias(seq_len, max_len=128, num_buckets=32, bidirectional=True, reset=True):
#max_len = 520
pos = np.array(range(seq_len))
rel_pos = pos[:, None] - pos[None, :]
ret = 0
n = -rel_pos
if bidirectional:
num_buckets //= 2
ret += (n < 0).astype('int32') * num_buckets # mtf.to_int32(mtf.less(n, 0)) * num_buckets
n = np.abs(n)
else:
n = np.maximum(n, np.zeros_like(n))  # elementwise clamp; np.max would misread the second argument as an axis
# now n is in the range [0, inf)
# half of the buckets are for exact increments in positions
max_exact = num_buckets // 2
is_small = n < max_exact
# The other half of the buckets are for logarithmically bigger bins in positions up to max_distance
val_if_large = max_exact + (np.log(n.astype('float32') / max_exact) / math.log(max_len / max_exact) *
(num_buckets - max_exact)).astype('int32')
tmp = np.full_like(val_if_large, num_buckets - 1)
val_if_large = np.where(val_if_large < tmp, val_if_large, tmp)
ret += np.where(is_small, n, val_if_large)
if reset:
num_buckets *= 2
ret[:, 0] = num_buckets
ret[0, :] = num_buckets // 2
return np.array(ret).reshape([seq_len, seq_len]).astype("int64")
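# Example (illustrative): _get_rel_pos_bias(seq_len) returns a [seq_len, seq_len] int64
# matrix of relative-position bucket ids, T5-style: half the buckets encode each
# direction, small offsets get exact buckets, larger ones are log-spaced up to max_len;
# with reset=True the first row/column are pinned to dedicated buckets for the leading
# ([CLS]) position.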
def _build_linear(n_in, n_out, name, init):
return nn.Linear(
n_in,
n_out,
weight_attr=P.ParamAttr(name='%s.w_0' % name if name is not None else None, initializer=init),
bias_attr='%s.b_0' % name if name is not None else None,
)
def _build_ln(n_in, name):
return nn.LayerNorm(
normalized_shape=n_in,
weight_attr=P.ParamAttr(name='%s_layer_norm_scale' % name if name is not None else None,
initializer=nn.initializer.Constant(1.)),
bias_attr=P.ParamAttr(name='%s_layer_norm_bias' % name if name is not None else None,
initializer=nn.initializer.Constant(0.)),
)
def append_name(name, postfix):
if name is None:
ret = None
elif name == '':
ret = postfix
else:
ret = '%s_%s' % (name, postfix)
return ret
class AttentionLayer(nn.Layer):
def __init__(self, cfg, name=None):
super(AttentionLayer, self).__init__()
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
d_model = cfg['hidden_size']
n_head = cfg['num_attention_heads']
assert d_model % n_head == 0
d_model_q = cfg.get('query_hidden_size_per_head', d_model // n_head) * n_head
d_model_v = cfg.get('value_hidden_size_per_head', d_model // n_head) * n_head
self.n_head = n_head
self.d_key = d_model_q // n_head
self.q = _build_linear(d_model, d_model_q, append_name(name, 'query_fc'), initializer)
self.k = _build_linear(d_model, d_model_q, append_name(name, 'key_fc'), initializer)
self.v = _build_linear(d_model, d_model_v, append_name(name, 'value_fc'), initializer)
self.o = _build_linear(d_model_v, d_model, append_name(name, 'output_fc'), initializer)
self.dropout = nn.Dropout(p=cfg['attention_probs_dropout_prob'])
def forward(self, queries, keys, values, attn_bias, past_cache):
assert len(queries.shape) == len(keys.shape) == len(values.shape) == 3
#bsz, q_len, q_dim = queries.shape
#bsz, k_len, k_dim = keys.shape
#bsz, v_len, v_dim = values.shape
#assert k_len == v_len
q = self.q(queries)
k = self.k(keys)
v = self.v(values)
cache = (k, v)
if past_cache is not None:
cached_k, cached_v = past_cache
k = P.concat([cached_k, k], 1)
v = P.concat([cached_v, v], 1)
q = q.reshape([0, 0, self.n_head, q.shape[-1] // self.n_head]).transpose([0, 2, 1, 3]) #[batch, head, seq, dim]
k = k.reshape([0, 0, self.n_head, k.shape[-1] // self.n_head]).transpose([0, 2, 1, 3]) #[batch, head, seq, dim]
v = v.reshape([0, 0, self.n_head, v.shape[-1] // self.n_head]).transpose([0, 2, 1, 3]) #[batch, head, seq, dim]
q = q.scale(self.d_key**-0.5)
score = q.matmul(k, transpose_y=True)
if attn_bias is not None:
score += attn_bias
score = F.softmax(score)
score = self.dropout(score)
out = score.matmul(v).transpose([0, 2, 1, 3])
out = out.reshape([0, 0, out.shape[2] * out.shape[3]])
out = self.o(out)
return out, cache
class PositionwiseFeedForwardLayer(nn.Layer):
def __init__(self, cfg, name=None):
super(PositionwiseFeedForwardLayer, self).__init__()
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
d_model = cfg['hidden_size']
d_ffn = cfg.get('intermediate_size', 4 * d_model)
self.act = ACT_DICT[cfg['hidden_act']]()
self.i = _build_linear(
d_model,
d_ffn,
append_name(name, 'fc_0'),
initializer,
)
self.o = _build_linear(d_ffn, d_model, append_name(name, 'fc_1'), initializer)
prob = cfg.get('intermediate_dropout_prob', 0.)
self.dropout = nn.Dropout(p=prob)
def forward(self, inputs):
hidden = self.act(self.i(inputs))
hidden = self.dropout(hidden)
out = self.o(hidden)
return out
class ErnieBlock(nn.Layer):
def __init__(self, cfg, name=None):
super(ErnieBlock, self).__init__()
d_model = cfg['hidden_size']
self.attn = AttentionLayer(cfg, name=append_name(name, 'multi_head_att'))
self.ln1 = _build_ln(d_model, name=append_name(name, 'post_att'))
self.ffn = PositionwiseFeedForwardLayer(cfg, name=append_name(name, 'ffn'))
self.ln2 = _build_ln(d_model, name=append_name(name, 'post_ffn'))
prob = cfg.get('intermediate_dropout_prob', cfg['hidden_dropout_prob'])
self.dropout = nn.Dropout(p=prob)
def forward(self, inputs, attn_bias=None, past_cache=None):
attn_out, cache = self.attn(inputs, inputs, inputs, attn_bias, past_cache=past_cache) #self attn
attn_out = self.dropout(attn_out)
hidden = attn_out + inputs
hidden = self.ln1(hidden) # dropout/ add/ norm
ffn_out = self.ffn(hidden)
ffn_out = self.dropout(ffn_out)
hidden = ffn_out + hidden
hidden = self.ln2(hidden)
return hidden, cache
class ErnieEncoderStack(nn.Layer):
def __init__(self, cfg, name=None):
super(ErnieEncoderStack, self).__init__()
n_layers = cfg['num_hidden_layers']
self.block = nn.LayerList([ErnieBlock(cfg, append_name(name, 'layer_%d' % i)) for i in range(n_layers)])
def forward(self, inputs, attn_bias=None, past_cache=None):
if past_cache is not None:
assert isinstance(
past_cache,
tuple), 'unknown type of `past_cache`, expect tuple or list. got %s' % repr(type(past_cache))
past_cache = list(zip(*past_cache))
else:
past_cache = [None] * len(self.block)
cache_list_k, cache_list_v, hidden_list = [], [], [inputs]
for b, p in zip(self.block, past_cache):
inputs, cache = b(inputs, attn_bias=attn_bias, past_cache=p)
cache_k, cache_v = cache
cache_list_k.append(cache_k)
cache_list_v.append(cache_v)
hidden_list.append(inputs)
return inputs, hidden_list, (cache_list_k, cache_list_v)
class PretrainedModel(object):
bce = 'https://ernie-github.cdn.bcebos.com/'
resource_map = {
'ernie-1.0': bce + 'model-ernie1.0.1.tar.gz',
'ernie-2.0-en': bce + 'model-ernie2.0-en.1.tar.gz',
'ernie-2.0-large-en': bce + 'model-ernie2.0-large-en.1.tar.gz',
'ernie-tiny': bce + 'model-ernie_tiny.1.tar.gz',
'ernie-gram-zh': bce + 'model-ernie-gram-zh.1.tar.gz',
'ernie-gram-en': bce + 'model-ernie-gram-en.1.tar.gz',
}
@classmethod
def from_pretrained(cls, pretrain_dir_or_url, force_download=False, **kwargs):
if not Path(pretrain_dir_or_url).exists() and str(pretrain_dir_or_url) in cls.resource_map:
url = cls.resource_map[str(pretrain_dir_or_url)]
log.info('get pretrain dir from %s' % url)
pretrain_dir = _fetch_from_remote(url, force_download)
else:
log.info('pretrain dir %s not in %s, read from local' % (pretrain_dir_or_url, repr(cls.resource_map)))
pretrain_dir = Path(pretrain_dir_or_url)
if not pretrain_dir.exists():
raise ValueError('pretrain dir not found: %s, optional: %s' % (pretrain_dir, cls.resource_map.keys()))
state_dict_path = pretrain_dir / 'saved_weights.pdparams'
config_path = pretrain_dir / 'ernie_config.json'
if not config_path.exists():
raise ValueError('config path not found: %s' % config_path)
name_prefix = kwargs.pop('name', None)
cfg_dict = dict(json.loads(config_path.open().read()), **kwargs)
model = cls(cfg_dict, name=name_prefix)
log.info('loading pretrained model from %s' % pretrain_dir)
#param_path = pretrain_dir / 'params'
#if os.path.exists(param_path):
# raise NotImplementedError()
# log.debug('load pretrained weight from program state')
# F.io.load_program_state(param_path) #buggy in dygraph.gurad, push paddle to fix
if state_dict_path.exists():
m = P.load(str(state_dict_path))
for k, v in model.state_dict().items():
if k not in m:
log.warning('param:%s not set in pretrained model, skip' % k)
m[k] = v # FIXME: no need to do this in the future
model.set_state_dict(m)
else:
raise ValueError('weight file not found in pretrain dir: %s' % pretrain_dir)
return model
class ErnieModel(nn.Layer, PretrainedModel):
def __init__(self, cfg, name=None):
"""
Fundamental pretrained Ernie model
"""
log.debug('init ErnieModel with config: %s' % repr(cfg))
nn.Layer.__init__(self)
d_model = cfg['hidden_size']
d_emb = cfg.get('emb_size', cfg['hidden_size'])
d_vocab = cfg['vocab_size']
d_pos = cfg['max_position_embeddings']
d_sent = cfg.get("sent_type_vocab_size") or cfg['type_vocab_size']
self.d_rel_pos = cfg.get('rel_pos_size', None)
max_seq_len = cfg.get("max_seq_len", 512)
self.n_head = cfg['num_attention_heads']
self.return_additional_info = cfg.get('return_additional_info', False)
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
if self.d_rel_pos:
self.rel_pos_bias = _get_rel_pos_bias(max_seq_len)
self.ln = _build_ln(d_model, name=append_name(name, 'pre_encoder'))
self.word_emb = nn.Embedding(d_vocab,
d_emb,
weight_attr=P.ParamAttr(name=append_name(name, 'word_embedding'),
initializer=initializer))
self.pos_emb = nn.Embedding(d_pos,
d_emb,
weight_attr=P.ParamAttr(name=append_name(name, 'pos_embedding'),
initializer=initializer))
self.sent_emb = nn.Embedding(d_sent,
d_emb,
weight_attr=P.ParamAttr(name=append_name(name, 'sent_embedding'),
initializer=initializer))
if self.d_rel_pos:
self.rel_pos_bias_emb = nn.Embedding(self.d_rel_pos,
self.n_head,
weight_attr=P.ParamAttr(name=append_name(name, 'rel_pos_embedding'),
initializer=initializer))
prob = cfg['hidden_dropout_prob']
self.dropout = nn.Dropout(p=prob)
self.encoder_stack = ErnieEncoderStack(cfg, append_name(name, 'encoder'))
if cfg.get('has_pooler', True):
self.pooler = _build_linear(
cfg['hidden_size'],
cfg['hidden_size'],
append_name(name, 'pooled_fc'),
initializer,
)
else:
self.pooler = None
self.train()
#FIXME:remove this
def eval(self):
if P.in_dynamic_mode():
super(ErnieModel, self).eval()
self.training = False
for l in self.sublayers():
l.training = False
return self
def train(self):
if P.in_dynamic_mode():
super(ErnieModel, self).train()
self.training = True
for l in self.sublayers():
l.training = True
return self
def forward(self,
src_ids,
sent_ids=None,
pos_ids=None,
input_mask=None,
attn_bias=None,
past_cache=None,
use_causal_mask=False):
"""
Args:
src_ids (`Variable` of shape `[batch_size, seq_len]`):
Indices of input sequence tokens in the vocabulary.
sent_ids (optional, `Variable` of shape `[batch_size, seq_len]`):
aka token_type_ids, Segment token indices to indicate first and second portions of the inputs.
if None, assume all tokens come from `segment_a`
pos_ids(optional, `Variable` of shape `[batch_size, seq_len]`):
Indices of positions of each input sequence tokens in the position embeddings.
input_mask(optional `Variable` of shape `[batch_size, seq_len]`):
Mask to avoid performing attention on the padding token indices of the encoder input.
attn_bias(optional, `Variable` of shape `[batch_size, seq_len, seq_len]` or False):
3D version of `input_mask`; if set, overrides `input_mask`. If set to `False`, no attention mask is applied.
past_cache(optional, tuple of two lists: cached key and cached value,
each is a list of `Variable`s of shape `[batch_size, seq_len, hidden_size]`):
cached key/value tensors that will be concatenated with the newly generated key/value when performing self attention.
if set, `attn_bias` should not be None.
Returns:
pooled (`Variable` of shape `[batch_size, hidden_size]`):
output logits of pooler classifier
encoded(`Variable` of shape `[batch_size, seq_len, hidden_size]`):
output logits of transformer stack
info (Dictionary):
additional middle-level info, includes: all hidden states, k/v caches.
"""
assert len(src_ids.shape) == 2, 'expect src_ids.shape = [batch, sequence], got %s' % (repr(src_ids.shape))
assert attn_bias is not None if past_cache else True, 'if `past_cache` is specified; attn_bias should not be None'
d_seqlen = P.shape(src_ids)[1]
if pos_ids is None:
pos_ids = P.arange(0, d_seqlen, 1, dtype='int32').reshape([1, -1]).cast('int64')
if attn_bias is None:
if input_mask is None:
input_mask = P.cast(src_ids != 0, 'float32')
assert len(input_mask.shape) == 2
input_mask = input_mask.unsqueeze(-1)
attn_bias = input_mask.matmul(input_mask, transpose_y=True)
if use_causal_mask:
sequence = P.reshape(P.arange(0, d_seqlen, 1, dtype='float32') + 1., [1, 1, -1, 1])
causal_mask = (sequence.matmul(1. / sequence, transpose_y=True) >= 1.).cast('float32')
attn_bias *= causal_mask
else:
assert len(attn_bias.shape) == 3, 'expect attn_bias to be rank 3, got %r' % attn_bias.shape
attn_bias = (1. - attn_bias) * -10000.0
attn_bias = attn_bias.unsqueeze(1).tile([1, self.n_head, 1, 1]) # avoid broadcast =_=
attn_bias.stop_gradient = True
if sent_ids is None:
sent_ids = P.zeros_like(src_ids)
if self.d_rel_pos:
rel_pos_ids = self.rel_pos_bias[:d_seqlen, :d_seqlen]
rel_pos_ids = P.to_tensor(rel_pos_ids, dtype='int64')
rel_pos_bias = self.rel_pos_bias_emb(rel_pos_ids).transpose([2, 0, 1])
attn_bias += rel_pos_bias
src_embedded = self.word_emb(src_ids)
pos_embedded = self.pos_emb(pos_ids)
sent_embedded = self.sent_emb(sent_ids)
embedded = src_embedded + pos_embedded + sent_embedded
embedded = self.dropout(self.ln(embedded))
encoded, hidden_list, cache_list = self.encoder_stack(embedded, attn_bias, past_cache=past_cache)
if self.pooler is not None:
pooled = F.tanh(self.pooler(encoded[:, 0, :]))
else:
pooled = None
additional_info = {
'hiddens': hidden_list,
'caches': cache_list,
}
if self.return_additional_info:
return pooled, encoded, additional_info
return pooled, encoded
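# --- Illustrative usage sketch (not part of the original file). The config is
# assumed to be a standard ERNIE json config such as ernie_config.base.json;
# a minimal forward pass would look like:
#   import json, paddle
#   cfg = json.load(open('ernie_config.base.json'))
#   model = ErnieModel(cfg).eval()
#   src_ids = paddle.randint(1, cfg['vocab_size'], shape=[2, 16])
#   pooled, encoded = model(src_ids)  # pooled: [2, hidden_size], encoded: [2, 16, hidden_size]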
class ErnieModelForSequenceClassification(ErnieModel):
"""
Ernie Model for text classification or pointwise ranking tasks
"""
def __init__(self, cfg, name=None):
super(ErnieModelForSequenceClassification, self).__init__(cfg, name=name)
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
self.classifier = _build_linear(cfg['hidden_size'], cfg['num_labels'], append_name(name, 'cls'), initializer)
prob = cfg.get('classifier_dropout_prob', cfg['hidden_dropout_prob'])
self.dropout = nn.Dropout(p=prob)
self.train()
@add_docstring(ErnieModel.forward.__doc__)
def forward(self, *args, **kwargs):
"""
Args:
labels (optional, `Variable` of shape [batch_size]):
ground truth label id for each sentence
Returns:
loss (`Variable` of shape []):
Cross entropy loss mean over batch
if labels not set, returns None
logits (`Variable` of shape [batch_size, num_labels]):
output logits of classifier
"""
labels = kwargs.pop('labels', None)
pooled, encoded = super(ErnieModelForSequenceClassification, self).forward(*args, **kwargs)
hidden = self.dropout(pooled)
logits = self.classifier(hidden)
if labels is not None:
if len(labels.shape) != 1:
labels = labels.squeeze()
loss = F.cross_entropy(logits, labels)
else:
loss = None
return loss, logits
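# Illustrative sketch (an assumption, not from the original file): with
# cfg['num_labels'] = 2 the classifier maps the pooled vector to 2 classes.
#   clf = ErnieModelForSequenceClassification(cfg).eval()
#   src_ids = paddle.randint(1, cfg['vocab_size'], shape=[4, 32])
#   labels = paddle.randint(0, 2, shape=[4])
#   loss, logits = clf(src_ids, labels=labels)  # logits: [4, 2]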
class ErnieModelForTokenClassification(ErnieModel):
"""
Ernie Model for named entity recognition (NER) tasks
"""
def __init__(self, cfg, name=None):
super(ErnieModelForTokenClassification, self).__init__(cfg, name=name)
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
self.classifier = _build_linear(cfg['hidden_size'], cfg['num_labels'], append_name(name, 'cls'), initializer)
prob = cfg.get('classifier_dropout_prob', cfg['hidden_dropout_prob'])
self.dropout = nn.Dropout(p=prob)
self.train()
@add_docstring(ErnieModel.forward.__doc__)
def forward(self, *args, **kwargs):
"""
Args:
labels (optional, `Variable` of shape [batch_size, seq_len]):
ground truth label id for each token
loss_weights (optional, `Variable` of shape [batch_size, seq_len]):
weights of the loss for each token
ignore_index (optional, int):
tokens whose label equals `ignore_index` do not contribute to the loss, default -100
Returns:
loss (`Variable` of shape []):
Cross entropy loss mean over batch and time, ignoring positions where label == `ignore_index`
if labels not set, returns None
logits (`Variable` of shape [batch_size, seq_len, num_labels]):
output logits of classifier
"""
ignore_index = kwargs.pop('ignore_index', -100)
labels = kwargs.pop('labels', None)
loss_weights = kwargs.pop('loss_weights', None)
pooled, encoded = super(ErnieModelForTokenClassification, self).forward(*args, **kwargs)
hidden = self.dropout(encoded) # maybe not?
logits = self.classifier(hidden)
if labels is not None:
if len(labels.shape) != 2:
labels = labels.squeeze()
loss = F.cross_entropy(logits, labels, ignore_index=ignore_index, reduction='none')
if loss_weights is not None:
loss = loss * loss_weights
loss = loss.mean()
else:
loss = None
return loss, logits
class ErnieModelForQuestionAnswering(ErnieModel):
"""
Ernie model for reading comprehension tasks (SQuAD)
"""
def __init__(self, cfg, name=None):
super(ErnieModelForQuestionAnswering, self).__init__(cfg, name=name)
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
self.classifier = _build_linear(cfg['hidden_size'], 2, append_name(name, 'cls_mrc'), initializer)
prob = cfg.get('classifier_dropout_prob', cfg['hidden_dropout_prob'])
self.dropout = nn.Dropout(p=prob)
self.train()
@add_docstring(ErnieModel.forward.__doc__)
def forward(self, *args, **kwargs):
"""
Args:
start_pos (optional, `Variable` of shape [batch_size]):
token index of start of answer span in `context`
end_pos (optional, `Variable` of shape [batch_size]):
token index of end of answer span in `context`
Returns:
loss (`Variable` of shape []):
average of the start- and end-position cross entropy losses
if start_pos/end_pos not set, returns None
start_logits (`Variable` of shape [batch_size, seq_len]):
output logits of start position, use argmax(start_logits) to get start index
end_logits (`Variable` of shape [batch_size, seq_len]):
output logits of end position, use argmax(end_logits) to get end index
"""
start_pos = kwargs.pop('start_pos', None)
end_pos = kwargs.pop('end_pos', None)
pooled, encoded = super(ErnieModelForQuestionAnswering, self).forward(*args, **kwargs)
encoded = self.dropout(encoded)
encoded = self.classifier(encoded)
start_logits, end_logits = P.unstack(encoded, axis=-1)
if start_pos is not None and end_pos is not None:
if len(start_pos.shape) != 1:
start_pos = start_pos.squeeze()
if len(end_pos.shape) != 1:
end_pos = end_pos.squeeze()
start_loss = F.cross_entropy(start_logits, start_pos)
end_loss = F.cross_entropy(end_logits, end_pos)
loss = (start_loss.mean() + end_loss.mean()) / 2.
else:
loss = None
return loss, start_logits, end_logits
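# Illustrative sketch (added for clarity; `cfg` is assumed as above): the span
# prediction head yields one start and one end logit per token.
#   qa = ErnieModelForQuestionAnswering(cfg).eval()
#   src_ids = paddle.randint(1, cfg['vocab_size'], shape=[2, 64])
#   start_pos = paddle.randint(0, 64, shape=[2])
#   end_pos = paddle.randint(0, 64, shape=[2])
#   loss, start_logits, end_logits = qa(src_ids, start_pos=start_pos, end_pos=end_pos)
#   span = (start_logits.argmax(-1), end_logits.argmax(-1))  # predicted answer span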
class NSPHead(nn.Layer):
def __init__(self, cfg, name=None):
super(NSPHead, self).__init__()
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
self.nsp = _build_linear(cfg['hidden_size'], 2, append_name(name, 'nsp_fc'), initializer)
def forward(self, inputs, labels):
"""
Args:
inputs (`Variable` of shape [batch_size, hidden_size]):
pooled sentence-pair representation, typically the `pooled` output of `ErnieModel`
labels (`Variable` of shape [batch_size]):
ground truth labels for `next sentence prediction`
Returns:
loss (`Variable` of shape []):
Cross entropy loss mean over batch
"""
logits = self.nsp(inputs)
loss = F.cross_entropy(logits, labels)
return loss
class ErnieModelForPretraining(ErnieModel):
"""
Ernie Model for masked language model pretraining
"""
def __init__(self, cfg, name=None):
super(ErnieModelForPretraining, self).__init__(cfg, name=name)
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
d_model = cfg['hidden_size']
d_vocab = cfg['vocab_size']
self.pooler_heads = nn.LayerList([NSPHead(cfg, name=name)])
self.mlm = _build_linear(
d_model,
d_model,
append_name(name, 'mask_lm_trans_fc'),
initializer,
)
self.act = ACT_DICT[cfg['hidden_act']]()
self.mlm_ln = _build_ln(d_model, name=append_name(name, 'mask_lm_trans'))
self.mlm_bias = P.create_parameter(
dtype='float32',
shape=[d_vocab],
attr=P.ParamAttr(name=append_name(name, 'mask_lm_out_fc.b_0'),
initializer=nn.initializer.Constant(value=0.0)),
is_bias=True,
)
self.train()
@add_docstring(ErnieModel.forward.__doc__)
def forward(self, *args, **kwargs):
"""
Args:
nsp_labels (optional, `Variable` of shape [batch_size]):
labels for `next sentence prediction` tasks
mlm_pos (optional, `Variable` of shape [n_mask, 2]):
index of mask_id in `src_ids`, can be obtained from `fluid.layers.where(src_ids==mask_id)`
labels (optional, `Variable` of shape [n_mask]):
labels for `mask language model` tasks, the original token indices in masked position in `src_ids`
Returns:
loss (`Variable` of shape []):
total_loss of `next sentence prediction` and `masked language model`
mlm_loss (`Variable` of shape []):
loss for `masked language model` task
nsp_loss (`Variable` of shape []):
loss for `next sentence prediction` task
"""
mlm_labels = kwargs.pop('labels')
mlm_pos = kwargs.pop('mlm_pos')
nsp_labels = kwargs.pop('nsp_labels')
pooled, encoded = super(ErnieModelForPretraining, self).forward(*args, **kwargs)
if len(mlm_labels.shape) != 1:
mlm_labels = mlm_labels.squeeze()
if len(nsp_labels.shape) != 1:
nsp_labels = nsp_labels.squeeze()
nsp_loss = self.pooler_heads[0](pooled, nsp_labels)
encoded_2d = encoded.gather_nd(mlm_pos)
encoded_2d = self.act(self.mlm(encoded_2d))
encoded_2d = self.mlm_ln(encoded_2d)
logits_2d = encoded_2d.matmul(self.word_emb.weight, transpose_y=True) + self.mlm_bias
mlm_loss = F.cross_entropy(logits_2d, mlm_labels)
total_loss = mlm_loss + nsp_loss
return total_loss, mlm_loss, nsp_loss
class ErnieModelForGeneration(ErnieModel):
"""
Ernie Model for sequence to sequence generation.
"""
resource_map = {
'ernie-gen-base-en': ErnieModel.bce + 'model-ernie-gen-base-en.1.tar.gz',
'ernie-gen-large-en': ErnieModel.bce + 'model-ernie-gen-large-en.1.tar.gz',
'ernie-gen-large-430g-en': ErnieModel.bce + 'model-ernie-gen-large-430g-en.1.tar.gz',
'ernie-1.0': ErnieModel.bce + 'model-ernie1.0.1.tar.gz',
}
def __init__(self, cfg, name=None):
cfg['return_additional_info'] = True
cfg['has_pooler'] = False
super(ErnieModelForGeneration, self).__init__(cfg, name=name)
initializer = nn.initializer.TruncatedNormal(std=cfg['initializer_range'])
d_model = cfg['hidden_size']
d_vocab = cfg['vocab_size']
self.mlm = _build_linear(
d_model,
d_model,
append_name(name, 'mask_lm_trans_fc'),
initializer,
)
self.act = ACT_DICT[cfg['hidden_act']]()
self.mlm_ln = _build_ln(d_model, name=append_name(name, 'mask_lm_trans'))
self.mlm_bias = P.create_parameter(
dtype='float32',
shape=[d_vocab],
attr=P.ParamAttr(name=append_name(name, 'mask_lm_out_fc.b_0'),
initializer=nn.initializer.Constant(value=0.0)),
is_bias=True,
)
self.train()
@add_docstring(ErnieModel.forward.__doc__)
def forward(self, *args, **kwargs):
"""
Args:
tgt_labels(`Variable` of shape [batch_size, seqlen] or [batch, seqlen, vocab_size]):
ground truth target sequence id (hard label) or distribution (soft label)
tgt_pos(`Variable` of shape [n_targets, 2]):
index of tgt_labels in `src_ids`, can be obtained from `fluid.layers.where(src_ids==mask_id)`
encode_only(Bool):
if set, loss and logits are returned as None and only the encoder runs
Returns:
loss(`Variable` of shape []):
cross entropy loss mean over every target label. if `encode_only`, returns None.
logits(`Variable` of shape [n_targets, vocab_size]):
logits for every targets. if `encode_only`, returns None.
info(Dictionary): see `ErnieModel`
"""
tgt_labels = kwargs.pop('tgt_labels', None)
tgt_pos = kwargs.pop('tgt_pos', None)
encode_only = kwargs.pop('encode_only', False)
_, encoded, info = ErnieModel.forward(self, *args, **kwargs)
if encode_only:
return None, None, info
if tgt_labels is None or tgt_pos is None:
encoded = self.act(self.mlm(encoded))
encoded = self.mlm_ln(encoded)
logits = encoded.matmul(self.word_emb.weight, transpose_y=True) + self.mlm_bias
output_ids = logits.cast('float32').argmax(-1)
return output_ids, logits, info
else:
encoded_2d = encoded.gather_nd(tgt_pos)
encoded_2d = self.act(self.mlm(encoded_2d))
encoded_2d = self.mlm_ln(encoded_2d)
logits_2d = encoded_2d.matmul(self.word_emb.weight, transpose_y=True) + self.mlm_bias
assert len(tgt_labels.shape) == 2, 'expect 2d label, got %r' % tgt_labels.shape
loss = F.cross_entropy(logits_2d, tgt_labels, soft_label=True)
return loss, logits_2d, info
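# Illustrative sketch (an assumption, not from the original file): without
# tgt_labels/tgt_pos the generation head greedily decodes an id per position.
#   gen = ErnieModelForGeneration(cfg).eval()
#   output_ids, logits, info = gen(src_ids)  # output_ids: [batch, seq_len]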
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import logging
import os
import re
import sys
import tempfile
from functools import partial
import six
if six.PY2:
from pathlib2 import Path
else:
from pathlib import Path
from tqdm import tqdm
import numpy as np
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.transformers.file_utils import _fetch_from_remote
import io
open = partial(io.open, encoding='utf8')
log = logging.getLogger(__name__)
_max_input_chars_per_word = 100
def _wordpiece(token, vocab, unk_token, prefix='##', sentencepiece_prefix=''):
""" wordpiece: helloworld => [hello, ##world] """
chars = list(token)
if len(chars) > _max_input_chars_per_word:
return [unk_token], [(0, len(chars))]
is_bad = False
start = 0
sub_tokens = []
sub_pos = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start == 0:
substr = sentencepiece_prefix + substr
if start > 0:
substr = prefix + substr
if substr in vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
sub_pos.append((start, end))
start = end
if is_bad:
return [unk_token], [(0, len(chars))]
else:
return sub_tokens, sub_pos
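# Worked example (illustrative; assumes 'hello' and '##world' are in `vocab`):
#   _wordpiece('helloworld', vocab, unk_token='[UNK]')
#   -> (['hello', '##world'], [(0, 5), (5, 10)])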
class ErnieTokenizer(object):
bce = 'https://ernie-github.cdn.bcebos.com/'
resource_map = {
'ernie-1.0': bce + 'model-ernie1.0.1.tar.gz',
'ernie-2.0-en': bce + 'model-ernie2.0-en.1.tar.gz',
'ernie-2.0-large-en': bce + 'model-ernie2.0-large-en.1.tar.gz',
'ernie-tiny': bce + 'model-ernie_tiny.1.tar.gz',
'ernie-gen-base-en': bce + 'model-ernie-gen-base-en.1.tar.gz',
'ernie-gen-large-en': bce + 'model-ernie-gen-large-en.1.tar.gz',
'ernie-gram-zh': bce + 'model-ernie-gram-zh.1.tar.gz',
'ernie-gram-en': bce + 'model-ernie-gram-en.1.tar.gz',
}
@classmethod
def from_pretrained(cls, pretrain_dir_or_url, force_download=False, **kwargs):
if not Path(pretrain_dir_or_url).exists() and str(pretrain_dir_or_url) in cls.resource_map:
url = cls.resource_map[str(pretrain_dir_or_url)]
log.info('get pretrain dir from %s' % url)
pretrain_dir = _fetch_from_remote(url, force_download=force_download)
else:
log.info('pretrain dir %s not in %s, read from local' % (pretrain_dir_or_url, repr(cls.resource_map)))
pretrain_dir = Path(pretrain_dir_or_url)
if not pretrain_dir.exists():
raise ValueError('pretrain dir not found: %s, optional: %s' % (pretrain_dir, cls.resource_map.keys()))
vocab_path = pretrain_dir / 'vocab.txt'
if not vocab_path.exists():
raise ValueError('no vocab file in pretrain dir: %s' % pretrain_dir)
vocab_dict = {j.strip().split('\t')[0]: i for i, j in enumerate(vocab_path.open(encoding='utf8').readlines())}
t = cls(vocab_dict, **kwargs)
return t
def __init__(self,
vocab,
unk_token='[UNK]',
sep_token='[SEP]',
cls_token='[CLS]',
pad_token='[PAD]',
mask_token='[MASK]',
wordpiece_prefix='##',
sentencepiece_prefix='',
lower=True,
encoding='utf8',
special_token_list=[]):
if not isinstance(vocab, dict):
raise ValueError('expect `vocab` to be instance of dict, got %s' % type(vocab))
self.vocab = vocab
self.lower = lower
self.prefix = wordpiece_prefix
self.sentencepiece_prefix = sentencepiece_prefix
self.pad_id = self.vocab[pad_token]
self.cls_id = cls_token and self.vocab[cls_token]
self.sep_id = sep_token and self.vocab[sep_token]
self.unk_id = unk_token and self.vocab[unk_token]
self.mask_id = mask_token and self.vocab[mask_token]
self.unk_token = unk_token
special_tokens = {pad_token, cls_token, sep_token, unk_token, mask_token} | set(special_token_list)
pat_str = ''
for t in special_tokens:
if t is None:
continue
pat_str += '(%s)|' % re.escape(t)
pat_str += r'([a-zA-Z0-9]+|\S)'
log.debug('regex: %s' % pat_str)
self.pat = re.compile(pat_str)
self.encoding = encoding
def tokenize(self, text):
if len(text) == 0:
return []
if six.PY3 and not isinstance(text, six.string_types):
text = text.decode(self.encoding)
if six.PY2 and isinstance(text, str):
text = text.decode(self.encoding)
res = []
for match in self.pat.finditer(text):
match_group = match.group(0)
if match.groups()[-1]:
if self.lower:
match_group = match_group.lower()
words, _ = _wordpiece(match_group,
vocab=self.vocab,
unk_token=self.unk_token,
prefix=self.prefix,
sentencepiece_prefix=self.sentencepiece_prefix)
else:
words = [match_group]
res += words
return res
def convert_tokens_to_ids(self, tokens):
return [self.vocab.get(t, self.unk_id) for t in tokens]
def truncate(self, id1, id2, seqlen):
len1 = len(id1)
len2 = len(id2)
half = seqlen // 2
if len1 > len2:
len1_truncated, len2_truncated = max(half, seqlen - len2), min(half, len2)
else:
len1_truncated, len2_truncated = min(half, len1), max(half, seqlen - len1)
return id1[:len1_truncated], id2[:len2_truncated]
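# Worked example: with seqlen=8 (half=4), len(id1)=10 and len(id2)=3, the longer
# side gets max(4, 8-3)=5 tokens and the shorter keeps min(4, 3)=3 tokens.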
def build_for_ernie(self, text_id, pair_id=[]):
"""build sentence type id, add [CLS] [SEP]"""
text_id_type = np.zeros_like(text_id, dtype=np.int64)
ret_id = np.concatenate([[self.cls_id], text_id, [self.sep_id]], 0)
ret_id_type = np.concatenate([[0], text_id_type, [0]], 0)
if len(pair_id):
pair_id_type = np.ones_like(pair_id, dtype=np.int64)
ret_id = np.concatenate([ret_id, pair_id, [self.sep_id]], 0)
ret_id_type = np.concatenate([ret_id_type, pair_id_type, [1]], 0)
return ret_id, ret_id_type
def encode(self, text, pair=None, truncate_to=None):
text_id = np.array(self.convert_tokens_to_ids(self.tokenize(text)), dtype=np.int64)
text_id_type = np.zeros_like(text_id, dtype=np.int64)
if pair is not None:
pair_id = np.array(self.convert_tokens_to_ids(self.tokenize(pair)), dtype=np.int64)
else:
pair_id = []
if truncate_to is not None:
text_id, pair_id = self.truncate(text_id, [] if pair_id is None else pair_id, truncate_to)
ret_id, ret_id_type = self.build_for_ernie(text_id, pair_id)
return ret_id, ret_id_type
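# Illustrative usage sketch (downloads weights on first call; the exact ids
# depend on the vocab, so no concrete values are shown):
#   tok = ErnieTokenizer.from_pretrained('ernie-1.0')
#   ids, type_ids = tok.encode('前方的路', pair='后方的路', truncate_to=30)
#   # ids      = [CLS] text [SEP] pair [SEP]
#   # type_ids = 0 over the [CLS]/text/[SEP] span, 1 over the pair span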
class ErnieTinyTokenizer(ErnieTokenizer):
bce = 'https://ernie-github.cdn.bcebos.com/'
resource_map = {'ernie-tiny': bce + 'model-ernie_tiny.1.tar.gz'}
@classmethod
def from_pretrained(cls, pretrain_dir_or_url, force_download=False, **kwargs):
if not Path(pretrain_dir_or_url).exists() and str(pretrain_dir_or_url) in cls.resource_map:
url = cls.resource_map[str(pretrain_dir_or_url)]
log.info('get pretrain dir from %s' % url)
pretrain_dir = _fetch_from_remote(url, force_download)
else:
log.info('pretrain dir %s not in %s, read from local' % (pretrain_dir_or_url, repr(cls.resource_map)))
pretrain_dir = Path(pretrain_dir_or_url)
if not pretrain_dir.exists():
raise ValueError('pretrain dir not found: %s' % pretrain_dir)
vocab_path = pretrain_dir / 'vocab.txt'
sp_model_path = pretrain_dir / 'subword/spm_cased_simp_sampled.model'
if not vocab_path.exists():
raise ValueError('no vocab file in pretrain dir: %s' % pretrain_dir)
vocab_dict = {j.strip().split('\t')[0]: i for i, j in enumerate(vocab_path.open(encoding='utf8').readlines())}
t = cls(vocab_dict, sp_model_path, **kwargs)
return t
def __init__(self, vocab, sp_model_path, **kwargs):
super(ErnieTinyTokenizer, self).__init__(vocab, **kwargs)
import sentencepiece as spm
import jieba as jb
self.sp_model = spm.SentencePieceProcessor()
self.window_size = 5
self.sp_model.Load(str(sp_model_path))
self.jb = jb
def cut(self, sentence):
return self.jb.cut(sentence)
def tokenize(self, text):
if len(text) == 0:
return []
if not isinstance(text, six.string_types):
text = text.decode(self.encoding)
if self.lower:
text = text.lower()
res = []
for match in self.cut(text):
res += self.sp_model.EncodeAsPieces(match)
return res
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import logging
import os
import time
import six
from tqdm import tqdm
if six.PY2:
from pathlib2 import Path
else:
from pathlib import Path
log = logging.getLogger(__name__)
def _fetch_from_remote(url, force_download=False, cached_dir='~/.paddle-ernie-cache'):
import hashlib, tempfile, requests, tarfile
sig = hashlib.md5(url.encode('utf8')).hexdigest()
cached_dir = Path(cached_dir).expanduser()
try:
cached_dir.mkdir()
except OSError:
pass
cached_dir_model = cached_dir / sig
from filelock import FileLock
with FileLock(str(cached_dir_model) + '.lock'):
donefile = cached_dir_model / 'done'
if (not force_download) and donefile.exists():
log.debug('%s cached in %s' % (url, cached_dir_model))
return cached_dir_model
cached_dir_model.mkdir(exist_ok=True)
tmpfile = cached_dir_model / 'tmp'
with tmpfile.open('wb') as f:
r = requests.get(url, stream=True)
total_len = int(r.headers.get('content-length'))
for chunk in tqdm(r.iter_content(chunk_size=1024),
total=total_len // 1024,
desc='downloading %s' % url,
unit='KB'):
if chunk:
f.write(chunk)
f.flush()
log.debug('extracting %s to %s' % (tmpfile, cached_dir_model))
with tarfile.open(tmpfile.as_posix()) as tf:
tf.extractall(path=str(cached_dir_model))
donefile.touch()
os.remove(tmpfile.as_posix())
return cached_dir_model
def add_docstring(doc):
def func(f):
f.__doc__ += ('\n====== other docs from super class ======\n%s' % doc)
return f
return func
# copyright (c) 2022 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
class MultiModalModel(nn.Layer):
def __init__(self, image_model=None, text_model=None, args=None):
super(MultiModalModel, self).__init__()
self.visual = image_model
self.text_model = text_model
def encode_text(self, input_ids, pos_ids=None):
pool_out, text_embedding = self.text_model(input_ids, pos_ids=pos_ids)
return pool_out
def encode_image(self, img_word):
img_embedding = self.visual(img_word)
return img_embedding[:, 0]
def forward(self, img_word=None, input_ids=None, pos_ids=None):
img_embedding = self.visual(img_word)
img_embedding = img_embedding[:, 0]
pool_out, text_embedding = self.text_model(input_ids, pos_ids=pos_ids)
return img_embedding, pool_out
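# Illustrative sketch (an assumption; `model` built as in build_vit_b_16x_model
# later in this dump): both towers map into the same feature space, so image
# and text features can be compared directly.
#   img = paddle.randn([1, 3, 224, 224])
#   ids = paddle.randint(1, 18000, shape=[1, 32])  # 18000 is a placeholder vocab size
#   img_feat, txt_feat = model(img_word=img, input_ids=ids)
#   sim = paddle.nn.functional.cosine_similarity(img_feat, txt_feat)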
# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Implement Transformer Class for ViT
"""
import copy
import paddle
import paddle.nn as nn
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.transformers.droppath import DropPath
class Identity(nn.Layer):
""" Identity layer
The output of this layer is the input without any change.
Use this layer to avoid using 'if' condition in forward methods
"""
def __init__(self):
super().__init__()
def forward(self, x):
return x
class PatchEmbedding(nn.Layer):
"""Patch Embedding and Position Embedding
Apply patch embedding and position embedding on input images.
Attributes:
patch_embeddings: impl using a patch_size x patch_size Conv2D operation
position_embeddings: a parameter with len = num_patch + 1 (for cls_token)
cls_token: token insert to the patch feature for classification
dropout: dropout for embeddings
"""
def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768, dropout=0.):
super().__init__()
n_patches = (image_size // patch_size) * (image_size // patch_size)
self.patch_embedding = nn.Conv2D(in_channels=in_channels,
out_channels=embed_dim,
kernel_size=patch_size,
stride=patch_size)
self.position_embeddings = paddle.create_parameter(
shape=[1, n_patches + 1, embed_dim],
dtype='float32',
default_initializer=paddle.nn.initializer.TruncatedNormal(std=.02))
self.cls_token = paddle.create_parameter(shape=[1, 1, embed_dim],
dtype='float32',
default_initializer=paddle.nn.initializer.Constant(0))
self.dropout = nn.Dropout(dropout)
def forward(self, x):
cls_tokens = self.cls_token.expand((x.shape[0], -1, -1))
x = self.patch_embedding(x)
x = x.flatten(2)
x = x.transpose([0, 2, 1])
x = paddle.concat((cls_tokens, x), axis=1)
embeddings = x + self.position_embeddings # tensor broadcast
embeddings = self.dropout(embeddings)
return embeddings
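# Shape walk-through for the defaults (image_size=224, patch_size=16, embed_dim=768):
# (224 // 16) ** 2 = 196 patches; prepending cls_token gives [batch, 197, 768].
#   pe = PatchEmbedding()
#   out = pe(paddle.randn([2, 3, 224, 224]))  # out.shape == [2, 197, 768]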
class Attention(nn.Layer):
""" Attention module
Attention module for ViT, here q, k, v are assumed the same.
The qkv mappings are stored as one single param.
Attributes:
num_heads: number of heads
attn_head_size: feature dim of single head
all_head_size: feature dim of all heads
qkv: a nn.Linear for q, k, v mapping
scales: 1 / sqrt(single_head_feature_dim)
out: projection of multi-head attention
attn_dropout: dropout for attention
proj_dropout: final dropout before output
softmax: softmax op for attention
"""
def __init__(self, embed_dim, num_heads, attn_head_size=None, qkv_bias=True, dropout=0., attention_dropout=0.):
super().__init__()
assert isinstance(embed_dim,
int), (f"Expected the type of `embed_dim` to be {int}, but received {type(embed_dim)}.")
assert isinstance(num_heads,
int), (f"Expected the type of `num_heads` to be {int}, but received {type(num_heads)}.")
assert embed_dim > 0, (f"Expected `embed_dim` to be greater than 0, but received {embed_dim}")
assert num_heads > 0, (f"Expected `num_heads` to be greater than 0, but received {num_heads}")
self.embed_dim = embed_dim
self.num_heads = num_heads
if attn_head_size is not None:
assert isinstance(attn_head_size, int), (f"Expected the type of `attn_head_size` to be {int}, "
f"but received {type(attn_head_size)}.")
assert attn_head_size > 0, f"Expected `attn_head_size` to be greater than 0," \
f" but received {attn_head_size}."
self.attn_head_size = attn_head_size
else:
self.attn_head_size = embed_dim // num_heads
assert self.attn_head_size * num_heads == embed_dim, (
f"`embed_dim` must be divisible by `num_heads`,"
f" but received embed_dim={embed_dim}, num_heads={num_heads}.")
self.all_head_size = self.attn_head_size * num_heads
w_attr_1, b_attr_1 = self._init_weights()
self.qkv = nn.Linear(
embed_dim,
self.all_head_size * 3, # weights for q, k, and v
weight_attr=w_attr_1,
bias_attr=b_attr_1 if qkv_bias else False)
self.scales = self.attn_head_size**-0.5
w_attr_2, b_attr_2 = self._init_weights()
self.out = nn.Linear(self.all_head_size, embed_dim, weight_attr=w_attr_2, bias_attr=b_attr_2)
self.attn_dropout = nn.Dropout(attention_dropout)
self.proj_dropout = nn.Dropout(dropout)
self.softmax = nn.Softmax(axis=-1)
def _init_weights(self):
weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02))
bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0))
return weight_attr, bias_attr
def transpose_multihead(self, x):
new_shape = x.shape[:-1] + [self.num_heads, self.attn_head_size]
x = x.reshape(new_shape)
x = x.transpose([0, 2, 1, 3])
return x
def forward(self, x):
qkv = self.qkv(x).chunk(3, axis=-1)
q, k, v = map(self.transpose_multihead, qkv)
attn = paddle.matmul(q, k, transpose_y=True)
attn = attn * self.scales
attn = self.softmax(attn)
attn = self.attn_dropout(attn)
z = paddle.matmul(attn, v)
z = z.transpose([0, 2, 1, 3])
new_shape = z.shape[:-2] + [self.all_head_size]
z = z.reshape(new_shape)
# reshape
z = self.out(z)
z = self.proj_dropout(z)
return z
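# Illustrative sketch: self-attention preserves the token-feature shape.
#   attn = Attention(embed_dim=768, num_heads=12)
#   y = attn(paddle.randn([2, 197, 768]))  # y.shape == [2, 197, 768]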
class Mlp(nn.Layer):
""" MLP module
Impl using nn.Linear and activation is GELU, dropout is applied.
Ops: fc -> act -> dropout -> fc -> dropout
Attributes:
fc1: nn.Linear
fc2: nn.Linear
act: GELU
dropout1: dropout after fc1
dropout2: dropout after fc2
"""
def __init__(self, embed_dim, mlp_ratio, dropout=0.):
super().__init__()
w_attr_1, b_attr_1 = self._init_weights()
self.fc1 = nn.Linear(embed_dim, int(embed_dim * mlp_ratio), weight_attr=w_attr_1, bias_attr=b_attr_1)
w_attr_2, b_attr_2 = self._init_weights()
self.fc2 = nn.Linear(int(embed_dim * mlp_ratio), embed_dim, weight_attr=w_attr_2, bias_attr=b_attr_2)
self.act = nn.GELU()
self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(dropout)
def _init_weights(self):
weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.2))
bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0))
return weight_attr, bias_attr
def forward(self, x):
x = self.fc1(x)
x = self.act(x)
x = self.dropout1(x)
x = self.fc2(x)
x = self.dropout2(x)
return x
class EncoderLayer(nn.Layer):
"""Encoder Layer
Encoder layer contains attention, norm, mlp and residual
Attributes:
hidden_size: transformer feature dim
attn_norm: nn.LayerNorm before attention
mlp_norm: nn.LayerNorm before mlp
mlp: mlp module
attn: attention module
"""
def __init__(self,
embed_dim,
num_heads,
attn_head_size=None,
qkv_bias=True,
mlp_ratio=4.,
dropout=0.,
attention_dropout=0.,
droppath=0.):
super().__init__()
w_attr_1, b_attr_1 = self._init_weights()
self.attn_norm = nn.LayerNorm(embed_dim, weight_attr=w_attr_1, bias_attr=b_attr_1, epsilon=1e-6)
self.attn = Attention(embed_dim, num_heads, attn_head_size, qkv_bias, dropout, attention_dropout)
self.drop_path = DropPath(droppath) if droppath > 0. else Identity()
w_attr_2, b_attr_2 = self._init_weights()
self.mlp_norm = nn.LayerNorm(embed_dim, weight_attr=w_attr_2, bias_attr=b_attr_2, epsilon=1e-6)
self.mlp = Mlp(embed_dim, mlp_ratio, dropout)
def _init_weights(self):
weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0))
bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0))
return weight_attr, bias_attr
def forward(self, x):
h = x
x = self.attn_norm(x)
x = self.attn(x)
x = self.drop_path(x)
x = x + h
h = x
x = self.mlp_norm(x)
x = self.mlp(x)
x = self.drop_path(x)
x = x + h
return x
class Encoder(nn.Layer):
"""Transformer encoder
The encoder contains a list of EncoderLayers and a final LayerNorm.
Attributes:
layers: nn.LayerList contains multiple EncoderLayers
encoder_norm: nn.LayerNorm which is applied after last encoder layer
"""
def __init__(self,
embed_dim,
num_heads,
depth,
attn_head_size=None,
qkv_bias=True,
mlp_ratio=4.0,
dropout=0.,
attention_dropout=0.,
droppath=0.):
super().__init__()
# stochastic depth decay
depth_decay = [x.item() for x in paddle.linspace(0, droppath, depth)]
layer_list = []
for i in range(depth):
encoder_layer = EncoderLayer(embed_dim,
num_heads,
attn_head_size=attn_head_size,
qkv_bias=qkv_bias,
mlp_ratio=mlp_ratio,
dropout=dropout,
attention_dropout=attention_dropout,
droppath=depth_decay[i])
layer_list.append(copy.deepcopy(encoder_layer))
self.layers = nn.LayerList(layer_list)
w_attr_1, b_attr_1 = self._init_weights()
self.encoder_norm = nn.LayerNorm(embed_dim, weight_attr=w_attr_1, bias_attr=b_attr_1, epsilon=1e-6)
def _init_weights(self):
weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0))
bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0))
return weight_attr, bias_attr
def forward(self, x):
for layer in self.layers:
x = layer(x)
out = self.encoder_norm(x)
return out
class VisualTransformer(nn.Layer):
"""ViT transformer
ViT Transformer: the classifier is a single Linear layer for finetuning;
for training from scratch, a two-layer MLP is used.
Classification is done using cls_token.
Args:
image_size: int, input image size, default: 224
patch_size: int, patch size, default: 16
in_channels: int, input image channels, default: 3
num_classes: int, number of classes for classification, default: 1000
embed_dim: int, embedding dimension (patch embed out dim), default: 768
depth: int, number of transformer blocks, default: 12
num_heads: int, number of attention heads, default: 12
mlp_ratio: float, ratio of mlp hidden dim to embed dim(mlp in dim), default: 4.0
qkv_bias: bool, If True, enable qkv(nn.Linear) layer with bias, default: True
dropout: float, dropout rate for linear layers, default: 0.
attention_dropout: float, dropout rate for attention layers default: 0.
droppath: float, droppath rate for droppath layers, default: 0.
"""
def __init__(self,
image_size=224,
patch_size=16,
in_channels=3,
num_classes=768,
embed_dim=768,
depth=12,
num_heads=12,
attn_head_size=None,
mlp_ratio=4,
qkv_bias=True,
dropout=0.,
attention_dropout=0.,
droppath=0.,
train_from_scratch=False):
super().__init__()
# create patch embedding with positional embedding
self.patch_embedding = PatchEmbedding(image_size, patch_size, in_channels, embed_dim, dropout)
# create multi head self-attention layers
self.encoder = Encoder(embed_dim, num_heads, depth, attn_head_size, qkv_bias, mlp_ratio, dropout,
attention_dropout, droppath)
# classifier head (for training from scratch)
if train_from_scratch:
w_attr_1, b_attr_1 = self._init_weights()
w_attr_2, b_attr_2 = self._init_weights()
self.classifier = nn.Sequential(
nn.Linear(embed_dim, embed_dim, weight_attr=w_attr_1, bias_attr=b_attr_1),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(embed_dim, num_classes, weight_attr=w_attr_2, bias_attr=b_attr_2),
nn.Dropout(dropout),
)
else:
# classifier head (for finetuning)
w_attr_1, b_attr_1 = self._init_weights()
self.classifier = nn.Linear(embed_dim, num_classes, weight_attr=w_attr_1, bias_attr=b_attr_1)
def _init_weights(self):
weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02))
bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0))
return weight_attr, bias_attr
def forward(self, x):
x = self.patch_embedding(x)
x = self.encoder(x)
logits = self.classifier(x[:, 0]) # classify using only the cls_token feature
return logits
# return x
def build_vit(config):
"""build vit model from config"""
model = VisualTransformer(image_size=config.DATA.IMAGE_SIZE,
patch_size=config.MODEL.TRANS.PATCH_SIZE,
in_channels=config.DATA.IMAGE_CHANNELS,
num_classes=config.MODEL.NUM_CLASSES,
embed_dim=config.MODEL.TRANS.EMBED_DIM,
depth=config.MODEL.TRANS.DEPTH,
num_heads=config.MODEL.TRANS.NUM_HEADS,
attn_head_size=config.MODEL.TRANS.ATTN_HEAD_SIZE,
mlp_ratio=config.MODEL.TRANS.MLP_RATIO,
qkv_bias=config.MODEL.TRANS.QKV_BIAS,
dropout=config.MODEL.DROPOUT,
attention_dropout=config.MODEL.ATTENTION_DROPOUT,
droppath=config.MODEL.DROPPATH,
train_from_scratch=False)
return model
def ViT_large_patch16_384(**kwargs):
model = VisualTransformer(image_size=384,
patch_size=16,
in_channels=3,
embed_dim=1024,
depth=24,
num_heads=16,
attn_head_size=64,
mlp_ratio=4.0,
qkv_bias=True,
dropout=0.1,
attention_dropout=0.1,
train_from_scratch=False)
return model
def ViT_large_patch16_224(**kwargs):
model = VisualTransformer(image_size=224,
patch_size=16,
in_channels=3,
embed_dim=1024,
depth=24,
num_heads=16,
attn_head_size=64,
mlp_ratio=4.0,
qkv_bias=True,
dropout=0.1,
attention_dropout=0.1,
train_from_scratch=False)
return model
def ViT_base_patch16_224(**kwargs):
model = VisualTransformer(image_size=224,
patch_size=16,
in_channels=3,
embed_dim=768,
depth=12,
num_heads=12,
attn_head_size=64,
mlp_ratio=4.0,
qkv_bias=True,
dropout=0,
attention_dropout=0,
train_from_scratch=False)
return model
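# Illustrative usage sketch (random weights; num_classes defaults to 768, so the
# classifier head emits a 768-d feature rather than class logits):
#   model = ViT_base_patch16_224()
#   feats = model(paddle.randn([1, 3, 224, 224]))  # feats.shape == [1, 768]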
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import paddle
import paddle.nn as nn
from paddle.utils.download import get_weights_path_from_url
__all__ = []
model_urls = {
'resnet18': ('https://paddle-hapi.bj.bcebos.com/models/resnet18.pdparams', 'cf548f46534aa3560945be4b95cd11c4'),
'resnet34': ('https://paddle-hapi.bj.bcebos.com/models/resnet34.pdparams', '8d2275cf8706028345f78ac0e1d31969'),
'resnet50': ('https://paddle-hapi.bj.bcebos.com/models/resnet50.pdparams', 'ca6f485ee1ab0492d38f323885b0ad80'),
'resnet101': ('https://paddle-hapi.bj.bcebos.com/models/resnet101.pdparams', '02f35f034ca3858e1e54d4036443c92d'),
'resnet152': ('https://paddle-hapi.bj.bcebos.com/models/resnet152.pdparams', '7ad16a2f1e7333859ff986138630fd7a'),
'wide_resnet50_2':
('https://paddle-hapi.bj.bcebos.com/models/wide_resnet50_2.pdparams', '0282f804d73debdab289bd9fea3fa6dc'),
'wide_resnet101_2':
('https://paddle-hapi.bj.bcebos.com/models/wide_resnet101_2.pdparams', 'd4360a2d23657f059216f5d5a1a9ac93'),
}
class BasicBlock(nn.Layer):
expansion = 1
def __init__(self,
inplanes,
planes,
stride=1,
downsample=None,
groups=1,
base_width=64,
dilation=1,
norm_layer=None):
super(BasicBlock, self).__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2D
if dilation > 1:
raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
self.conv1 = nn.Conv2D(inplanes, planes, 3, padding=1, stride=stride, bias_attr=False)
self.bn1 = norm_layer(planes)
self.relu = nn.ReLU()
self.conv2 = nn.Conv2D(planes, planes, 3, padding=1, bias_attr=False)
self.bn2 = norm_layer(planes)
self.downsample = downsample
self.stride = stride
def forward(self, x):
identity = x
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out)
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class BottleneckBlock(nn.Layer):
expansion = 4
def __init__(self,
inplanes,
planes,
stride=1,
downsample=None,
groups=1,
base_width=64,
dilation=1,
norm_layer=None):
super(BottleneckBlock, self).__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2D
width = int(planes * (base_width / 64.)) * groups
self.conv1 = nn.Conv2D(inplanes, width, 1, bias_attr=False)
self.bn1 = norm_layer(width)
self.conv2 = nn.Conv2D(width,
width,
3,
padding=dilation,
stride=stride,
groups=groups,
dilation=dilation,
bias_attr=False)
self.bn2 = norm_layer(width)
self.conv3 = nn.Conv2D(width, planes * self.expansion, 1, bias_attr=False)
self.bn3 = norm_layer(planes * self.expansion)
self.relu = nn.ReLU()
self.downsample = downsample
self.stride = stride
def forward(self, x):
identity = x
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out)
out = self.relu(out)
out = self.conv3(out)
out = self.bn3(out)
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class ResNet(nn.Layer):
"""ResNet model from
`"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_
Args:
Block (BasicBlock|BottleneckBlock): block module of model.
depth (int): layers of resnet, default: 50.
width (int): base width of resnet, default: 64.
num_classes (int): output dim of last fc layer. If num_classes <=0, last fc layer
will not be defined. Default: 1000.
with_pool (bool): use pool before the last fc layer or not. Default: True.
Examples:
.. code-block:: python
import paddle
from paddle.vision.models import ResNet
from paddle.vision.models.resnet import BottleneckBlock, BasicBlock
resnet50 = ResNet(BottleneckBlock, 50)
wide_resnet50_2 = ResNet(BottleneckBlock, 50, width=64*2)
resnet18 = ResNet(BasicBlock, 18)
x = paddle.rand([1, 3, 224, 224])
out = resnet18(x)
print(out.shape)
"""
def __init__(self, block, depth=50, width=64, num_classes=1000, with_pool=True):
super(ResNet, self).__init__()
layer_cfg = {18: [2, 2, 2, 2], 34: [3, 4, 6, 3], 50: [3, 4, 6, 3], 101: [3, 4, 23, 3], 152: [3, 8, 36, 3]}
layers = layer_cfg[depth]
self.groups = 1
self.base_width = width
self.num_classes = num_classes
self.with_pool = with_pool
self._norm_layer = nn.BatchNorm2D
self.inplanes = 64
self.dilation = 1
self.conv1 = nn.Conv2D(3, self.inplanes, kernel_size=7, stride=2, padding=3, bias_attr=False)
self.bn1 = self._norm_layer(self.inplanes)
self.relu = nn.ReLU()
self.maxpool = nn.MaxPool2D(kernel_size=3, stride=2, padding=1)
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
if with_pool:
self.avgpool = nn.AdaptiveAvgPool2D((1, 1))
if num_classes > 0:
self.fc = nn.Linear(512 * block.expansion, num_classes)
def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
norm_layer = self._norm_layer
downsample = None
previous_dilation = self.dilation
if dilate:
self.dilation *= stride
stride = 1
if stride != 1 or self.inplanes != planes * block.expansion:
downsample = nn.Sequential(
nn.Conv2D(self.inplanes, planes * block.expansion, 1, stride=stride, bias_attr=False),
norm_layer(planes * block.expansion),
)
layers = []
layers.append(
block(self.inplanes, planes, stride, downsample, self.groups, self.base_width, previous_dilation,
norm_layer))
self.inplanes = planes * block.expansion
for _ in range(1, blocks):
layers.append(
block(self.inplanes, planes, groups=self.groups, base_width=self.base_width, norm_layer=norm_layer))
return nn.Sequential(*layers)
def forward(self, x):
x = self.conv1(x)
x = self.bn1(x)
x = self.relu(x)
x = self.maxpool(x)
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
x = self.layer4(x)
if self.with_pool:
x = self.avgpool(x)
if self.num_classes > 0:
x = paddle.flatten(x, 1)
x = self.fc(x)
return x
def _resnet(arch, Block, depth, pretrained, **kwargs):
model = ResNet(Block, depth, **kwargs)
if pretrained:
assert arch in model_urls, "{} model does not have a pretrained model now, you should set pretrained=False".format(
arch)
weight_path = get_weights_path_from_url(model_urls[arch][0], model_urls[arch][1])
param = paddle.load(weight_path)
model.set_dict(param)
return model
def resnet18(pretrained=False, **kwargs):
"""ResNet 18-layer model from
`"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
import paddle
from paddle.vision.models import resnet18
# build model
model = resnet18()
# build model and load imagenet pretrained weight
# model = resnet18(pretrained=True)
x = paddle.rand([1, 3, 224, 224])
out = model(x)
print(out.shape)
"""
return _resnet('resnet18', BasicBlock, 18, pretrained, **kwargs)
def resnet34(pretrained=False, **kwargs):
"""ResNet 34-layer model from
`"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
import paddle
from paddle.vision.models import resnet34
# build model
model = resnet34()
# build model and load imagenet pretrained weight
# model = resnet34(pretrained=True)
x = paddle.rand([1, 3, 224, 224])
out = model(x)
print(out.shape)
"""
return _resnet('resnet34', BasicBlock, 34, pretrained, **kwargs)
def resnet50(pretrained=False, **kwargs):
"""ResNet 50-layer model from
`"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
import paddle
from paddle.vision.models import resnet50
# build model
model = resnet50()
# build model and load imagenet pretrained weight
# model = resnet50(pretrained=True)
x = paddle.rand([1, 3, 224, 224])
out = model(x)
print(out.shape)
"""
return _resnet('resnet50', BottleneckBlock, 50, pretrained, **kwargs)
def resnet101(pretrained=False, **kwargs):
"""ResNet 101-layer model from
`"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
import paddle
from paddle.vision.models import resnet101
# build model
model = resnet101()
# build model and load imagenet pretrained weight
# model = resnet101(pretrained=True)
x = paddle.rand([1, 3, 224, 224])
out = model(x)
print(out.shape)
"""
return _resnet('resnet101', BottleneckBlock, 101, pretrained, **kwargs)
def resnet152(pretrained=False, **kwargs):
"""ResNet 152-layer model from
`"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
import paddle
from paddle.vision.models import resnet152
# build model
model = resnet152()
# build model and load imagenet pretrained weight
# model = resnet152(pretrained=True)
x = paddle.rand([1, 3, 224, 224])
out = model(x)
print(out.shape)
"""
return _resnet('resnet152', BottleneckBlock, 152, pretrained, **kwargs)
def wide_resnet50_2(pretrained=False, **kwargs):
"""Wide ResNet-50-2 model from
`"Wide Residual Networks" <https://arxiv.org/pdf/1605.07146.pdf>`_.
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
import paddle
from paddle.vision.models import wide_resnet50_2
# build model
model = wide_resnet50_2()
# build model and load imagenet pretrained weight
# model = wide_resnet50_2(pretrained=True)
x = paddle.rand([1, 3, 224, 224])
out = model(x)
print(out.shape)
"""
kwargs['width'] = 64 * 2
return _resnet('wide_resnet50_2', BottleneckBlock, 50, pretrained, **kwargs)
def wide_resnet101_2(pretrained=False, **kwargs):
"""Wide ResNet-101-2 model from
`"Wide Residual Networks" <https://arxiv.org/pdf/1605.07146.pdf>`_.
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
Examples:
.. code-block:: python
import paddle
from paddle.vision.models import wide_resnet101_2
# build model
model = wide_resnet101_2()
# build model and load imagenet pretrained weight
# model = wide_resnet101_2(pretrained=True)
x = paddle.rand([1, 3, 224, 224])
out = model(x)
print(out.shape)
"""
kwargs['width'] = 64 * 2
return _resnet('wide_resnet101_2', BottleneckBlock, 101, pretrained, **kwargs)
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import unicodedata
import six
#import sentencepiece as sp
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
elif six.PY2:
if isinstance(text, str):
return text.decode("utf-8", "ignore")
elif isinstance(text, unicode):
return text
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
fin = open(vocab_file)
for num, line in enumerate(fin):
items = convert_to_unicode(line.strip()).split("\t")
if len(items) > 2:
break
token = items[0]
index = items[1] if len(items) == 2 else num
token = token.strip()
vocab[token] = int(index)
return vocab
def convert_by_vocab(vocab, items):
"""Converts a sequence of [tokens|ids] using the vocab."""
output = []
for item in items:
output.append(vocab[item])
return output
def convert_tokens_to_ids_include_unk(vocab, tokens, unk_token="[UNK]"):
output = []
for token in tokens:
if token in vocab:
output.append(vocab[token])
else:
output.append(vocab[unk_token])
return output
def convert_tokens_to_ids(vocab, tokens):
return convert_by_vocab(vocab, tokens)
def convert_ids_to_tokens(inv_vocab, ids):
return convert_by_vocab(inv_vocab, ids)
def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a peice of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
class FullTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class CharTokenizer(object):
"""Runs end-to-end tokenziation."""
def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.tokenizer = WordpieceTokenizer(vocab=self.vocab)
def tokenize(self, text):
split_tokens = []
for token in text.lower().split(" "):
for sub_token in self.tokenizer.tokenize(token):
split_tokens.append(sub_token)
return split_tokens
def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)
def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.
Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case
def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))
output_tokens = whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text):
"""Splits punctuation on a piece of text."""
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or (cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or _is_control(char):
continue
if _is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""
def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, text):
"""Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.
For example:
input = "unaffable"
output = ["un", "##aff", "##able"]
Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.
Returns:
A list of wordpiece tokens.
"""
text = convert_to_unicode(text)
output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue
is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end
if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
def _is_whitespace(char):
"""Checks whether `chars` is a whitespace character."""
# \t, \n, and \r are technically contorl characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
def _is_control(char):
"""Checks whether `chars` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat.startswith("C"):
return True
return False
def _is_punctuation(char):
"""Checks whether `chars` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
import json
import os
from typing import List
from typing import Union
import numpy as np
import paddle
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.transformers.clip_vision_transformer import ViT_base_patch16_224
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.transformers.clip_vision_transformer import ViT_base_patch32_224
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.transformers.clip_vision_transformer import ViT_large_patch14_224
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.transformers.efficientnet import EfficientNetB5
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.transformers.ernie2 import ErnieModel
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.transformers.multimodal import MultiModalModel
from disco_diffusion_ernievil_base.vit_b_16x.ernievil2.utils.tokenizer import FullTokenizer
__all__ = ['tokenize', 'build_model']
MODEL_NAMES = ['vit_b_16x']
MEAN, STD = (0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)
_tokenizer = FullTokenizer(vocab_file=os.path.join(os.path.dirname(__file__),
'../../packages/ernie_base_3.0/vocab.txt'),
do_lower_case=True)
def tokenize(texts: Union[str, List[str]], context_length: int = 64):
"""
Returns the tokenized representation of given input string(s)
Parameters
----------
texts : Union[str, List[str]]
An input string or a list of input strings to tokenize
context_length : int
The context length to use; this model defaults to 64
Returns
-------
A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length]
"""
if isinstance(texts, str):
texts = [texts]
all_tokens = []
for text in texts:
all_tokens.append([_tokenizer.vocab['[CLS]']] +
_tokenizer.convert_tokens_to_ids(_tokenizer.tokenize(text))[:context_length - 2] +
[_tokenizer.vocab['[SEP]']])
result = paddle.zeros([len(all_tokens), context_length], dtype='int64')
for i, tokens in enumerate(all_tokens):
assert len(tokens) <= context_length
result[i, :len(tokens)] = paddle.to_tensor(tokens)
return result
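# Example usage of tokenize() (a sketch; shapes assume the defaults above):
#   ids = tokenize(["孤舟蓑笠翁", "独钓寒江雪"])
#   ids.shape    # -> [2, 64]; each row is [CLS] + token ids + [SEP] + zero padding
#   ids = tokenize("小桥流水人家", context_length=48)
#   ids.shape    # -> [1, 48]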
def build_model(name='vit_b_16x'):
assert name in MODEL_NAMES, f"model name must be one of {MODEL_NAMES}"
name2model = {'vit_b_16x': build_vit_b_16x_model}
model = name2model[name]()
return model
def build_vit_b_16x_model():
# Define model
image_model = ViT_base_patch16_224()
with open(os.path.join(os.path.dirname(__file__),
'../../packages/ernie_base_3.0/ernie_config.base.json')) as json_file:
config_dict = json.load(json_file)
text_model = ErnieModel(config_dict)
model = MultiModalModel(image_model, text_model)
checkpoint = paddle.load(os.path.join(os.path.dirname(__file__), '../../pre_trained/vit_b_16x.pdparams'))
model.set_state_dict(checkpoint)
model.eval()
return model
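# Sketch of wiring the pieces together for text-image matching (the exact
# forward signature of MultiModalModel is an assumption here):
#   model = build_model('vit_b_16x')        # ViT-B/16 image tower + ERNIE text tower
#   text_ids = tokenize(["孤舟蓑笠翁,独钓寒江雪。"])
#   image = paddle.randn([1, 3, 224, 224])  # normalize real images with MEAN/STD above
#   # Both towers embed into a shared feature space, so cosine similarity
#   # between the two embeddings scores how well the image matches the text.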
# The frequency to save trained models when training.
save_step: 500
# The frequency to fetch and print output when training.
print_step: 10
# The directory for saving trained models
save_model: "checkpoints"
# The directory for saving the inference model
inference_model_dir: "infer_model"
# Set seed for CE or debug
random_seed: 1024
# The data type of input ids.
input_dtype: "int64"
# Device to use.
device: "gpu"
# TODO fix
#batch_size: 2000
batch_size: 100
infer_batch_size: 1500
shuffle_batch: False
# Data shuffle only works when sort_type is pool or none
shuffle: False
# shuffle_seed must be set when shuffle is True and multiple cards are used for training.
# Otherwise, the number of batches cannot be guaranteed.
shuffle_seed: 128
# The number of epochs for training
epoch: 50
#learning_rate: 0.00005
learning_rate: 0.00003
beta1: 0.9
beta2: 0.997
eps: 1e-9
# The parameters for learning rate scheduling.
warmup_steps: 1000
# Dropout rates.
dropout: 0.1
# Mixed precision training
use_amp: True
use_pure_fp16: False
scale_loss: 128.0
# Maximum number of iterations for training.
max_iter: None
do_train: True
max_text_seqlen: 48
vocab_file: "./packages/ernie_base_3.0/vocab.txt"
text_model_config: "./packages/ernie_base_3.0/ernie_config.base.json"
pad_token: 0
cls_token: 1
sep_token: 2
mask_token: 3
unk_token: 17963
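# A minimal sketch of consuming this config in Python (assumes PyYAML and a
# file named `config.yaml`; the training entry point itself is not shown here):
#   import yaml
#   with open('config.yaml') as f:
#       cfg = yaml.safe_load(f)
#   cfg['learning_rate']  # -> 3e-05
#   cfg['use_amp']        # -> True; note `max_iter: None` parses as the string 'None'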
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"max_position_embeddings": 2048,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"sent_type_vocab_size": 4,
"task_type_vocab_size": 3,
"vocab_size": 40000
}
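# The fields above describe an ERNIE 3.0 "base" text tower: hidden size 768,
# 12 layers, 12 attention heads, and a 40k-token vocabulary. A minimal sketch
# of inspecting it (mirrors how build_vit_b_16x_model above loads the file):
#   import json
#   with open('ernie_config.base.json') as f:
#       cfg = json.load(f)
#   assert cfg['hidden_size'] % cfg['num_attention_heads'] == 0  # 768 / 12 = 64 per head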